Revisiting the relationship between document length and relevance

David E. Losada, Leif Azzopardi, Mark Baillie

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution book

6 Citations (Scopus)

Abstract

The scope hypothesis in Information Retrieval (IR) states that a relationship exists between document length and relevance, such that the likelihood of relevance increases with document length. A number of empirical studies have provided statistical evidence supporting the scope hypothesis. However, these studies make the implicit assumption that modern test collections are complete (i.e. all documents are assessed for relevance). As a consequence the observed evidence is misleading. In this paper we perform a deeper analysis of document length and relevance taking into account that test collections are incomplete. We first demonstrate that previous evidence supporting the scope hypothesis was an artefact of the test collection, where there is a bias towards longer documents in the pooling process. We evaluate whether this length bias affects system comparison when using incomplete test collections. The results indicate that test collections are problematic when considering MAP as a measure of effectiveness but are relatively robust when using bpref. The implications of the study indicate that retrieval models should not be tuned to favour longer documents, and that designers of new test collections should take measures against length bias during the pooling process in order to create more reliable and robust test collections.
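The contrast the abstract draws between MAP and bpref can be illustrated with a minimal sketch of how each measure treats unjudged documents. This is not the paper's experimental setup. The function names, the toy judgments (`qrels`) and the document identifiers are hypothetical. The bpref variant shown, which divides by min(R, N), follows the form commonly used in evaluation tools and is an assumption here.

```python
from typing import Dict, List


def average_precision(ranking: List[str], qrels: Dict[str, int]) -> float:
    """Average precision; documents missing from qrels count as non-relevant."""
    num_rel = sum(1 for judgment in qrels.values() if judgment > 0)
    if num_rel == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if qrels.get(doc, 0) > 0:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / num_rel


def bpref(ranking: List[str], qrels: Dict[str, int]) -> float:
    """bpref; documents missing from qrels are skipped entirely."""
    num_rel = sum(1 for judgment in qrels.values() if judgment > 0)
    num_nonrel = sum(1 for judgment in qrels.values() if judgment == 0)
    if num_rel == 0:
        return 0.0
    nonrel_seen = 0  # judged non-relevant documents ranked so far
    total = 0.0
    for doc in ranking:
        if doc not in qrels:
            continue  # unjudged: ignored by bpref
        if qrels[doc] > 0:
            if nonrel_seen == 0:
                total += 1.0
            else:
                penalty = min(nonrel_seen, num_rel) / min(num_rel, num_nonrel)
                total += 1.0 - penalty
        else:
            nonrel_seen += 1
    return total / num_rel


# Hypothetical judgments: 1 = relevant, 0 = judged non-relevant.
qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}

# The same judged ordering, without and with unjudged documents (u1, u2).
judged_only = ["d1", "d2", "d3", "d4"]
with_unjudged = ["d1", "u1", "u2", "d2", "d3", "d4"]

ap_judged = average_precision(judged_only, qrels)
ap_unjudged = average_precision(with_unjudged, qrels)
bpref_judged = bpref(judged_only, qrels)
bpref_unjudged = bpref(with_unjudged, qrels)

print("AP   :", ap_judged, ap_unjudged)
print("bpref:", bpref_judged, bpref_unjudged)
```

In this toy case the unjudged documents lower average precision while bpref is unchanged, because bpref only compares judged relevant documents against judged non-relevant ones.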
Language: English
Title of host publication: CIKM '08 Proceedings of the 17th ACM Conference on Information and Knowledge Management
Place of Publication: New York, NY, USA
Pages: 419-428
Number of pages: 10
DOI: https://doi.org/10.1145/1458082.1458139
Publication status: Published - 26 Oct 2008

Keywords

  • document length
  • relevance
  • pooling
  • information retrieval

Cite this

Losada, D. E., Azzopardi, L., & Baillie, M. (2008). Revisiting the relationship between document length and relevance. In CIKM '08 Proceedings of the 17th ACM Conference on Information and Knowledge Management (pp. 419-428). New York, NY, USA. https://doi.org/10.1145/1458082.1458139
@inproceedings{3ca65c839df34c5791837ef0ec95d00b,
title = "Revisiting the relationship between document length and relevance",
keywords = "document length, relevance, pooling, information retrieval",
author = "Losada, {David E.} and Leif Azzopardi and Mark Baillie",
year = "2008",
month = "10",
day = "26",
doi = "10.1145/1458082.1458139",
language = "English",
isbn = "978-1-59593-991-3",
pages = "419--428",
booktitle = "CIKM '08 Proceedings of the 17th ACM Conference on Information and Knowledge Management",

}
