Assessing multivariate Bernoulli models for information retrieval

David E. Losada, Leif Azzopardi

Research output: Contribution to journal › Article

20 Citations (Scopus)

Abstract

Although the seminal proposal to introduce language modeling in information retrieval was based on a multivariate Bernoulli model, the predominant modeling approach is now centered on multinomial models. Language modeling for retrieval based on multivariate Bernoulli distributions is seen as inefficient and is believed to be less effective than the multinomial model. In this article, we examine the multivariate Bernoulli model with respect to its successor and examine its role in future retrieval systems. In the context of Bayesian learning, these two modeling approaches are described, contrasted, and compared both theoretically and computationally. We show that the query likelihood following a multivariate Bernoulli distribution introduces interesting retrieval features which may be useful for specific retrieval tasks such as sentence retrieval. Then, we address the efficiency aspect and show that algorithms can be designed to perform retrieval efficiently for multivariate Bernoulli models, before performing an empirical comparison to study the behavioral aspects of the models. A series of comparisons is then conducted on a number of test collections and retrieval tasks to determine the empirical and practical differences between the different models. Our results indicate that for sentence retrieval the multivariate Bernoulli model can significantly outperform the multinomial model. However, for the other tasks the multinomial model provides consistently better performance (and in most cases significantly so). An analysis of the various retrieval characteristics reveals that the multivariate Bernoulli model tends to promote long documents whose nonquery terms are informative. While this is detrimental to the task of document retrieval (documents tend to contain considerable nonquery content), it is valuable for other tasks such as sentence retrieval, where the retrieved elements are very short and focused.
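The contrast between the two query-likelihood models can be illustrated with a minimal sketch. It is not taken from the article: it uses the generic textbook forms of the two likelihoods, and the function names, the toy vocabulary, and all probability values are hypothetical. The document models are assumed to be already smoothed, so every probability lies strictly between 0 and 1.

```python
import math


def multinomial_log_likelihood(query_terms, term_probs):
    """log P(q|d) under a multinomial model.

    Each query token contributes log p(t|d), so repeated query terms
    count more than once and terms absent from the query are ignored.
    term_probs is assumed to be a smoothed distribution over the vocabulary.
    """
    return sum(math.log(term_probs[t]) for t in query_terms)


def bernoulli_log_likelihood(query_terms, presence_probs):
    """log P(q|d) under a multivariate Bernoulli model.

    Every vocabulary term contributes exactly once: log p(t|d) if the term
    appears in the query, log(1 - p(t|d)) otherwise. presence_probs holds
    one independent (smoothed) presence probability per vocabulary term.
    """
    present = set(query_terms)
    total = 0.0
    for term, p in presence_probs.items():
        total += math.log(p) if term in present else math.log(1.0 - p)
    return total


# Hypothetical toy models over the vocabulary {"a", "b", "c"}.
# Both Bernoulli models agree on the query term "a" and differ only
# on the nonquery terms "b" and "c".
focused_doc = {"a": 0.5, "b": 0.1, "c": 0.1}
broad_doc = {"a": 0.5, "b": 0.6, "c": 0.6}
multinomial_doc = {"a": 0.5, "b": 0.25, "c": 0.25}

query = ["a"]
focused_score = bernoulli_log_likelihood(query, focused_doc)
broad_score = bernoulli_log_likelihood(query, broad_doc)
repeated_score = multinomial_log_likelihood(["a", "a"], multinomial_doc)
```

In this toy setting the two Bernoulli scores differ even though both models assign the same probability to the query term, because the nonquery terms also enter the product. The multinomial score depends only on the query tokens and counts the repeated term twice.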
Original language: English
Article number: 17
Number of pages: 46
Journal: ACM Transactions on Information Systems
Volume: 26
Issue number: 3
Publication status: Published - 1 Jun 2008
Externally published: Yes

Keywords

  • information retrieval
  • multivariate Bernoulli
  • multinomial
  • language models
