A multi-collection latent topic model for federated search

M. Baillie, M. Carman, F. Crestani

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection, adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low-dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information for each collection, such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state-of-the-art collection selection algorithms.
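
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes the topic-word distributions and the per-collection topic mixtures have already been estimated (the hierarchical hyperprior and the inference procedure are not reproduced here), and it scores each collection by the log-likelihood of the query under that collection's topic-smoothed language model. The function name `rank_collections`, the two toy topics, the collection names, and all probabilities are hypothetical values chosen for illustration.

```python
import math

def rank_collections(query_terms, topic_word, collection_topics):
    """Rank collections by query log-likelihood under a topic mixture.

    topic_word: list of dicts, one per topic, mapping term -> P(term | topic).
    collection_topics: dict mapping collection name -> list of P(topic | collection).
    Returns (names sorted best-first, dict of log-likelihood scores).
    """
    scores = {}
    for name, theta in collection_topics.items():
        log_lik = 0.0
        for term in query_terms:
            # P(term | collection) = sum_k P(topic k | collection) * P(term | topic k)
            p = sum(theta[k] * topic_word[k].get(term, 0.0)
                    for k in range(len(theta)))
            log_lik += math.log(p) if p > 0 else float("-inf")
        scores[name] = log_lik
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Hypothetical toy model: topic 0 is about retrieval, topic 1 about biology.
topic_word = [
    {"retrieval": 0.5, "query": 0.3, "index": 0.2},
    {"protein": 0.6, "gene": 0.3, "query": 0.1},
]
# Hypothetical topic mixtures estimated from small samples of each collection.
collection_topics = {
    "cs_papers": [0.9, 0.1],
    "bio_papers": [0.2, 0.8],
}

ranking, scores = rank_collections(["retrieval", "query"], topic_word, collection_topics)
print(ranking)  # cs_papers is ranked ahead of bio_papers for this query
```

In this toy setup a term such as "index" still receives non-zero probability in `bio_papers` through the shared retrieval topic, even if it never appeared in that collection's sample. This is the kind of implicit smoothing over missing vocabulary that the abstract describes.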

Original language: English
Pages (from-to): 390-412
Number of pages: 23
Journal: Information Retrieval
Volume: 14
DOIs: 10.1007/s10791-010-9147-3
Publication status: Published - 2011

Keywords

  • collection selection
  • information retrieval
  • databases
  • distributed information retrieval
  • topic models

Cite this

Baillie, M. ; Carman, M. ; Crestani, F. / A multi-collection latent topic model for federated search. In: Information Retrieval. 2011 ; Vol. 14. pp. 390-412.
@article{ef11a9bf9a18440283ff615a4828e700,
title = "A multi-collection latent topic model for federated search",
abstract = "Collection selection is a crucial function, central to the effectiveness and efficiency of a federated information retrieval system. A variety of solutions have been proposed for collection selection, adapting proven techniques used in centralised retrieval. This paper defines a new approach to collection selection that models the topical distribution in each collection. We describe an extended version of latent Dirichlet allocation that uses a hierarchical hyperprior to enable the different topical distributions found in each collection to be modelled. Under the model, resources are ranked based on the topical relationship between query and collection. By modelling collections in a low-dimensional topic space, we can implicitly smooth their term-based characterisation with appropriate terms from topically related samples, thereby dealing with the problem of missing vocabulary within the samples. An important advantage of adopting this hierarchical model over current approaches is that the model generalises well to unseen documents given small samples of each collection. The latent structure of each collection can therefore be estimated well despite imperfect information for each collection, such as sampled documents obtained through query-based sampling. Experiments demonstrate that this new, fully integrated topical model is more robust than current state-of-the-art collection selection algorithms.",
keywords = "collection selection, information retrieval, databases, distributed information retrieval, topic models",
author = "M. Baillie and M. Carman and F. Crestani",
year = "2011",
doi = "10.1007/s10791-010-9147-3",
language = "English",
volume = "14",
pages = "390--412",
journal = "Information Retrieval Journal",
issn = "1386-4564",
publisher = "Springer Netherlands"
}
