Improving access to large patent corpora

Richard Bache, Leif Azzopardi

Research output: Chapter in Book/Report/Conference proceeding > Other chapter contribution


Retrievability is a measure of access that quantifies how easily documents can be found using a retrieval system. Such a measure is of particular interest within the patent domain, because if a retrieval system makes some patents hard to find, then patent searchers will have a difficult time retrieving these patents. This may mean that a patent searcher could miss important and relevant patents because of the retrieval system. In this paper, we describe measures of retrievability and how they can be applied to measure the overall access to a collection given a retrieval system. We then identify three features of best-match retrieval models that are hypothesized to lead to an improvement in access to all documents in the collection: sensitivity to term frequency, length normalization and convexity. Since patent searchers tend to favour Boolean models over best-match models, hybrid retrieval models are proposed that incorporate these features while preserving the desirable aspects of the traditional Boolean model. An empirical study conducted on four large patent corpora demonstrates that these hybrid models provide better access to the corpus of patents than the traditional Boolean model.
Original language: English
Title of host publication: Transactions on Large-Scale Data- and Knowledge-Centered Systems II
Editors: Abdelkader Hameurlain, Josef Küng, Roland Wagner
Place of Publication: Berlin, Heidelberg
Number of pages: 19
ISBN (Print): 978-3-642-16174-2
Publication status: Published - 2010
Externally published: Yes

Publication series

Name: Transactions on Large-Scale Data- and Knowledge-Centered Systems
Publisher: Springer Verlag


  • patent searching
  • information retrieval
  • Boolean searches
