Template mining for information extraction from digital documents

Research output: Contribution to journal › Article › peer-review

20 Citations (Scopus)

Abstract

With the rapid growth of digital information resources, information extraction (IE), the process of automatically extracting information from natural language texts, is becoming more important. A number of IE systems, particularly in news/fact retrieval and in domain-specific areas such as chemical and patent information retrieval, have been developed in the recent past using the template mining approach, which involves a natural language processing (NLP) technique that extracts data directly from text when the data, the text surrounding the data, or both form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template. This article briefly reviews template mining research. It also shows how templates are used in Web search engines, such as Alta Vista, and in meta-search engines, such as Ask Jeeves, to help end-users generate natural language search expressions. Some potential areas of application of template mining for extracting different kinds of information from digital documents are highlighted, and how such applications are used is indicated. It is suggested that, in order to facilitate template mining, standardization in the presentation and layout of information within digital documents has to be ensured, and that this can be done by generating various templates that authors can easily download and use while preparing digital documents.
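The matching step described in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: it assumes that a template can be modelled as a regular expression paired with an extraction function, and the names `Template`, `TEMPLATES`, and `mine` are hypothetical, chosen only for this illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Template:
    """Hypothetical template: a pattern plus the instructions for a match."""
    name: str
    pattern: re.Pattern
    action: Callable[[re.Match], dict]


def citation_action(m: re.Match) -> dict:
    # Instructions for the citation template: convert numeric fields to int.
    return {
        "journal": m["journal"],
        "volume": int(m["volume"]),
        "issue": int(m["issue"]),
        "pages": (int(m["first"]), int(m["last"])),
    }


def date_action(m: re.Match) -> dict:
    # Instructions for the date template: keep the month as text.
    return {"day": int(m["day"]), "month": m["month"], "year": int(m["year"])}


# Assumed patterns; real templates would be tuned to the document collection.
TEMPLATES = [
    Template(
        "citation",
        re.compile(
            r"(?P<journal>[A-Z][A-Za-z ]+), (?P<volume>\d+)\((?P<issue>\d+)\), "
            r"(?P<first>\d+)-(?P<last>\d+)"
        ),
        citation_action,
    ),
    Template(
        "publication_date",
        re.compile(r"Published - (?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2}) (?P<year>\d{4})"),
        date_action,
    ),
]


def mine(text: str) -> list:
    """Apply every template to the text and collect the extracted records."""
    records = []
    for template in TEMPLATES:
        for match in template.pattern.finditer(text):
            record = {"template": template.name}
            record.update(template.action(match))
            records.append(record)
    return records


if __name__ == "__main__":
    sample = "Library Trends, 48(1), 182-208. Published - 30 Jun 1999"
    for record in mine(sample):
        print(record)
```

Keeping the pattern and its instructions together in one object means a new kind of information can be supported by adding a template, without changing the matching loop. Patterns of this kind only work when documents present the information in a predictable layout, which is the reason the abstract argues for standardization.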

Original language: English
Pages (from-to): 182-208
Number of pages: 27
Journal: Library Trends
Volume: 48
Issue number: 1
Publication status: Published - 30 Jun 1999

Keywords

  • template mining
  • information extraction
  • digital documents
  • natural language processing
