Automated extraction of data from text using an XML parser: an earth science example using fossil descriptions

G.B. Curry, R.C.H. Connor

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time-consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelmingly available only in printed volumes and journals, some of which are increasingly rare and inaccessible. The highly structured nature of taxonomic procedures and nomenclature means that many previously published data remain equally valid to the present day, and contain information that is currently not available on the World Wide Web; these data would be of great use to a wide variety of scientists and other end users in government, industry, academia and the general public. This paper describes an XML (Extensible Markup Language) parsing technique that allows taxonomic descriptions to be fully digitized much more rapidly than would be possible by manual entry of the data into a database. The technique exploits the high degree of structure in taxonomic descriptions, which are written in a standardized format, to automate the process of tagging separate sections of the text. Once tagged using XML, the data can be subjected to complex searches using queries written in any of the XML query standards. The XML-tagged data can potentially be imported into existing databases, in effect removing the necessity to manually enter the information, and hence overcoming the main bottleneck in generating digital data from printed material. Individual parsers can be tailored precisely to the nature of the text being analyzed, and once the underlying concepts and procedures are understood, those interested in acquiring and using digital data will be able to generate XML parsers dedicated to text with different styles of standardized formatting.
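The tagging idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' parser: the section labels (`Diagnosis`, `Description`, `Occurrence`), the element names, the sample text, and the helper functions `tag_description` and `find_taxa_with` are all assumptions made for illustration. The sketch assumes each section of a description starts on its own line with a fixed label followed by a full stop, wraps each section in an XML element, and then runs a simple keyword search over the tagged result.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical section labels; real monographs follow their own conventions.
SECTION_LABELS = ("Diagnosis", "Description", "Occurrence")

# Invented sample text, for illustration only (not taken from the paper).
SAMPLE = (
    "Examplegenus primus\n"
    "Diagnosis. Small, biconvex shell with a short hinge line.\n"
    "Description. Shell outline subcircular; surface smooth.\n"
    "Occurrence. Lower beds, hypothetical locality A.\n"
)


def tag_description(text: str) -> ET.Element:
    """Wrap each labelled section of one description in an XML element."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    taxon = ET.Element("taxon")
    # Assumption: the first non-empty line is the taxon name.
    ET.SubElement(taxon, "name").text = lines[0]
    pattern = re.compile(r"^(%s)\.\s*(.*)$" % "|".join(SECTION_LABELS))
    current = None
    for line in lines[1:]:
        match = pattern.match(line)
        if match:
            # A recognised label opens a new element named after it.
            current = ET.SubElement(taxon, match.group(1).lower())
            current.text = match.group(2)
        elif current is not None:
            # Unlabelled lines continue the section that is currently open.
            current.text += " " + line
    return taxon


def find_taxa_with(root: ET.Element, section: str, keyword: str) -> list:
    """Return names of taxa whose given section mentions the keyword."""
    return [
        taxon.findtext("name")
        for taxon in root.iter("taxon")
        if keyword.lower() in (taxon.findtext(section) or "").lower()
    ]


if __name__ == "__main__":
    root = ET.Element("taxa")
    root.append(tag_description(SAMPLE))
    print(ET.tostring(root, encoding="unicode"))
    print(find_taxa_with(root, "diagnosis", "biconvex"))
```

The keyword search here stands in for the XML query standards mentioned in the abstract; a dedicated query language could be applied to the same tagged output. A parser for a differently formatted source would mainly need a different label list and line pattern.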
Language: English
Pages: 159-169
Number of pages: 10
Journal: Geosphere
Volume: 4
Issue number: 1
DOIs: 10.1130/GES00140.1
Publication status: Published - Jan 2008

Keywords

  • geoinformatics
  • data acquisition
  • XML parsing
  • taxonomy
  • databases

Cite this

@article{902298b7cb0d4b58afe05fae829282d7,
title = "Automated extraction of data from text using an {XML} parser: an earth science example using fossil descriptions",
abstract = "Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time-consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelmingly available only in printed volumes and journals, some of which are increasingly rare and inaccessible. The highly structured nature of taxonomic procedures and nomenclature means that many previously published data remain equally valid to the present day, and contain information that is currently not available on the World Wide Web; these data would be of great use to a wide variety of scientists and other end users in government, industry, academia and the general public. This paper describes an XML (Extensible Markup Language) parsing technique that allows taxonomic descriptions to be fully digitized much more rapidly than would be possible by manual entry of the data into a database. The technique exploits the high degree of structure in taxonomic descriptions, which are written in a standardized format, to automate the process of tagging separate sections of the text. Once tagged using XML, the data can be subjected to complex searches using queries written in any of the XML query standards. The XML-tagged data can potentially be imported into existing databases, in effect removing the necessity to manually enter the information, and hence overcoming the main bottleneck in generating digital data from printed material. Individual parsers can be tailored precisely to the nature of the text being analyzed, and once the underlying concepts and procedures are understood, those interested in acquiring and using digital data will be able to generate XML parsers dedicated to text with different styles of standardized formatting.",
keywords = "geoinformatics, data acquisition, XML parsing, taxonomy, databases",
author = "G.B. Curry and R.C.H. Connor",
year = "2008",
month = "1",
doi = "10.1130/GES00140.1",
language = "English",
volume = "4",
pages = "159--169",
journal = "Geosphere",
issn = "1553-040X",
number = "1",

}

Automated extraction of data from text using an XML parser: an earth science example using fossil descriptions. / Curry, G.B.; Connor, R.C.H.

In: Geosphere, Vol. 4, No. 1, 01.2008, p. 159-169.


TY - JOUR

T1 - Automated extraction of data from text using an XML parser: an earth science example using fossil descriptions

AU - Curry, G.B.

AU - Connor, R.C.H.

PY - 2008/1

Y1 - 2008/1

AB - Many valuable earth science data are not available in a digital format. Manual entry of such information into databases is time-consuming, unrewarding, and prone to introducing errors. Taxonomic descriptions of fossils are a good example of valuable data that are overwhelmingly available only in printed volumes and journals, some of which are increasingly rare and inaccessible. The highly structured nature of taxonomic procedures and nomenclature means that many previously published data remain equally valid to the present day, and contain information that is currently not available on the World Wide Web; these data would be of great use to a wide variety of scientists and other end users in government, industry, academia and the general public. This paper describes an XML (Extensible Markup Language) parsing technique that allows taxonomic descriptions to be fully digitized much more rapidly than would be possible by manual entry of the data into a database. The technique exploits the high degree of structure in taxonomic descriptions, which are written in a standardized format, to automate the process of tagging separate sections of the text. Once tagged using XML, the data can be subjected to complex searches using queries written in any of the XML query standards. The XML-tagged data can potentially be imported into existing databases, in effect removing the necessity to manually enter the information, and hence overcoming the main bottleneck in generating digital data from printed material. Individual parsers can be tailored precisely to the nature of the text being analyzed, and once the underlying concepts and procedures are understood, those interested in acquiring and using digital data will be able to generate XML parsers dedicated to text with different styles of standardized formatting.

KW - geoinformatics

KW - data acquisition

KW - XML parsing

KW - taxonomy

KW - databases

UR - http://dx.doi.org/10.1130/GES00140.1

U2 - 10.1130/GES00140.1

DO - 10.1130/GES00140.1

M3 - Article

VL - 4

SP - 159

EP - 169

JO - Geosphere

T2 - Geosphere

JF - Geosphere

SN - 1553-040X

IS - 1

ER -