Abstract
Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for end users, allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
Original language | English |
---|---|
Title of host publication | Advances in Information Retrieval |
Subtitle of host publication | ECIR 2022 |
Editors | Matthias Hagen, Suzan Verberne, Craig MacDonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, Vinay Setty |
Pages | 265-269 |
Number of pages | 5 |
Publication status | Published - 5 Apr 2022 |
Externally published | Yes |
Publication series
Name | Lecture Notes in Computer Science |
---|---|
Publisher | Springer |
Volume | 13186 |
ISSN (Print) | 0302-9743 |
ISSN (Electronic) | 1611-3349 |
Keywords
- digitisation
- optical character recognition software
- historical
- newspaper history
Fingerprint
Dive into the research topics of 'DuoSearch: A Novel Search Engine for Bulgarian Historical Documents'. Together they form a unique fingerprint.
Projects
- 1 Finished
- DISTILL: DISruptive Technologies in Innovation Labs for digital cultural heritage
  Dobreva, M. (Principal Investigator)
  18/01/21 → 31/12/22
  Project: Projects from Previous Employment