Abstract
Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface that allows end users to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
| Original language | English |
|---|---|
| Title of host publication | Advances in Information Retrieval |
| Subtitle of host publication | ECIR 2022 |
| Editors | Matthias Hagen, Suzan Verberne, Craig MacDonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, Vinay Setty |
| Pages | 265-269 |
| Number of pages | 5 |
| Publication status | Published - 5 Apr 2022 |
| Externally published | Yes |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Publisher | Springer |
| Volume | 13186 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Keywords
- digitisation
- optical character recognition software
- historical
- newspaper history
Fingerprint
Dive into the research topics of 'DuoSearch: A Novel Search Engine for Bulgarian Historical Documents'. Together they form a unique fingerprint.
Projects
- 1 Finished
- DISTILL: DISruptive Technologies in Innovation Labs for digital cultural heritage
  Dobreva, M. (Principal Investigator)
  18/01/21 → 31/12/22
  Project: Projects from Previous Employment