DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

Angel Beshirov, Suzan Hadzhieva, Ivan Koychev, Milena Dobreva

Research output: Chapter in Book/Report/Conference proceeding › Chapter

1 Citation (Scopus)

Abstract

Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present DuoSearch, a new search engine for historical documents that combines ElasticSearch with machine learning methods based on deep neural networks to address this problem. It was tested on a collection of Bulgarian historical newspapers from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface that lets end users enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
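
The idea of entering a modern-spelling query and matching historical spellings can be illustrated with a small sketch. It is not the DuoSearch implementation: the abstract does not describe the method in detail, and the sketch covers neither the OCR-correction side nor any neural model. The function names (normalize_historical, build_index, search) and the folding rules are hypothetical. The rules assume only well-known features of pre-1945 Bulgarian orthography: a silent "ъ" or "ь" written after a word-final consonant, and the letters "ѫ" and "ѣ", which the 1945 reform removed.

```python
import re
from collections import defaultdict

# Illustrative folding rules only; they are NOT taken from DuoSearch.
CONSONANTS = "бвгджзйклмнпрстфхцчшщ"
# A silent "ъ" or "ь" written after a word-final consonant in the old spelling.
FINAL_ER = re.compile(rf"(?<=[{CONSONANTS}])[ъь]$")


def normalize_historical(token: str) -> str:
    """Fold a single token towards modern spelling (lossy, hypothetical rules)."""
    token = token.lower()
    token = FINAL_ER.sub("", token)      # градъ -> град
    token = token.replace("ѫ", "ъ")      # пѫт -> път
    # Simplification: "ѣ" corresponds to "е" or "я" depending on the word;
    # this sketch always picks "е", which is wrong for some words.
    token = token.replace("ѣ", "е")      # вѣк -> век
    return token


def build_index(tokens):
    """Map each folded form to the set of original spellings seen in the text."""
    index = defaultdict(set)
    for token in tokens:
        index[normalize_historical(token)].add(token)
    return index


def search(index, modern_query: str):
    """Return every original spelling whose folded form matches the query."""
    return sorted(index.get(normalize_historical(modern_query), set()))


if __name__ == "__main__":
    index = build_index(["градъ", "град", "пѫть", "вѣкъ"])
    print(search(index, "град"))
    print(search(index, "път"))
```

Applying the same folding to both the indexed text and the query keeps the two sides comparable. A real system would need a more careful treatment of "ѣ" and of OCR noise than this sketch attempts.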

Original language: English
Title of host publication: Advances in Information Retrieval
Subtitle of host publication: ECIR 2022
Editors: Matthias Hagen, Suzan Verberne, Craig MacDonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, Vinay Setty
Pages: 265-269
Number of pages: 5
DOIs
Publication status: Published - 5 Apr 2022
Externally published: Yes

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 13186
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • digitisation
  • optical character recognition software
  • historical
  • newspaper history
