Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints

Sándor Darányi, Peter Wittek, Milena Dobreva

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by means of standard test collections. In this article, we present a pilot experiment that replaces such test collections with a set of 6,000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and tests support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, whereas traditional kernels performed better on full-text documents, the latter outcome being due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of persistent archives, and it is less studied than the representation of digital objects and collections. Our research is an initial step in this direction, developing the methodological approach further and demonstrating that text categorization can be applied to analyse the thematic coverage of digital repositories.
Original language: English
Pages (from-to): 3-12
Number of pages: 10
Journal: International Journal on Digital Libraries
Volume: 12
Issue number: 1
Early online date: 27 Jan 2012
Publication status: Published - 1 Jul 2012

Keywords

  • digital libraries
  • text categorization
  • machine learning
  • support vector machines
  • analogical information representation
  • wavelet analysis
