Abstract
Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out on standard test collections. In this article, we present a pilot experiment in which we replace such test collections with a set of 6,000 objects from a real-world digital repository, indexed with Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: representing document content as functions in Hilbert space, and enriching this representation with extra semantics from lexical resources. Results suggest that wavelet-based kernels slightly outperformed traditional kernels when reconstructing the classification from abstracts, while the reverse held for full-text documents, a difference we attribute to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case their thematic coverage. Representing such specific knowledge about digital collections is one of the basic elements of persistent archives, and the least studied one compared with representations of digital objects and collections. Our research is an initial step in this direction: it develops the methodological approach further and demonstrates that text categorization can be applied to analyse the thematic coverage of digital repositories.
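To make the setup concrete, the sketch below shows one way a supervised SVM could be trained to reproduce an existing subject classification from document text using a wavelet-style kernel in place of a standard linear or RBF kernel. This is not the authors' implementation: the toy corpus, labels, and the particular Morlet-style product kernel are illustrative assumptions only.

```python
# Minimal illustrative sketch (assumptions, not the paper's method): an SVM with a
# callable wavelet-style kernel trained to reproduce existing subject labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def wavelet_kernel(X, Y, a=1.0):
    """Gram matrix K[i, j] = prod_d h((X[i, d] - Y[j, d]) / a), where
    h(t) = cos(1.75 t) * exp(-t^2 / 2) is a Morlet-style mother wavelet.
    This specific kernel form is an assumption chosen for illustration."""
    diff = (X[:, None, :] - Y[None, :, :]) / a          # pairwise feature differences
    h = np.cos(1.75 * diff) * np.exp(-0.5 * diff ** 2)  # mother wavelet, elementwise
    return np.prod(h, axis=2)                           # product over feature dimensions

# Toy records standing in for repository objects with existing subject headings.
docs = [
    "wavelet analysis of signals",
    "library subject indexing practice",
    "support vector machines for text categorization",
    "digital repository metadata curation",
]
labels = ["mathematics", "libraries", "machine learning", "libraries"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()  # dense vectors for the custom kernel

clf = SVC(kernel=wavelet_kernel)       # SVC accepts a callable returning the Gram matrix
clf.fit(X, labels)

query = vec.transform(["kernel methods for subject classification"]).toarray()
print(clf.predict(query))              # predicted subject label for the new record
```

In practice, one would compare such a kernel against standard kernels on the same split of abstracts and full texts, which mirrors the comparison reported in the article.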
| Original language | English |
| --- | --- |
| Pages (from-to) | 3-12 |
| Number of pages | 10 |
| Journal | International Journal on Digital Libraries |
| Volume | 12 |
| Issue number | 1 |
| Early online date | 27 Jan 2012 |
| DOIs | |
| Publication status | Published - 1 Jul 2012 |
Keywords
- digital libraries
- text categorization
- machine learning
- support vector machines
- analogical information representation
- wavelet analysis