Abstract
Identifying child-appropriate web content is an important
yet difficult classification task. This novel task involves determining
age/child appropriateness (which is not necessarily
topic-based) despite unbalanced class sizes and a
lack of quality training data with human judgements of appropriateness.
Classification of feeds, a subset of web content, presents further challenges
due to their temporal nature and short document format. In this
paper, we discuss these challenges and present baseline results for this
task through an empirical study that classifies incoming news stories as
appropriate (or not) for children. We show that while the naïve Bayes
approach produces a higher AUC, it is vulnerable to the imbalanced data
problem, and that a support vector machine provides a more robust overall
solution. Our research shows that classifying children’s content is a
non-trivial task that has greater complexities than standard text-based
classification. While the F-score values are consistent with other research
examining age-appropriate text classification, we introduce a new problem
with a new dataset.
Original language | English |
---|---|
Title of host publication | Proceedings of the 34th European Conference on Advances in Information Retrieval |
Place of Publication | Berlin, Heidelberg |
Publisher | Springer-Verlag |
Pages | 63-72 |
Number of pages | 10 |
ISBN (Print) | 978-3-642-28996-5 |
Publication status | Published - 2012 |
Publication series
Name | ECIR'12 |
---|---|
Publisher | Springer-Verlag |
Keywords
- children
- news feeds
- web content
- classification