Classifying suspicious content using frequency analysis

Obika Gellineau, George R. S. Weir

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution book

Abstract

This paper details an experiment exploring the use of two statistical similarity measures, chi by degrees of freedom (CBDF) and log-likelihood, applied to single-word and bigram frequencies, as a means of discriminating subject content in order to classify samples of chat text as dangerous, suspicious or innocent. The control for these comparisons was a set of manually ranked sample texts that were rated in terms of eleven subject categories (five considered dangerous and six considered harmless). The results of this manual rating were then compared with the ranked lists generated using the CBDF and log-likelihood measures, for both word and bigram frequencies. The comparison was carried out by combining existing textual analysis tools with a newly implemented software application. Our results show that the CBDF method using word frequencies gave the discrimination closest to the human-rated samples.
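
The record does not reproduce the paper's formulas or its software, so the following is only a minimal sketch of how the two measures named in the abstract are commonly computed over word or bigram frequency tables. It assumes CBDF is the chi-squared statistic of the two-column frequency table divided by its degrees of freedom (vocabulary size minus one) and that log-likelihood is the usual G2 statistic. The function names, the whitespace tokenizer and the idea of ranking category reference texts by score are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def ngram_counts(text, n=1):
    """Count lowercase whitespace-delimited n-grams (n=1 words, n=2 bigrams)."""
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))


def similarity_scores(sample, reference):
    """Return (chi-squared / degrees of freedom, log-likelihood G2).

    Both arguments are Counters of n-gram frequencies. Lower scores mean
    the two frequency profiles are more alike. Assumption: the table spans
    the full joint vocabulary, so dof = vocabulary size - 1.
    """
    n_s, n_r = sum(sample.values()), sum(reference.values())
    if n_s == 0 or n_r == 0:
        raise ValueError("both texts must contain at least one n-gram")
    vocab = set(sample) | set(reference)
    total = n_s + n_r
    chi2 = 0.0
    g2 = 0.0
    for term in vocab:
        row = sample[term] + reference[term]
        for observed, col_total in ((sample[term], n_s), (reference[term], n_r)):
            expected = row * col_total / total
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero count contributes nothing to G2
                g2 += 2.0 * observed * math.log(observed / expected)
    dof = max(len(vocab) - 1, 1)
    return chi2 / dof, g2


def rank_categories(sample_text, category_texts, n=1, use_cbdf=True):
    """Order hypothetical category names from most to least similar."""
    sample = ngram_counts(sample_text, n)
    index = 0 if use_cbdf else 1
    scores = {
        name: similarity_scores(sample, ngram_counts(text, n))[index]
        for name, text in category_texts.items()
    }
    return sorted(scores, key=scores.get)
```

Calling `rank_categories(sample, categories, n=2)` would rank on bigram frequencies instead of single words, and `use_cbdf=False` switches the ranking to the log-likelihood score, which mirrors the four combinations the abstract compares against the manual ranking.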
Original language: English
Title of host publication: Corpora and Language Technologies in Teaching, Learning and Research
Editors: George Weir, S. Ishikawa, K. Poonpol
Number of pages: 8
Publication status: Published - 2011

Keywords

  • CBDF
  • chi by degrees of freedom
  • log-likelihood measures

Cite this

Gellineau, O., & Weir, G. R. S. (2011). Classifying suspicious content using frequency analysis. In G. Weir, S. Ishikawa, & K. Poonpol (Eds.), Corpora and Language Technologies in Teaching, Learning and Research.
@inproceedings{0412e57485cb4e588651631e6abba8c4,
title = "Classifying suspicious content using frequency analysis",
abstract = "This paper details an experiment exploring the use of two statistical similarity measures, chi by degrees of freedom (CBDF) and log-likelihood, applied to single-word and bigram frequencies, as a means of discriminating subject content in order to classify samples of chat text as dangerous, suspicious or innocent. The control for these comparisons was a set of manually ranked sample texts that were rated in terms of eleven subject categories (five considered dangerous and six considered harmless). The results of this manual rating were then compared with the ranked lists generated using the CBDF and log-likelihood measures, for both word and bigram frequencies. The comparison was carried out by combining existing textual analysis tools with a newly implemented software application. Our results show that the CBDF method using word frequencies gave the discrimination closest to the human-rated samples.",
keywords = "CBDF, chi by degrees of freedom, log-likelihood measures",
author = "Gellineau, Obika and Weir, George R. S.",
year = "2011",
language = "English",
isbn = "9780947649821",
editor = "George Weir and S. Ishikawa and K. Poonpol",
booktitle = "Corpora and Language Technologies in Teaching, Learning and Research",
}

