Cloud-based textual analysis as a basis for document classification

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Growing trends in data mining and developments in machine learning have encouraged interest in analytical techniques that can contribute insights into data characteristics. The present paper describes an approach to textual analysis that generates extensive quantitative data on target documents, with output including frequency data on tokens, types, parts-of-speech and word n-grams. These analytical results enrich the available source data and have proven useful in several contexts as a basis for automating manual classification tasks. In the following, we introduce the Posit textual analysis toolset and detail its use in data enrichment as input to supervised learning tasks, including automating the identification of extremist Web content. Next, we describe the extension of this approach to the Arabic language. Thereafter, we recount the move of these analytical facilities from local operation to a Cloud-based service. This transition affords easy remote access for other researchers seeking to explore the application of such data enrichment to their own text-based data sets.
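
As a rough illustration of the kind of quantitative output the abstract mentions (frequency data on tokens, types and word n-grams), the following Python sketch computes a small feature set for one document. It is not the Posit implementation: the function names `tokenize`, `ngram_counts` and `text_features`, the feature names, and the simple tokenization rule are all assumptions made for this example, and part-of-speech frequencies are omitted because they would require an external tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple, assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    """Count contiguous word n-grams of length n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def text_features(text, max_n=3):
    """Return a flat dict of frequency features for one document (hypothetical feature names)."""
    tokens = tokenize(text)
    types = set(tokens)  # distinct word forms
    features = {
        "total_tokens": len(tokens),
        "total_types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
    }
    # For each n-gram length, record how many n-grams occur and how many are distinct.
    for n in range(2, max_n + 1):
        counts = ngram_counts(tokens, n)
        features[f"{n}gram_tokens"] = sum(counts.values())
        features[f"{n}gram_types"] = len(counts)
    return features


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    print(text_features(sample))
```

A dictionary of this shape, computed per document, could serve as one row of a feature table for a supervised classifier, which is the data-enrichment role the abstract describes.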

Language: English
Title of host publication: 2018 International Conference on High Performance Computing & Simulation (HPCS)
Editors: Khalid Zine-Dine, Waleed W. Smari
Place of Publication: Piscataway, New Jersey
Publisher: IEEE
Pages: 629-633
Number of pages: 5
ISBN (Print): 9781538678787
DOIs: https://doi.org/10.1109/HPCS.2018.00110
Publication status: E-pub ahead of print - 1 Nov 2018
Event: 16th International Conference on High Performance Computing and Simulation, HPCS 2018 - Orleans, France
Duration: 16 Jul 2018 to 20 Jul 2018

Conference

Conference: 16th International Conference on High Performance Computing and Simulation, HPCS 2018
Country: France
City: Orleans
Period: 16/07/18 to 20/07/18

Fingerprint

Document Classification
Supervised learning
Data mining
Learning systems
N-gram
Machine Learning
Target
Output

Keywords

  • classification
  • cloud-service
  • data mining
  • featureset
  • posit
  • textual analysis

Cite this

Weir, G., Owoeye, K., Oberacker, A., & Alshahrani, H. (2018). Cloud-based textual analysis as a basis for document classification. In K. Zine-Dine, & W. W. Smari (Eds.), 2018 International Conference on High Performance Computing & Simulation (HPCS) (pp. 629-633). Piscataway, New Jersey: IEEE. https://doi.org/10.1109/HPCS.2018.00110
Weir, George ; Owoeye, Kolade ; Oberacker, Alice ; Alshahrani, Haya. / Cloud-based textual analysis as a basis for document classification. 2018 International Conference on High Performance Computing & Simulation (HPCS). editor / Khalid Zine-Dine ; Waleed W. Smari. Piscataway, New Jersey : IEEE, 2018. pp. 629-633
@inproceedings{8d44148bb01b49138781da1a1d6bc6a3,
title = "Cloud-based textual analysis as a basis for document classification",
keywords = "classification, cloud-service, data mining, featureset, posit, textual analysis",
author = "George Weir and Kolade Owoeye and Alice Oberacker and Haya Alshahrani",
year = "2018",
month = "11",
day = "1",
doi = "10.1109/HPCS.2018.00110",
language = "English",
isbn = "9781538678787",
pages = "629--633",
editor = "Khalid Zine-Dine and Smari, {Waleed W.}",
booktitle = "2018 International Conference on High Performance Computing \& Simulation (HPCS)",
publisher = "IEEE",

}

Weir, G, Owoeye, K, Oberacker, A & Alshahrani, H 2018, Cloud-based textual analysis as a basis for document classification. in K Zine-Dine & WW Smari (eds), 2018 International Conference on High Performance Computing & Simulation (HPCS). IEEE, Piscataway, New Jersey, pp. 629-633, 16th International Conference on High Performance Computing and Simulation, HPCS 2018, Orleans, France, 16/07/18. https://doi.org/10.1109/HPCS.2018.00110

Cloud-based textual analysis as a basis for document classification. / Weir, George; Owoeye, Kolade; Oberacker, Alice; Alshahrani, Haya.

2018 International Conference on High Performance Computing & Simulation (HPCS). ed. / Khalid Zine-Dine; Waleed W. Smari. Piscataway, New Jersey : IEEE, 2018. p. 629-633.

TY - GEN

T1 - Cloud-based textual analysis as a basis for document classification

AU - Weir, George

AU - Owoeye, Kolade

AU - Oberacker, Alice

AU - Alshahrani, Haya

PY - 2018/11/1

Y1 - 2018/11/1

KW - classification

KW - cloud-service

KW - data mining

KW - featureset

KW - posit

KW - textual analysis

UR - http://www.scopus.com/inward/record.url?scp=85057432298&partnerID=8YFLogxK

UR - https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=8513913

U2 - 10.1109/HPCS.2018.00110

DO - 10.1109/HPCS.2018.00110

M3 - Conference contribution

SN - 9781538678787

SP - 629

EP - 633

BT - 2018 International Conference on High Performance Computing & Simulation (HPCS)

A2 - Zine-Dine, Khalid

A2 - Smari, Waleed W.

PB - IEEE

CY - Piscataway, New Jersey

ER -

Weir G, Owoeye K, Oberacker A, Alshahrani H. Cloud-based textual analysis as a basis for document classification. In Zine-Dine K, Smari WW, editors, 2018 International Conference on High Performance Computing & Simulation (HPCS). Piscataway, New Jersey: IEEE. 2018. p. 629-633 https://doi.org/10.1109/HPCS.2018.00110