Connecting firm's web scraped textual content to body of science: utilizing Microsoft Academic Graph hierarchical topic modeling

Arash Hajikhani*, Lukas Pukelis, Arho Suominen, Sajad Ashouri, Torben Schubert, Ad Notten, Scott W. Cunningham

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)
54 Downloads (Pure)

Abstract

This paper demonstrates a method for transforming textual information scraped from companies' websites and linking it to the scientific body of knowledge. The method illustrates the benefit of Natural Language Processing (NLP) in creating links between established economic classification systems and the novel, agile constructs that new data sources enable. To this end, we experimented with the European classification of economic activities (known as NACE) at the sectoral and company levels, and used companies' website content to establish a connection with Microsoft Academic Graph hierarchical topic modeling. Central to the operationalization of our method are a web scraping process, NLP, and a data transformation/linkage procedure. The method comprises three main steps: data source identification, raw data retrieval, and data preparation and transformation. These steps are applied to two distinct data sources.
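The preparation and linkage step can be illustrated with a minimal sketch. It is not the authors' implementation: the topic hierarchy `TOPICS`, the helper names `html_to_text`, `tokenize`, and `link_to_topics`, and the sample HTML are all hypothetical, and plain keyword counting stands in for the NLP and Microsoft Academic Graph topic matching described in the abstract. Raw data retrieval is replaced by an inline HTML string so the example runs without network access.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Data preparation: strip markup and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical two-level topic hierarchy (parent field -> child topic -> terms).
# An assumption for illustration, not the Microsoft Academic Graph taxonomy.
TOPICS = {
    "Engineering": {
        "Renewable energy": {"solar", "wind", "photovoltaic"},
        "Energy storage": {"battery", "storage"},
    },
    "Computer science": {
        "Machine learning": {"learning", "neural", "model"},
    },
}


def link_to_topics(text, topics=TOPICS):
    """Transformation/linkage: score each (parent, child) topic by term hits."""
    counts = Counter(tokenize(text))
    scores = {}
    for parent, children in topics.items():
        for child, terms in children.items():
            hits = sum(counts[term] for term in terms)
            if hits:
                scores[(parent, child)] = hits
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scraped page for one company.
SAMPLE_HTML = (
    "<html><head><style>p{}</style></head><body>"
    "<p>We build solar panels and battery storage.</p>"
    "<script>var x=1;</script></body></html>"
)

if __name__ == "__main__":
    for (parent, child), score in link_to_topics(html_to_text(SAMPLE_HTML)):
        print(f"{parent} > {child}: {score}")
```

In a fuller pipeline, the scores for each company could be aggregated by NACE code to compare sectors, and the keyword sets would be replaced by whatever topic model or embedding-based matcher is actually used.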
Original language: English
Article number: 101650
Number of pages: 10
Journal: MethodsX
Volume: 9
Early online date: 10 Mar 2022
Publication status: Published - 10 Mar 2022

Keywords

  • Natural Language Processing (NLP)
  • NACE
  • data methods
  • data transformation
