Building a large-scale corpus for evaluating event detection on twitter

Andrew J. McMinn, Yashar Moshfeghi, Joemon M. Jose

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution book

79 Citations (Scopus)

Abstract

Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of the tweets and the difficulty of creating relevance judgements at such a large scale. Creating relevance judgements for the task of event detection is further hampered by ambiguity in the definition of event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter, and using this definition, we generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we make use of existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets, covering more than 500 events, which can be used for the evaluation of event detection approaches.
Language: English
Title of host publication: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Place of Publication: New York, NY
Pages: 409-418
Number of pages: 10
DOI: https://doi.org/10.1145/2505515.2505695
Publication status: Published - 27 Oct 2013
Event: 22nd ACM International Conference on Information & Knowledge Management - San Francisco, United States
Duration: 27 Oct 2013 to 1 Nov 2013

Conference

Conference: 22nd ACM International Conference on Information & Knowledge Management
Country: United States
City: San Francisco
Period: 27/10/13 to 1/11/13

Keywords

  • test collection
  • twitter
  • event detection
  • social media
  • crowdsourcing
  • reproducibility

Cite this

McMinn, A. J., Moshfeghi, Y., & Jose, J. M. (2013). Building a large-scale corpus for evaluating event detection on twitter. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (pp. 409-418). New York, NY. https://doi.org/10.1145/2505515.2505695
McMinn, Andrew J. ; Moshfeghi, Yashar ; Jose, Joemon M. / Building a large-scale corpus for evaluating event detection on twitter. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY, 2013. pp. 409-418
@inproceedings{1661b73e0c7d48829febc56dae4119b0,
title = "Building a large-scale corpus for evaluating event detection on twitter",
abstract = "Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of the tweets and the difficulty of creating relevance judgements at such a large scale. Creating relevance judgements for the task of event detection is further hampered by ambiguity in the definition of event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter, and using this definition, we generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we make use of existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets, covering more than 500 events, which can be used for the evaluation of event detection approaches.",
keywords = "test collection, twitter, event detection, social media, crowdsourcing, reproducibility",
author = "McMinn, {Andrew J.} and Yashar Moshfeghi and Jose, {Joemon M.}",
year = "2013",
month = "10",
day = "27",
doi = "10.1145/2505515.2505695",
language = "English",
isbn = "9781450322638",
pages = "409--418",
booktitle = "Proceedings of the 22nd ACM International Conference on Information \& Knowledge Management",

}

McMinn, AJ, Moshfeghi, Y & Jose, JM 2013, Building a large-scale corpus for evaluating event detection on twitter. in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY, pp. 409-418, 22nd ACM International Conference on Information & Knowledge Management, San Francisco, United States, 27/10/13. https://doi.org/10.1145/2505515.2505695

Building a large-scale corpus for evaluating event detection on twitter. / McMinn, Andrew J.; Moshfeghi, Yashar; Jose, Joemon M.

Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY, 2013. p. 409-418.

TY - GEN

T1 - Building a large-scale corpus for evaluating event detection on twitter

AU - McMinn, Andrew J.

AU - Moshfeghi, Yashar

AU - Jose, Joemon M.

PY - 2013/10/27

Y1 - 2013/10/27

AB - Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of the tweets and the difficulty of creating relevance judgements at such a large scale. Creating relevance judgements for the task of event detection is further hampered by ambiguity in the definition of event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter, and using this definition, we generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we make use of existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets, covering more than 500 events, which can be used for the evaluation of event detection approaches.

KW - test collection

KW - twitter

KW - event detection

KW - social media

KW - crowdsourcing

KW - reproducibility

UR - https://dl.acm.org/

UR - https://www.cikm2013.org/

U2 - 10.1145/2505515.2505695

DO - 10.1145/2505515.2505695

M3 - Conference contribution book

SN - 9781450322638

SP - 409

EP - 418

BT - Proceedings of the 22nd ACM International Conference on Information & Knowledge Management

CY - New York, NY

ER -

McMinn AJ, Moshfeghi Y, Jose JM. Building a large-scale corpus for evaluating event detection on twitter. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York, NY. 2013. p. 409-418 https://doi.org/10.1145/2505515.2505695