Indexing without spam

Guido Zuccon, Teerapong Leelanupab, Anthony Nguyen, Leif Azzopardi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution book

4 Citations (Scopus)

Abstract

The presence of spam in a document ranking is a major issue for Web search engines. Common approaches for coping with spam remove from the document rankings those pages that are likely to contain spam. These approaches are implemented as post-retrieval processes that filter out spam pages only after documents have been retrieved with respect to a user's query. In this paper we propose removing spam pages at indexing time, thereby obtaining a pruned index that is virtually "spam-free". We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required for storing the index. More surprisingly, we found that using a spam-pruned version of a collection's index yields no difference in retrieval performance compared with traditional post-retrieval spam filtering approaches.
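
The contrast between the two strategies in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy collection, the spam scores, the `SPAM_THRESHOLD` cut-off, and the term-frequency ranker are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> (text, spam score).
# The scores are invented; here a higher score means "more likely spam".
DOCS = {
    "d1": ("cheap pills buy cheap pills now", 0.95),
    "d2": ("index pruning for web search engines", 0.10),
    "d3": ("spam filtering in web search", 0.20),
    "d4": ("buy web search traffic cheap", 0.90),
}
SPAM_THRESHOLD = 0.5  # assumed cut-off, not taken from the paper


def is_spam(doc_id):
    """Label a document as spam when its score reaches the threshold."""
    return DOCS[doc_id][1] >= SPAM_THRESHOLD


def build_index(doc_ids):
    """Build a term -> {doc_id: term frequency} inverted index."""
    index = defaultdict(dict)
    for doc_id in doc_ids:
        for term in DOCS[doc_id][0].split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by summed term frequency (a stand-in for a real ranker)."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def postings(index):
    """Count (term, document) postings as a rough proxy for index size."""
    return sum(len(p) for p in index.values())


# Post-retrieval filtering: index everything, drop spam from the ranking.
full_index = build_index(DOCS)
post_filtered = [d for d in search(full_index, "web search") if not is_spam(d)]

# Index-time pruning: spam pages never enter the index.
pruned_index = build_index(d for d in DOCS if not is_spam(d))
index_time = search(pruned_index, "web search")

print(post_filtered, index_time)
print(postings(full_index), postings(pruned_index))
```

In this toy setting both strategies return the same ranking while the pruned index holds fewer postings. With a ranker that depends on collection statistics (document frequencies, average document length), pruning changes those statistics, so identical rankings are not guaranteed in general; that is the question the paper evaluates empirically.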

Language: English
Title of host publication: ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium
Place of Publication: Melbourne, Vic.
Pages: 6-13
Number of pages: 8
Publication status: Published - 1 Dec 2011
Event: 16th Australasian Document Computing Symposium, ADCS 2011 - Canberra, ACT, Australia
Duration: 2 Dec 2011 to 2 Dec 2011

Conference

Conference: 16th Australasian Document Computing Symposium, ADCS 2011
Country: Australia
City: Canberra, ACT
Period: 2/12/11 to 2/12/11

Keywords

  • efficiency
  • index pruning
  • information retrieval
  • spam
  • web search
  • indexing (of information)
  • search engines
  • document ranking

Cite this

Zuccon, G., Leelanupab, T., Nguyen, A., & Azzopardi, L. (2011). Indexing without spam. In ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium (pp. 6-13). Melbourne, Vic.
Zuccon, Guido ; Leelanupab, Teerapong ; Nguyen, Anthony ; Azzopardi, Leif. / Indexing without spam. ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium. Melbourne, Vic., 2011. pp. 6-13
@inproceedings{4a8ddde1b7e6465fa1bbabd2fb65a809,
title = "Indexing without spam",
abstract = "The presence of spam in a document ranking is a major issue for Web search engines. Common approaches that cope with spam remove from the document rankings those pages that are likely to contain spam. These approaches are implemented as post-retrieval processes, that filter out spam pages only after documents have been retrieved with respect to a user's query. In this paper we propose removing spam pages at indexing time, therefore obtaining a pruned index that is virtually {"}spam-free{"}. We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required for storing the index. Surprisingly instead, we found that by considering a spam-pruned version of a collection's index, no difference in retrieval performance is found when compared to that obtained by traditional post-retrieval spam filtering approaches.",
keywords = "efficiency, index pruning, information retrieval, spam, web search, indexing (of information), search engines, document ranking",
author = "Guido Zuccon and Teerapong Leelanupab and Anthony Nguyen and Leif Azzopardi",
year = "2011",
month = "12",
day = "1",
language = "English",
isbn = "9781921426926",
pages = "6--13",
booktitle = "ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium",
}

Zuccon, G, Leelanupab, T, Nguyen, A & Azzopardi, L 2011, Indexing without spam. in ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium. Melbourne, Vic., pp. 6-13, 16th Australasian Document Computing Symposium, ADCS 2011, Canberra, ACT, Australia, 2/12/11.

TY - GEN

T1 - Indexing without spam

AU - Zuccon, Guido

AU - Leelanupab, Teerapong

AU - Nguyen, Anthony

AU - Azzopardi, Leif

PY - 2011/12/1

Y1 - 2011/12/1

N2 - The presence of spam in a document ranking is a major issue for Web search engines. Common approaches that cope with spam remove from the document rankings those pages that are likely to contain spam. These approaches are implemented as post-retrieval processes, that filter out spam pages only after documents have been retrieved with respect to a user's query. In this paper we propose removing spam pages at indexing time, therefore obtaining a pruned index that is virtually "spam-free". We investigate the benefits of this approach from three points of view: indexing time, index size, and retrieval performance. Not surprisingly, we found that the strategy decreases both the time required by the indexing process and the space required for storing the index. Surprisingly instead, we found that by considering a spam-pruned version of a collection's index, no difference in retrieval performance is found when compared to that obtained by traditional post-retrieval spam filtering approaches.

KW - efficiency

KW - index pruning

KW - information retrieval

KW - spam

KW - web search

KW - indexing (of information)

KW - search engines

KW - document ranking

UR - http://www.scopus.com/inward/record.url?scp=84871627450&partnerID=8YFLogxK

M3 - Conference contribution book

SN - 9781921426926

SP - 6

EP - 13

BT - ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium

CY - Melbourne, Vic.

ER -

Zuccon G, Leelanupab T, Nguyen A, Azzopardi L. Indexing without spam. In ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium. Melbourne, Vic. 2011. p. 6-13