Towards a universal information distance for structured data

Richard Connor, Fabio Simeoni, Michael Iakovos, Robert George Moss

Research output: Chapter in Book/Report/Conference proceeding - Conference contribution book

2 Citations (Scopus)

Abstract

The similarity of objects is one of the most fundamental concepts in any collection of complex information; similarity, along with techniques for storing and indexing sets of values based on it, is a concept of ever increasing importance as inherently unordered data sets become ever more common. Examples of such datasets include collections of images, multimedia, and semi-structured data. There are however two, largely separate, classes of related research. On the one hand, techniques such as clustering and similarity search give general treatments over sets of data. Results are domain-independent, typically relying only on the existence of an anonymous distance metric over the set in question. On the other hand, results in the domain of similarity measurement are often limited to the context of pairwise comparison over individual objects, and are not typically set in a wider context. Published algorithms are scattered over various demand-led subject areas, including for example bioinformatics, library sciences, and crime detection. Few, if any, of the published algorithms have the distance metric properties. We have identified a distance metric, Ensemble Distance, which we believe can help to bridge this gap. Ensemble Distance is a non-Euclidean distance metric which we believe can be used in the treatment of many classes of structured data. For any complex type where a useful characterisation exists in the form of an ensemble, we can produce a distance metric for that type. This will in turn allow use of the complex type within off-the-shelf clustering and similarity search algorithms; this would be a major result in the management of complex data sets.
Language: English
Title of host publication: SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications
Editors: Alfredo Ferro
Pages: 69-77
Number of pages: 9
DOI: https://doi.org/10.1145/1995412.1995426
Publication status: Published - 30 Jun 2011

Keywords

  • universal information distance
  • structured data

Cite this

Connor, R., Simeoni, F., Iakovos, M., & Moss, R. G. (2011). Towards a universal information distance for structured data. In A. Ferro (Ed.), SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications (pp. 69-77). https://doi.org/10.1145/1995412.1995426
Connor, Richard ; Simeoni, Fabio ; Iakovos, Michael ; Moss, Robert George. / Towards a universal information distance for structured data. SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications. editor / Alfredo Ferro. 2011. pp. 69-77
@inproceedings{52513729eba64767afae246817f98665,
title = "Towards a universal information distance for structured data",
abstract = "The similarity of objects is one of the most fundamental concepts in any collection of complex information; similarity, along with techniques for storing and indexing sets of values based on it, is a concept of ever increasing importance as inherently unordered data sets become ever more common. Examples of such datasets include collections of images, multimedia, and semi-structured data. There are however two, largely separate, classes of related research. On the one hand, techniques such as clustering and similarity search give general treatments over sets of data. Results are domain-independent, typically relying only on the existence of an anonymous distance metric over the set in question. On the other hand, results in the domain of similarity measurement are often limited to the context of pairwise comparison over individual objects, and are not typically set in a wider context. Published algorithms are scattered over various demand-led subject areas, including for example bioinformatics, library sciences, and crime detection. Few, if any, of the published algorithms have the distance metric properties. We have identified a distance metric, Ensemble Distance, which we believe can help to bridge this gap. Ensemble Distance is a non-Euclidean distance metric which we believe can be used in the treatment of many classes of structured data. For any complex type where a useful characterisation exists in the form of an ensemble, we can produce a distance metric for that type. This will in turn allow use of the complex type within off-the-shelf clustering and similarity search algorithms; this would be a major result in the management of complex data sets.",
keywords = "universal information distance, structured data",
author = "Connor, Richard and Simeoni, Fabio and Iakovos, Michael and Moss, {Robert George}",
year = "2011",
month = "6",
day = "30",
doi = "10.1145/1995412.1995426",
language = "English",
isbn = "978-1-4503-0795-6",
pages = "69--77",
editor = "Alfredo Ferro",
booktitle = "SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications",

}

Connor, R, Simeoni, F, Iakovos, M & Moss, RG 2011, Towards a universal information distance for structured data. in A Ferro (ed.), SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications. pp. 69-77. https://doi.org/10.1145/1995412.1995426

Towards a universal information distance for structured data. / Connor, Richard; Simeoni, Fabio; Iakovos, Michael; Moss, Robert George.

SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications. ed. / Alfredo Ferro. 2011. p. 69-77.

TY - GEN

T1 - Towards a universal information distance for structured data

AU - Connor, Richard

AU - Simeoni, Fabio

AU - Iakovos, Michael

AU - Moss, Robert George

PY - 2011/6/30

Y1 - 2011/6/30

AB - The similarity of objects is one of the most fundamental concepts in any collection of complex information; similarity, along with techniques for storing and indexing sets of values based on it, is a concept of ever increasing importance as inherently unordered data sets become ever more common. Examples of such datasets include collections of images, multimedia, and semi-structured data. There are however two, largely separate, classes of related research. On the one hand, techniques such as clustering and similarity search give general treatments over sets of data. Results are domain-independent, typically relying only on the existence of an anonymous distance metric over the set in question. On the other hand, results in the domain of similarity measurement are often limited to the context of pairwise comparison over individual objects, and are not typically set in a wider context. Published algorithms are scattered over various demand-led subject areas, including for example bioinformatics, library sciences, and crime detection. Few, if any, of the published algorithms have the distance metric properties. We have identified a distance metric, Ensemble Distance, which we believe can help to bridge this gap. Ensemble Distance is a non-Euclidean distance metric which we believe can be used in the treatment of many classes of structured data. For any complex type where a useful characterisation exists in the form of an ensemble, we can produce a distance metric for that type. This will in turn allow use of the complex type within off-the-shelf clustering and similarity search algorithms; this would be a major result in the management of complex data sets.

KW - universal information distance

KW - structured data

U2 - 10.1145/1995412.1995426

DO - 10.1145/1995412.1995426

M3 - Conference contribution book

SN - 978-1-4503-0795-6

SP - 69

EP - 77

BT - SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications

A2 - Ferro, Alfredo

ER -

Connor R, Simeoni F, Iakovos M, Moss RG. Towards a universal information distance for structured data. In Ferro A, editor, SISAP '11 Proceedings of the Fourth International Conference on SImilarity Search and APplications. 2011. p. 69-77. https://doi.org/10.1145/1995412.1995426