A multivariate correlation distance for vector spaces

Richard Connor, Robert George Moss

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution book

6 Citations (Scopus)

Abstract

We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined, and assessed, in comparison with Cosine Distance, for its major properties: semantics, properties for use within similarity search, and evaluation efficiency.
We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows its evaluation to be performed more efficiently.
Overall, when a multivariate correlation metric is required over positive vectors, the metric studied here, Structural Entropic Distance (SED), seems to be a better choice than Cosine Distance in many circumstances.
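As a rough illustration of the abstract's contrast between correlation-based and magnitude-based comparison, the sketch below computes Cosine Distance alongside an entropy-based stand-in (Jensen-Shannon divergence over L1-normalised vectors). The stand-in is an assumption chosen for illustration only: this record does not reproduce the paper's definition of SED, which may differ. All function names are hypothetical.

```python
import math


def cosine_distance(v, w):
    """Return 1 - cos(angle between v and w); scaling either vector has no effect."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / (norm_v * norm_w)


def entropy_bits(p):
    """Shannon entropy in bits of a distribution; zero entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(v, w):
    """Jensen-Shannon divergence (bits) of two non-negative vectors.

    Assumption: used here as a stand-in for an information-theoretic
    distance over positive vectors; it is NOT the paper's SED definition.
    """
    total_v, total_w = sum(v), sum(w)
    p = [a / total_v for a in v]          # L1-normalise so entries sum to 1
    q = [b / total_w for b in w]
    mean = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy_bits(mean) - (entropy_bits(p) + entropy_bits(q)) / 2


if __name__ == "__main__":
    v = [1.0, 2.0, 3.0]
    scaled = [2.0, 4.0, 6.0]              # same direction, double the magnitude
    print(cosine_distance(v, scaled), js_divergence(v, scaled))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]),
          js_divergence([1.0, 0.0], [0.0, 1.0]))
```

Both functions ignore overall magnitude: a vector and a scaled copy of it are at (numerically) zero distance, while vectors with no shared non-zero dimensions sit at the maximum of each measure.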
Language: English
Title of host publication: Similarity search and applications
Subtitle of host publication: 5th international conference, SISAP 2012 proceedings
Editors: Gonzalo Navarro, Vladimir Pestov
Place of Publication: Berlin
Publisher: Springer-Verlag
Pages: 209-225
Number of pages: 17
ISBN (Print): 9783642321528
DOI: https://doi.org/10.1007/978-3-642-32153-5_15
Publication status: Published - 2012
Event: 5th International Conference on Similarity Search and Applications (SISAP) - Toronto, Canada
Duration: 9 Aug 2012 to 10 Aug 2012

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 7404
ISSN (Print): 0302-9743

Conference

Conference: 5th International Conference on Similarity Search and Applications (SISAP)
Country: Canada
City: Toronto
Period: 9/08/12 to 10/08/12

Fingerprint

Vector spaces
Semantics
Information theory
Experiments

Keywords

  • correlation distance
  • vector spaces
  • multivariate
  • distance metric
  • similarity search
  • cosine distance
  • multivariate correlation

Cite this

Connor, R., & Moss, R. G. (2012). A multivariate correlation distance for vector spaces. In G. Navarro, & V. Pestov (Eds.), Similarity search and applications: 5th international conference, SISAP 2012 proceedings (pp. 209-225). (Lecture Notes in Computer Science; Vol. 7404). Berlin: Springer-Verlag. https://doi.org/10.1007/978-3-642-32153-5_15
@inproceedings{af129b0ce4d24f52ba87edb1ad9623b7,
title = "A multivariate correlation distance for vector spaces",
abstract = "We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined, and assessed, in comparison with Cosine Distance, for its major properties: semantics, properties for use within similarity search, and evaluation efficiency. We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over movielens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows its evaluation to be performed more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, SED seems to be a better choice than Cosine Distance in many circumstances.",
keywords = "correlation distance, vector spaces, multivariate, distance metric, similarity search, cosine distance, multivariate correlation",
author = "Richard Connor and Moss, {Robert George}",
year = "2012",
doi = "10.1007/978-3-642-32153-5_15",
language = "English",
isbn = "9783642321528",
series = "Lecture Notes in Computer Science",
publisher = "Springer-Verlag",
pages = "209--225",
editor = "Gonzalo Navarro and Vladimir Pestov",
booktitle = "Similarity search and applications",

}

TY - GEN
T1 - A multivariate correlation distance for vector spaces
AU - Connor, Richard
AU - Moss, Robert George
PY - 2012
Y1 - 2012
N2 - We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined, and assessed, in comparison with Cosine Distance, for its major properties: semantics, properties for use within similarity search, and evaluation efficiency. We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over movielens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows its evaluation to be performed more efficiently. Overall, when a multivariate correlation metric is required over positive vectors, SED seems to be a better choice than Cosine Distance in many circumstances.
KW - correlation distance
KW - vector spaces
KW - multivariate
KW - distance metric
KW - similarity search
KW - cosine distance
KW - multivariate correlation
UR - http://sisap.org/2012/
U2 - 10.1007/978-3-642-32153-5_15
DO - 10.1007/978-3-642-32153-5_15
M3 - Conference contribution book
SN - 9783642321528
T3 - Lecture Notes in Computer Science
SP - 209
EP - 225
BT - Similarity search and applications
A2 - Navarro, Gonzalo
A2 - Pestov, Vladimir
PB - Springer-Verlag
CY - Berlin
ER -
