Abstract
We investigate a distance metric, previously defined for the measurement of structured data, in the more general context of vector spaces. The metric has a basis in information theory and assesses the distance between two vectors in terms of their relative information content. The resulting metric gives an outcome based on the dimensional correlation, rather than magnitude, of the input vectors, in a manner similar to Cosine Distance. In this paper the metric is defined, and assessed, in comparison with Cosine Distance, for its major properties: semantics, properties for use within similarity search, and evaluation efficiency.
We find that it is fairly well correlated with Cosine Distance in dense spaces, but its semantics are in some cases preferable. In a sparse space, it significantly outperforms Cosine Distance over TREC data and queries, the only large collection for which we have a human-ratified ground truth. This result is backed up by another experiment over MovieLens data. In dense Cartesian spaces it has better properties for use with similarity indices than either Cosine or Euclidean Distance. In its definitional form it is very expensive to evaluate for high-dimensional sparse vectors; to counter this, we show an algebraic rewrite which allows its evaluation to be performed more efficiently.
Overall, when a multivariate correlation metric is required over positive vectors, SED seems to be a better choice than Cosine Distance in many circumstances.
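As an illustration of the abstract's remark that the outcome depends on dimensional correlation rather than magnitude, the following minimal sketch computes Cosine Distance for small positive vectors. It assumes one common definition (1 minus cosine similarity); the paper may use a different variant, and the function name and sample vectors are hypothetical. It does not implement SED, which the abstract does not define.

```python
import math


def cosine_distance(v, w):
    """Cosine Distance, assumed here to be 1 - cosine similarity."""
    if len(v) != len(w):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0.0 or norm_w == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_v * norm_w)


# Hypothetical sample data: 'scaled' points in the same direction as 'v'.
v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]
other = [3.0, 2.0, 1.0]

# Scaling a vector leaves the distance at (numerically) zero,
# while a differently distributed vector gives a positive distance.
print(cosine_distance(v, scaled))
print(cosine_distance(v, other))
```

Scaling a vector changes its magnitude but not how its mass is spread across dimensions, so the first distance is zero up to floating-point error, while the second is clearly positive.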
Original language  English 

Title of host publication  Similarity search and applications 
Subtitle of host publication  5th international conference, SISAP 2012 proceedings 
Editors  Gonzalo Navarro, Vladimir Pestov 
Place of Publication  Berlin 
Publisher  Springer-Verlag 
Pages  209-225 
Number of pages  17 
ISBN (Print)  978-3-642-32152-8 
DOIs  
Publication status  Published  2012 
Event  5th International Conference on Similarity Search and Applications (SISAP)  Toronto, Canada Duration: 9 Aug 2012 → 10 Aug 2012 
Publication series
Name  Lecture Notes in Computer Science 

Publisher  Springer 
Volume  7404 
ISSN (Print)  0302-9743 
Conference
Conference  5th International Conference on Similarity Search and Applications (SISAP) 

Country  Canada 
City  Toronto 
Period  9/08/12 → 10/08/12 
Keywords
 correlation distance
 vector spaces
 multivariate
 distance metric
 similarity search
 cosine distance
 multivariate correlation
Fingerprint Dive into the research topics of 'A multivariate correlation distance for vector spaces'. Together they form a unique fingerprint.
Projects
 1 Finished

Structural Comparison of Labelled Graph Data
Connor, R.
EPSRC (Engineering and Physical Sciences Research Council)
1/10/09 → 30/09/12
Project: Research