Semistructured data is an important data format, essentially embodied by the XML standard. It is increasingly used in many critical applications, especially in business-to-business communications and peer-to-peer traffic on the Internet. The main advantage of the semistructured format is that it increases the flexibility of the way the data may be structured: as new situations arise, the structure of the data may evolve as well as the values. For example, given an established stream of business messages about car insurance between multiple brokers and underwriters, one underwriter may decide that the colour of a car is a significant factor not currently included in the data being supplied. They can advertise this fact, and brokers may choose to start asking their clients for the colour of their cars. From this point onwards, a field may start to appear in the messages between brokers and underwriters, the field being optionally included by brokers and optionally acted upon by underwriters where it is present. When much of this kind of activity takes place, it becomes important to consider the structural attributes of the whole, potentially large, pool of data, as well as the individual items. For example, given a set of data items: do they have anything much in common with each other?; do they all have at least something in common, and if so what?; how different is one given item from another, and are there any others exactly the same as this one?; can one or more clusters of similarly or identically-structured items be identified within the pool?; is the pool of data itself evolving or becoming quiescent, that is, over time, are individual items becoming, on the whole, more different or more similar? Recent work we have done on the inherent complexity, and thus regularity, of semistructured data items has led us to an observation that we believe will give great insights into how to answer these and other similar questions. Using some long-established results from Information Theory, we have applied the concept of mechanical entropy to the domain of semistructured data to give a metric for the complexity of individual data items. We have also discovered an efficient way of calculating this, by use of a data structure, the structural fingerprint, which represents the essential structure of the item. We now believe that the reapplication of this work into the above context will give a great leverage in terms of producing useful, quantified answers to the above questions and others, while the use of the structural fingerprint will make it computationally feasible to perform these calculations upon large pools of semistructured data in the global domain of the Internet.
At the start of this project, the concept of structural entropy had been recently defined by the PI as a way of quantifying the observable complexity of data whose detailed structure was unknown. Databases contain data with very regular structure defined by a schema, but more modern data resources, in particular XML and other forms of "semi-structured" data, do not specify the actual format of data, instead outlining only possible different formats. Because of this, the degree of complexity involved in programming against incoming data, even when it conforms to a written definition, cannot be understood in the absence of the data itself and such a metric becomes useful.
The research performed in this project takes the notion of structural entropy a stage further, in using the same concept to quantify differences between information sources.
In outline terms, structural entropy gives some measurement for the "complexity" of an object; complexity here is used in the sense of information theory, and can be thought of as a measure of the quantity of information required to create that object. By analogy, a cubic block of wood and a Christmas tree may have the same constituent components, but a great deal more information is required to build the Christmas tree than the cube from these components, therefore its measurable complexity is much greater.
The key observation that has driven the research is that a comparative evaluation of similarity may be achieved, for any two objects, by comparing their individual complexities against some notion of the complexity of a single, compound object created by their conceptual "merge". As the block and the tree have so little shared information, the task of creating their merge is as great as the sum of their individual complexities; however having created one block, or one tree, no further information is required to create another. For something quite similar, but not identical, only partial further information would be required to create a merged structure: thus the difference between a branch and a tree would be much less than the difference between a branch and a block.
During this project, this concept has been fully formalised and quantified in a formal metric definition, which we have called Structural Entropic Distance. The first key property of this metric that it is bounded: two items which share no common components have a difference of 1.0, and two identical items have a difference of 0; all other differences are in between these extremes. This is a useful property not only in that it allows detection of items with nothing in common, but more importantly it means that differences between different sets of objects can be usefully compared. The other key property is that is is a "proper" metric, meaning that it has certain mathematical properties allowing it to be used, most importantly, in other algorithms for clustering and search based on similarity, using "metric space" techniques.
Within the project, we have shown how the techniques we established can be used to measure the differences between a range of structured objects, and how this can be used to efficiently search large spaces of complex objects for similarity. Such techniques are essential to allow improvement in the emerging world of the "Internet of Things", where searching based on similarity to reference objects is becoming increasingly important.
Key Findings:
The key findings of the project:
1.the identification of Structural Entropic Distance, a bounded proper distance metric over semistructured data trees.
2.the generalisation of the distance metric over other forms of structured data
3.the comparison of SED with other correlation distances, notably Cosine distance, and proof of its superior semantic properties in a number of contexts
Project website:
www.similarity.eu