Coalgebraic Foundations of Semi-Structured Data (EPSRC First Grant)

Project: Research

Description

"Databases are irreplaceable in our modern information society. Classically, information has been stored in rigidly structured databases employing the relational data model and the query language SQL. This has led to highly optimised relational database management systems such as Oracle or Microsoft SQL Server that have large-scale industrial deployment. Underpinning much of the success of these systems has been their close connection with their mathematical foundations within relational algebra and mathematical logic. However, the internet has dramatically changed our understanding of data and databases: i) data on the Web is inherently hierarchical and arranged in a network/graph; and ii) the decentralised nature of the Web means that data comes from various heterogeneous sources, is often incomplete or unreliable and has no uniform structure. Still, Web data retains some structure and, consequently, research on semi-structured data seeks to understand what structure persists and how it can be utilised. Examples of widespread, industrially used data models are XML, JSON and RDF.

Mathematically, semi-structured data is usually represented either as unranked labelled trees, where nodes are accessed via the parent, child and sibling relations, or as labelled graphs, where nodes are accessed via the edge relation. Data elements are stored at the nodes. Due to the special features of Web data mentioned above, query languages for semi-structured data face significantly greater challenges compared to those for data stored in a relational database: i) queries must navigate the path structure within a tree or graph and query the data elements found along such paths to explore important properties of the data; and ii) one must add a common vocabulary to the data - often via an ontology - that provides a logical layer for integrating semantically related data from heterogeneous sources. But as semi-structured data models have become increasingly sophisticated and expressive - e.g. in order to model the uncertainty attached to the data - the development of matching query and ontology languages has struggled to keep pace. Indeed, while there is a large body of work on specific semi-structured data models and their corresponding query languages, there is currently no comprehensive theory that both accounts for existing semi-structured data formats and their query languages, and is able to guide their extension to the next generation of semi-structured data formats. For example, while the widely used query language XPath has recently been extended from XML to graph data, there is currently no agreed mechanism to further extend XPath to graph data with uncertainty.

Our central insight is that coalgebra provides the right level of abstraction to underpin a comprehensive theory of query and ontology languages for semi-structured data. This is because i) coalgebra generalises - via coalgebraic modal logic - the usual modal logics used for traversing trees and graphs; and ii) coalgebra generalises - via coalgebraic logic programming - standard rule-based ontology languages which are based upon logic programming."
Status: Finished
Effective start/end date: 1/02/16 → 31/01/18

Funding

  • EPSRC (Engineering and Physical Sciences Research Council): £99,423.00

Fingerprint

Query languages
Ontology
Data structures
Logic programming
XML
Formal logic
Algebra
Internet
Uncertainty