Distributed Information Retrieval (DIR), also known as content-based federated search, is concerned with enabling users to find unstructured or poorly structured documents by their semantic content, using natural language queries that express their information needs. It differs from standard Information Retrieval (IR) in that the documents are held in a number of heterogeneous, distributed resources, each with its own retrieval engine. When many resources are available, the first information access task the user faces is resource selection. Performed manually, this task is ineffective, as users are often unaware of the contents of each resource in terms of quantity, quality, information type, provenance and likely relevance. Users need accurate automatic resource selection tools to assist them, but resource selection in turn requires accurate resource descriptions. These descriptions can be built either manually or automatically, and are particularly hard to derive for non-cooperative resources, that is, resources that allow access to their content through their search engines but provide no information about the content of their archives. This is the typical case for resources in the Deep or Hidden Web. A rough estimate put the size of these resources at 400 times that of the Visible Web. Once the resources have been selected and the query forwarded to them, the results returned by each resource have to be merged by a process called results fusion, so that a single ranked list is produced and presented to the user, with the aim of maximising overall retrieval quality. A large body of research in the last 10 years has shown that the effectiveness of IR systems can be greatly improved by adapting the system to the specific user tasks and needs.
Such adaptation makes it possible to satisfy the user's information need by taking into account the context in which that need arises, thereby personalising the interaction with the system. While a large body of work already exists on personalised and context-dependent IR, the DIR research area has not yet considered issues of personalisation. We believe that designing methods for adaptive DIR, which enable a DIR system to adapt automatically to the user's needs and task, is as important as designing them for standard IR and will bring similar benefits. This project is concerned with designing, implementing and testing models of personalised content-based federated search that will be applied to retrieving information from the Deep Web. This objective will be achieved by designing, implementing and testing advanced resource description, resource selection and results fusion methods that can be automatically personalised to the user's task and needs, and that are specifically designed to access information held in the Deep Web (i.e. in non-cooperative and heterogeneous resources). To the best of our knowledge, this proposal is the first to attempt to tackle this very important and emerging area of research, and it will advance the state of the art of DIR and of every component of the DIR process. The results of this work will also have considerable commercial implications for the design of systems that enable access to the vast wealth of Deep Web resources, which are currently mostly untapped.
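
The three stages named in the abstract (resource description, resource selection and results fusion) can be illustrated with a minimal Python sketch. It is not the project's method: the function names `describe`, `select_resources` and `fuse`, the term-frequency scoring and the min-max score normalisation are all assumptions chosen for brevity. The `user_weights` argument is a hypothetical personalisation hook that simply boosts results from resources a given user is assumed to prefer. For a non-cooperative resource the sampled documents would have to be obtained by querying its search engine; here they are assumed to be already available.

```python
from collections import Counter


def describe(sample_docs):
    """Resource description: term frequencies over a sample of documents."""
    counts = Counter()
    for doc in sample_docs:
        counts.update(doc.lower().split())
    return counts


def select_resources(query, descriptions, k=2):
    """Resource selection: rank resources by the share of sampled term
    occurrences that match a query term, and keep the top k with score > 0."""
    terms = query.lower().split()
    scores = {}
    for name, desc in descriptions.items():
        total = sum(desc.values()) or 1
        scores[name] = sum(desc[t] for t in terms) / total
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return [n for n in ranked[:k] if scores[n] > 0]


def fuse(result_lists, user_weights=None):
    """Results fusion: min-max normalise each resource's scores so they are
    comparable, apply an optional per-resource weight (hypothetical
    personalisation hook), and merge into one ranked list."""
    user_weights = user_weights or {}
    merged = []
    for name, results in result_lists.items():
        if not results:
            continue
        raw = [score for _, score in results]
        lo, hi = min(raw), max(raw)
        weight = user_weights.get(name, 1.0)
        for doc_id, score in results:
            # If all scores are equal, treat every result as a top result.
            norm = 1.0 if hi == lo else (score - lo) / (hi - lo)
            merged.append((weight * norm, name, doc_id))
    # Highest score first; ties broken by resource name, then document id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc_id) for _, name, doc_id in merged]


# Illustrative data only: three toy resources and their sampled documents.
descriptions = {
    "medline": describe(["heart disease treatment", "heart surgery outcomes"]),
    "arxiv": describe(["neural network training", "graph neural models"]),
    "news": describe(["heart of the city festival"]),
}
selected = select_resources("heart disease", descriptions, k=2)

# Scores as each resource's own engine might return them (different scales).
results = {
    "medline": [("m1", 12.0), ("m2", 6.0)],
    "news": [("n1", 0.9), ("n2", 0.3)],
}
default_ranking = fuse(results)
personalised_ranking = fuse(results, user_weights={"news": 2.0})
```

In this toy setting the unweighted fusion interleaves the two resources, whereas the weighted call moves the preferred resource's best result to the top. A real system would need far more robust descriptions and score normalisation than this sketch assumes.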
Effective start/end date: 1/04/08 → 31/10/10
- EPSRC (Engineering and Physical Sciences Research Council): £188,797.00