Projet de recherche doctoral numero :3863

Description

Date depot: 1 janvier 1900
Titre: Uncertainty over Intensional Data
Directeur de thèse: Pierre SENELLART (DIENS)
Domaine scientifique: Sciences et technologies de l'information et de la communication
Thématique CNRS : Non defini

Resumé: In contrast with traditional data management applications, a wealth of useful data available today is uncertain and limited by access restrictions. For instance, though originally intended for human consumption, the World Wide Web is currently the most comprehensive information source accessible to automated agents. A major challenge is to structure this information and answer complex queries over it. To do so, many types of data sources must be integrated: semantic Web sources such as SPARQL endpoints which provide structured data; hidden Web sources out of which information can be extracted by crawling and wrapper induction; semi-structured HTML documents and natural language text on which information extraction can be applied; and possibly human sources in the context of crowdsourcing. However, information on the Web is {uncertain}: first, the data itself can be biased, wrong, or annotated with confidence values; second, uncertainty can accumulate over the various steps needed to extract the data in a structured form. Besides, information on the Web is {intensional}: the Web does not reside on one machine, but is distributed and must be accessed piece by piece by querying a wide range of separate independent services. For this reason, the interplay between uncertainty and intensionality is at the heart of the management of structured data obtained from the Web. {{{Uncertainty}}} Uncertain data can be seen mathematically as a set of possible states of the data (i.e., the possible worlds), which can be equipped with a probability distribution. However, because there can be an infinite number of possible worlds, a finite implicit representation of the uncertain data is needed to perform computation on it. There are numerous formalisms for such representations, depending on the kind of data being studied. Probabilistic relational databases store uncertain relational data and are designed to answer relational queries efficiently. For tree-like data such as XML, numerous probabilistic frameworks exist. On a more fundamental level, the usual representation of an uncertain string is a weighted finite-state automaton for the string, which links the management of uncertainty to the theory of formal languages. The different ways to represent uncertain data can be compared according to their expressiveness (which probability distributions can be represented by a given model and how are the representations computed?), their size (how large is the representation of a distribution and how far can a representation be compressed or approximated?), and their efficiency (what is the complexity of computing such a model from examples, or of answering queries over such models?). The performance of the various models over these criteria and the relations between those models are not yet adequately understood. {{{Intensionality}}} Unlike in traditional data management settings, data from the Web must often be examined in an intensional manner: the data is not really present on the computing device but can be provided on-demand by separate services. Hard constraints such as access patterns, or soft constraints such as bandwidth limitations, may be imposed when accessing the services. In this setting, several important problems arise. First, the intensional nature of the Web makes it difficult to determine if the set of available access patterns is sufficient to evaluate a given query. The decidability and complexity of problems such as the immediate or long-term relevance of an access for a certain query have been studied and is closely linked to the problem of query containment for various query languages. Second, even if a query is feasible, it may be hard to find an execution plan which does not impose unreasonable load on the services. Because crawling the entire Web to answer a query is not realistic, we must limit the number of accesses we perform. In extreme cases such as crowdsourcing in which some services are provided by humans, the cost of each individual access may be very high. {{{Linking Uncertainty and Intensionality}}} Uncertainty can be used as a way to manage intensionality: data which has been accessed can be uncertain, but data which has not yet been accessed is even more uncertain, and we can see queries as a way to spend resources to reduce uncertainty. Uncertainty can also be used as a way to manage overlapping, conflicting or incomplete information from independent intensional sources. Ideally, our estimation of the uncertainty of data should be a combination of the existing probability distribution of the data, the combined confidence of the various extraction steps, the proportion in which involved sources agree, and a tradeoff between harvesting more information and limiting the resources used by a query.

Doctorant.e: Amarilli Antoine