MyEDB

Description

Submission date: 1 January 1900
Title: Intelligent Content Acquisition in Web Archiving
Thesis advisor: Pierre SENELLART (DIENS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract:

Background
----------

Since its beginnings in the 1990s, the set of hyperlinked resources known as the World Wide Web has become an extremely valuable source of information. To provide easy access to this content, search engines such as Google or Bing crawl the Web and index its pages. But crawling technologies also have an important application in the field of Web archiving [1]: the storage of (versioned) archives of the Web, to ensure continued access to its resources despite data deletion, to carry out temporal analyses of its content, or to let a user see a Web page or a Web site as it was at a given date in the past.

This PhD topic is part of the ARCOMEM European project (2011-2013). This project, involving 12 institutions around Europe, aims at leveraging the wisdom of the crowds for intelligent preservation of Web content.

PhD Topic Description
---------------------

The purpose of this PhD is to develop intelligent and adaptive models, methods, algorithms, and tools for making the content acquisition process in Web archiving more effective and efficient. The objective is to leverage existing work (in particular inside the ARCOMEM consortium) on the extraction and analysis of events, entities, topics, opinions, perceptions, etc., to select and prioritize sources to be crawled. This will be combined with techniques that go beyond traditional page-level crawling, allowing object-level, goal-driven crawling of the Web. The focus is on unsupervised methods that can scale to the whole Web.

In particular, the following aspects will be covered:

- assessing the relevance [2], importance, and coverage of available content with respect to a Web archiving task;
- combining evidence to select or prioritize the sources to be crawled [3];
- accessing content at the level of objects inside Web pages [4] and hidden behind deep Web forms [5] or Web 2.0 applications.
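
As a purely illustrative sketch of the second aspect above (combining evidence to prioritize crawling), the following Python code implements a best-first crawl frontier in the spirit of focused crawling [3]. It is not the method developed in the project. The class and function names, the linear weighting, the two scores (relevance and importance, each assumed to lie in [0, 1]), and the example.org URLs are all assumptions made for the illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    # heapq is a min-heap, so the negated score is stored to pop the best URL first.
    neg_score: float
    url: str = field(compare=False)


class CrawlFrontier:
    """Best-first crawl frontier: URLs are popped in decreasing score order."""

    def __init__(self, w_relevance=0.7, w_importance=0.3):
        # Hypothetical weights; the source does not say how evidence is combined.
        self.w_relevance = w_relevance
        self.w_importance = w_importance
        self._heap = []
        self._seen = set()

    def combined_score(self, relevance, importance):
        # Linear combination of two evidence scores, each assumed to lie in [0, 1].
        return self.w_relevance * relevance + self.w_importance * importance

    def add(self, url, relevance, importance):
        # Skip URLs that were already queued, so each URL is scheduled at most once.
        if url in self._seen:
            return False
        self._seen.add(url)
        score = self.combined_score(relevance, importance)
        heapq.heappush(self._heap, FrontierEntry(-score, url))
        return True

    def pop(self):
        # Return the highest-scoring URL, or None when the frontier is empty.
        if not self._heap:
            return None
        return heapq.heappop(self._heap).url

    def __len__(self):
        return len(self._heap)


def crawl_order(candidates, **weights):
    """Return the URLs of (url, relevance, importance) triples in crawl order."""
    frontier = CrawlFrontier(**weights)
    for url, relevance, importance in candidates:
        frontier.add(url, relevance, importance)
    order = []
    while len(frontier):
        order.append(frontier.pop())
    return order


if __name__ == "__main__":
    # Hypothetical candidate sources with made-up scores.
    candidates = [
        ("http://example.org/a", 0.2, 0.9),
        ("http://example.org/b", 0.9, 0.1),
        ("http://example.org/c", 0.5, 0.5),
    ]
    for url in crawl_order(candidates):
        print(url)
```

A real crawler would update these scores as pages are fetched and analyzed; the sketch only shows how several pieces of evidence could be folded into a single crawl priority.
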

Depending on the content to be archived (social networks, structured Web, deep Web, etc.), different solutions for assessing relevance, prioritizing the crawling, and extracting Web objects can be proposed.

References
----------

[1] Julien Masanes, editor. Web Archiving. Springer-Verlag, 2006.
[2] A. Dimulescu and J.-L. Dessalles. Understanding narrative interest: Some evidence on the role of unexpectedness. In CogSci, Amsterdam, NL, July 2009.
[3] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused crawling: A new approach to topic-specific Web resource discovery. Computer Networks, 31(11-16):1623-1640, 1999.
[4] Talel Abdessalem, Bogdan Cautis, and Nora Derouiche. ObjectRunner: Lightweight, targeted extraction and querying of structured Web data. PVLDB, 3(2):1585-1588, 2010.
[5] Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y. Halevy. Harnessing the deep Web: Present and future. In CIDR, 2009.

PhD student: Faheem Muhammad

Doctoral research project number: 2865
