Description
Submission date: 1 January 1900
Titre: Intelligent Content Acquisition in Web Archiving
Thesis advisor:
Pierre SENELLART (DIENS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
Background
----------
Since its beginnings in the 1990s, the set of hyperlinked resources known
as the World Wide Web has become an invaluable source of information. To
make this content easy to access, search engines such as Google or Bing
crawl the Web and index its pages. But crawling technologies also have an
important application in the field of Web archiving [1]: the storage of
(versioned) archives of the Web, to ensure continuous access to its
resources despite data deletion, to carry out temporal analyses of its
content, or to let users view a Web page or a Web site as it was at a
given date in the past.
This PhD topic is part of the ARCOMEM European project (2011-2013), which
involves 12 institutions across Europe and aims to leverage the wisdom of
the crowds for the intelligent preservation of Web content.
PhD Topic Description
---------------------
The purpose of this PhD is to develop intelligent and adaptive models,
methods, algorithms, and tools for making the content acquisition process
in Web archiving more effective and efficient. The objective is to
leverage existing work (in particular inside the ARCOMEM consortium) on
the extraction and analysis of events, entities, topics, opinions,
perceptions, etc., to select and prioritize sources to be crawled. This
will be combined with techniques that go beyond traditional page-level
crawling, allowing object-level, goal-driven crawling of the Web. The
focus is on unsupervised methods that can scale to the whole Web.
In particular, the following aspects will be covered:
- assessing the relevance [2], importance, and coverage of available
content with respect to a Web archiving task;
- combining several sources of evidence to select and prioritize what is
crawled [3] (a minimal prioritization sketch follows this list);
- accessing content at the level of objects inside Web pages [4] and
hidden behind deep Web forms [5] or Web 2.0 applications (a form-probing
sketch is given further below).
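As an illustration of the first two aspects, the following minimal sketch
implements a focused crawler in the spirit of [3]: the frontier is a
priority queue ordered by an estimated relevance score, so that pages
judged more relevant to the archiving topic are fetched first. The topic
keywords, the seed URL, and the bag-of-words scoring function are
illustrative assumptions, not components of the ARCOMEM tools.

    # Minimal focused-crawler sketch: the frontier is a priority queue
    # keyed by an estimated relevance score, so the most promising URLs
    # are fetched first. Topic keywords and the scoring function are
    # assumptions made for illustration only.
    import heapq
    import re
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    TOPIC = {"archive", "archiving", "preservation", "crawler", "web"}

    class LinkExtractor(HTMLParser):
        """Collects the href attributes of <a> tags in an HTML page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def relevance(text):
        """Crude relevance score: fraction of topic terms among all terms."""
        terms = re.findall(r"[a-z]+", text.lower())
        return sum(1 for t in terms if t in TOPIC) / len(terms) if terms else 0.0

    def focused_crawl(seeds, budget=20):
        # heapq is a min-heap, so scores are negated to pop the most
        # relevant URL first.
        frontier = [(-1.0, url) for url in seeds]
        heapq.heapify(frontier)
        seen = set(seeds)
        archived = []
        while frontier and len(archived) < budget:
            neg_score, url = heapq.heappop(frontier)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # unreachable pages are simply skipped in this sketch
            score = relevance(html)
            archived.append((url, score))
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    # Outlinks inherit the parent's score as a prior estimate
                    # of their own relevance, a common focused-crawling heuristic.
                    heapq.heappush(frontier, (-score, absolute))
        return archived

    if __name__ == "__main__":
        for url, score in focused_crawl(["https://example.org/"], budget=5):
            print(f"{score:.4f}  {url}")

In the project, the hand-written relevance() function would be replaced by
the entity, event, topic, and opinion analyses developed within ARCOMEM,
and a production crawler would additionally respect robots.txt and
politeness constraints.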
Depending on the content to be archived (social networks, structured Web,
deep Web, etc.), different solutions for assessing relevance, prioritizing
the crawl, and extracting Web objects can be proposed.
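As a sketch of the third aspect, the snippet below probes a deep-Web form
by submitting keyword queries to a hypothetical search endpoint that
accepts a single "q" parameter over GET; the endpoint URL and field name
are placeholders. Real deep-Web access, as surveyed in [5], additionally
requires discovering the forms, understanding their input fields, and
choosing suitable probe values.

    # Minimal deep-Web form-probing sketch. The endpoint and field name
    # are hypothetical; only the query-submission step is illustrated.
    import urllib.parse
    import urllib.request

    def probe_form(endpoint, field, keywords):
        """Submit one probe query per keyword and return the raw result pages."""
        pages = {}
        for kw in keywords:
            query = urllib.parse.urlencode({field: kw})
            url = f"{endpoint}?{query}"
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    pages[kw] = resp.read().decode("utf-8", errors="replace")
            except Exception:
                pages[kw] = None  # failed probes are recorded and skipped
        return pages

    if __name__ == "__main__":
        # Both the URL and the "q" field are assumptions for illustration.
        results = probe_form("https://example.org/search", "q",
                             ["archiving", "crawler"])
        for kw, page in results.items():
            print(kw, "->", "no response" if page is None else f"{len(page)} bytes")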
References
----------
[1] Julien Masanes, editor. Web Archiving. Springer-Verlag, 2006.
[2] A. Dimulescu and J.-L. Dessalles. Understanding narrative interest:
Some evidence on the role of unexpectedness. In CogSci, Amsterdam, NL,
July 2009.
[3] Soumen Chakrabarti, Martin van den Berg, and Byron Dom. Focused
crawling: A new approach to topic-specific Web resource discovery.
Computer Networks, 31(11-16):1623-1640, 1999.
[4] Talel Abdessalem, Bogdan Cautis, and Nora Derouiche. ObjectRunner:
Lightweight, targeted extraction and querying of structured Web data.
PVLDB, 3(2):1585-1588, 2010.
[5] Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Y.
Halevy. Harnessing the deep Web: Present and future. In CIDR, 2009.
Doctoral candidate: Faheem Muhammad