Doctoral research project number: 3803

Description

Submission date: 1 January 1900
Title: Scalable Similarity Computation in Social Networks
Thesis supervisor: Cédric DU MOUZA (CEDRIC)
Scientific domain: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract:

Laboratories: LIP6 laboratory (UPMC), CEDRIC laboratory (CNAM)
Teams: BD (LIP6) and ISID (CEDRIC)
Supervisors: Camelia Constantin (camelia.constantin@lip6.fr) and Cédric du Mouza (dumouza@cnam.fr)
Keywords: social network, graph computations, recommendation, similarity, scalability, dynamicity

Problem definition

Social-network-based systems such as Twitter and Facebook have become widely used communication media, yet their features remain basic: posts are forwarded only to the publisher's followers, and a user may perform keyword-based searches over recently published posts or consult recent posts on a detected hot topic. With the success of these systems and the growing number of users and posts, it has become difficult for a user to retrieve relevant information. The user may also miss information hidden among a large set of non-informative posts. There is therefore a crucial need for the user to select relevant posts from the large set he or she receives and to be recommended publishers whose posts he or she is genuinely interested in reading. Only a subset of relevant posts should be recommended to the user, based on several criteria. Users are often influenced by the opinions of other users in the network. This influence can be expressed in terms of user similarity scores, which take into account the friendship relationships in the social graph, the topics shared between users, post forwarding, and so on. One widely used class of similarity measures is based on path-ensemble computations, such as personalized PageRank, hitting time, SimRank or the Katz score. They capture the intuition that the more paths there are between two nodes, and the shorter these paths are, the stronger the relationship is (a toy illustration of one such score is sketched after this abstract). Computing the similarity between two given nodes at query time is time-consuming and can become too expensive when the graph is too large to be stored in memory. Moreover, the construction of the recommendation model and the recommendation process must cope with the dynamicity of the underlying information: recommendation scores change as new user opinions are gathered, but also as the structure of the social graph evolves.

Objectives

We want to study the scalability of graph-based recommendation algorithms for large social networks and to propose methods for on-line recommendation. An important part of the work will be dedicated to optimizing graph-based similarity scores, which involve the computation of paths between graph nodes. Some existing solutions split the graph into clusters of nodes and organize the computation so that random walks rarely cross cluster boundaries and cause page faults. To reduce computation time, a simple solution would be to precompute all pairwise similarity distances offline and store them on disk, so that answering a similarity query online requires only a disk lookup. However, the space requirement is quadratic in the number of nodes, which is infeasible for current large social graphs. Existing work on distance labeling in graphs has been shown to estimate shortest-path distances rapidly in large graphs. Other solutions, based on landmarks and sketches, which can be viewed as simplified distance-labeling schemes, have also been proposed to estimate shortest paths efficiently.
The general idea of these landmark- and sketch-based solutions is to store at each node some information, typically a set of vertices and the distance from this node to each of these vertices, so that distance queries can be answered using only this information (a small illustration of this idea is also sketched below). Efficient indexing schemes will make it possible to optimize the computation cost and to access the huge amount of data efficiently.
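To make the landmark idea above concrete, here is a minimal sketch: it precomputes exact BFS distances from a few landmark nodes (the offline labeling step) and answers a distance query with the triangle-inequality upper bound min over landmarks l of d(u,l) + d(l,v). The toy graph, the random landmark selection, and the function names (bfs_distances, build_labels, estimate_distance) are assumptions made for this illustration, not part of the project description; real systems choose landmarks far more carefully.

```python
# Illustrative sketch only: landmark-based estimation of shortest-path distances,
# in the spirit of the distance-labeling / sketch approaches mentioned above.
# Landmark selection is random here purely for brevity.
import random
from collections import deque

def bfs_distances(adj, source):
    """Exact BFS distances from one landmark to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        w = queue.popleft()
        for nb in adj[w]:
            if nb not in dist:
                dist[nb] = dist[w] + 1
                queue.append(nb)
    return dist

def build_labels(adj, num_landmarks=2, seed=0):
    """Offline step: each node ends up labeled with its distance to every landmark."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(adj), num_landmarks)
    return [bfs_distances(adj, l) for l in landmarks]

def estimate_distance(labels, u, v):
    """Online step: upper bound d(u, v) <= min over landmarks l of d(u, l) + d(l, v)."""
    best = float('inf')
    for dist in labels:
        if u in dist and v in dist:
            best = min(best, dist[u] + dist[v])
    return best

# Toy undirected social graph (adjacency lists), chosen only for this example.
graph = {
    'a': ['b', 'd'],
    'b': ['a', 'c'],
    'c': ['b', 'd'],
    'd': ['a', 'c', 'e'],
    'e': ['d'],
}
labels = build_labels(graph)
print(estimate_distance(labels, 'a', 'e'))   # upper bound on the true distance, which is 2
```

The trade-off this illustrates is exactly the one discussed above: storage is linear in the number of nodes times the number of landmarks instead of quadratic in the number of nodes, at the price of answering queries with an approximation rather than the exact distance.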
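The path-ensemble similarity measures mentioned in the problem definition can be illustrated in the same spirit. The sketch below computes a truncated Katz score on a toy graph by counting walks of increasing length and damping them geometrically; the graph, the damping factor beta, the truncation depth and the function name katz_similarity are all assumptions made for this example, not choices taken from the project description.

```python
# Illustrative sketch only: truncated Katz similarity,
# score(u, v) = sum over l = 1..max_len of beta**l * (#walks of length l from u to v).
def katz_similarity(adj, u, v, beta=0.1, max_len=4):
    counts = {w: 0.0 for w in adj}     # counts[w] = number of walks of the current length from u to w
    counts[u] = 1.0
    score, weight = 0.0, 1.0
    for _ in range(max_len):
        weight *= beta
        nxt = {w: 0.0 for w in adj}
        for w, c in counts.items():
            if c:
                for nb in adj[w]:
                    nxt[nb] += c       # extend every walk ending at w by one edge
        counts = nxt
        score += weight * counts[v]    # contribution of walks of this length ending at v
    return score

# Same toy graph as above: 'a' and 'c' are linked by more and shorter walks
# than 'a' and 'e', so the Katz score ranks 'c' above 'e' as seen from 'a'.
graph = {
    'a': ['b', 'd'],
    'b': ['a', 'c'],
    'c': ['b', 'd'],
    'd': ['a', 'c', 'e'],
    'e': ['d'],
}
print(katz_similarity(graph, 'a', 'c'))
print(katz_similarity(graph, 'a', 'e'))
```

Even this toy version makes the scalability issue visible: each additional walk length touches every edge of the graph, which is why computing such scores exactly at query time on a web-scale social graph is the bottleneck the project targets.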

PhD candidate: Li Yifan