Doctoral research project number: 4576


Submission date: 1 January 1900
Title: Bringing transparency to personalized services through statistical inference
Thesis supervisor: Davide BALZAROTTI (Eurecom)
Thesis supervisor: Patrick LOISEAU (LIG)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: Personalized services are online services that use information about their users to offer each user a service better adapted to her. With the proliferation of personal data on the Internet, personalized services, including for instance all services that offer recommendations, have become omnipresent in daily life. Although this data-based personalization has increased the utility of services for both users and service providers, it has also raised privacy concerns that have become increasingly serious in recent years.
One personalized service for which this issue is particularly acute is targeted advertising. Advertising is the main source of revenue for many free web services such as Facebook and Google. The ad ecosystem is complex and can involve many actors; here we abstract away this complexity and refer to the whole chain of organizations responsible for sending an ad (e.g., companies that want to advertise, data brokers, advertising platforms) as the ad engine. The dominant advertising model today is pay-per-click, which has led to a growing amount of targeted advertising aimed at increasing the likelihood that a user clicks on an ad. Targeted advertising has increased advertising revenues significantly. However, it has also raised growing concerns among users, who often feel that it invades their private sphere. In particular, users often wonder “what data do advertisers have about me?” or “why am I being shown this ad?”. In a nutshell, users’ concerns stem mainly from the lack of transparency of current targeted advertising systems.
The main objective of this thesis is to increase the transparency of targeted advertising by providing users with tools and methods to understand why they are targeted with a particular ad, to infer what information the ad engines may have about them, and ultimately to control that information.
Thesis description: Athanasios Andreou (EURECOM); supervisor: Patrick Loiseau (EURECOM)
Concretely, we propose to build a browser plugin that collects the ads shown to a user and provides her with analytics about these ads and tools to control them. The browser plugin can either give information about a particular ad, such as “you are being shown this ad because the ad engine likely thinks that you are a student”, or give longer-term analytics, such as “given the ads you have been shown in the last 3 months, the ad engine likely thinks that you earn less than $50k per year”. One of the main challenges in building such a tool is to infer, from the ads received, the information that the ad engine has about a user. To explain our approach, we abstract the system into three components: the information the ad engine collects about a user, either online from tracking or offline from data brokers (the inputs); the ad engine, which processes the inputs to place users in certain marketing categories (the black box); and the ads sent to the user (the outputs). In this thesis, we propose to observe only the outputs and to infer the categories in which the ad engine placed the user, regardless of which particular input caused that placement. To do so, we will collect the ads users receive, group together all the users who received the same ad, and look at the most common demographics and interests of the users in each group. Section B.1.b. details the methods that we propose to develop for this statistical inference task. The main novelty of our technique is that it relies only on the outputs, i.e., the ads observed by users, and not on any input data the users may have explicitly given. This makes our approach much more realistic.
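The grouping step described above (collect ads, group the users who received the same ad, look at the most common attributes in each group) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the inference method of Section B.1.b.: the function names, the `(user_id, ad_id)` impression format, the profile dictionaries and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def group_users_by_ad(impressions):
    """Map each ad to the set of users who were shown it.

    `impressions` is an iterable of (user_id, ad_id) pairs (hypothetical format).
    """
    users_by_ad = defaultdict(set)
    for user_id, ad_id in impressions:
        users_by_ad[ad_id].add(user_id)
    return dict(users_by_ad)


def most_common_attributes(users_by_ad, profiles, top_k=1):
    """For each ad, list the top_k (attribute, value, share) triples of its audience.

    `profiles` maps user_id -> {attribute: value}; `share` is the fraction of
    the ad's audience that has that attribute value.
    """
    summary = {}
    for ad_id, users in users_by_ad.items():
        counts = Counter()
        for user_id in users:
            for attribute, value in profiles.get(user_id, {}).items():
                counts[(attribute, value)] += 1
        summary[ad_id] = [
            (attribute, value, count / len(users))
            for (attribute, value), count in counts.most_common(top_k)
        ]
    return summary


# Hypothetical sample data: three users, two ads.
impressions = [
    ("u1", "ad_textbooks"),
    ("u2", "ad_textbooks"),
    ("u3", "ad_textbooks"),
    ("u3", "ad_cars"),
    ("u1", "ad_textbooks"),  # repeated impressions count once per user
]
profiles = {
    "u1": {"occupation": "student"},
    "u2": {"occupation": "student"},
    "u3": {"occupation": "engineer"},
}

summary = most_common_attributes(group_users_by_ad(impressions), profiles)
```

Here `summary["ad_textbooks"]` reports that two of the three users who saw that ad are students. Raw frequencies like these ignore how common an attribute is in the overall population, so a practical method would presumably need to compare each group against a baseline before drawing a conclusion.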
Then, we propose to control the information that services have about a user by adding noise rather than by trying to directly block information leakage; this, too, is a much more realistic approach.
2. History and related work
Previous work has made a number of contributions, either by discovering problems [2] or by proposing methods to bring more transparency to the ad ecosystem [1, 2, 3, 4]. We focus on the studies that are closest to our proposal and refer the reader to [5] for an overview. Two studies [1, 4] proposed techniques to detect whether an ad is contextual, re-targeted or behavioral. While this is an important first step towards transparency, these studies did not take the next step of detecting why the ads are being targeted. In this direction, two studies proposed techniques to see how the activities of a user influence the ads she receives [2, 3]. At a high level, these approaches monitor the inputs of users (e.g., the emails users receive and send, the videos users watch on YouTube, the sites users visit) and propose methods to estimate the likelihood that a given ad was shown because of a given input. Thus, these studies look at the inputs and outpu

Doctoral student: Andreou Athanasios