Grossglauser, Matthias
Learning to Search Efficiently Using Comparisons
Chumbalov, Daniyar, Maystre, Lucas, Grossglauser, Matthias
We consider the problem of searching in a set of items by using pairwise comparisons. We aim to locate a target item $t$ by asking an oracle questions of the form "Which item from the pair $(i,j)$ is more similar to $t$?". We assume a blind setting, where no item features are available to guide the search process; only the oracle sees the features in order to generate an answer. Previous approaches for this problem either assume noiseless answers, or they scale poorly in the number of items, both of which preclude practical applications. In this paper, we present a new scalable learning framework called learn2search that performs efficient comparison-based search on a set of items despite the presence of noise in the answers. Items live in a space of latent features, and we posit a probabilistic model for the oracle comparing two items $i$ and $j$ with respect to a target $t$. Our algorithm maintains its own representation of the space of items, which it learns incrementally based on past searches. We evaluate the performance of learn2search on both synthetic and real-world data, and show that it learns to search more and more efficiently, over time matching the performance of a scheme with access to the item features.
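The abstract does not spell out the oracle model; as a minimal sketch (not the paper's exact model), one can imagine items embedded in Euclidean space and an oracle whose error rate shrinks as the distance gap between the two candidates grows. The function name `oracle_answer` and the `scale` parameter are illustrative assumptions.

```python
import numpy as np

def oracle_answer(x_i, x_j, x_t, rng, scale=1.0):
    """Sample a noisy answer to "which of i, j is more similar to t?".

    Hypothetical oracle: the probability of answering "i" grows with how
    much closer item i is to the target t than item j, via a logistic link.
    """
    d_i = np.linalg.norm(x_i - x_t)
    d_j = np.linalg.norm(x_j - x_t)
    p_i = 1.0 / (1.0 + np.exp(-(d_j - d_i) / scale))
    return "i" if rng.random() < p_i else "j"
```

When the two items are equidistant from the target, the answer is a coin flip; a large gap makes the oracle nearly deterministic.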
Linear-Time Inference for Pairwise Comparisons with Gaussian-Process Dynamics
Maystre, Lucas, Kristof, Victor, Grossglauser, Matthias
In many competitive sports and games (such as tennis, basketball, chess and electronic sports), the most useful definition of a competitor's skill is the propensity of that competitor to win against an opponent. It is often difficult to measure this skill explicitly: take basketball for example, a team's skill depends on the abilities of its players in terms of shooting accuracy, physical fitness, mental preparation, but also on the team's cohesion and coordination, on its strategy, on the enthusiasm of its fans, and a number of other intangible factors. However, it is easy to observe this skill implicitly through the outcomes of matches. In this setting, probabilistic models of pairwise-comparison outcomes provide an elegant and effective approach to quantifying skill and to predicting future match outcomes given past data. These models, pioneered by Zermelo [1928] in the context of chess (and by Thurstone [1927] in the context of psychophysics), have been studied for almost a century. They posit that each competitor $i$ (i.e., a team or player) is characterized by a latent score $s_i \in \mathbb{R}$ and that the outcome probabilities of a match between $i$ and $j$ are a function of the scores $s_i$ and $s_j$.
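One classic instantiation of such a model (the Bradley-Terry link, shown here for illustration rather than as the paper's exact formulation) makes the win probability a logistic function of the score difference:

```python
import math

def win_prob(s_i, s_j):
    """Bradley-Terry-style link: probability that competitor i beats j,
    as a logistic function of the latent score difference s_i - s_j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))
```

Equal scores yield a 50/50 outcome, and the two directions of a match sum to one, as any valid outcome model must.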
On the Performance of a Canonical Labeling for Matching Correlated Erd\H{o}s-R\'enyi Graphs
Dai, Osman Emre, Cullina, Daniel, Kiyavash, Negar, Grossglauser, Matthias
Graph matching in two correlated random graphs refers to the task of identifying the correspondence between vertex sets of the graphs. Recent results have characterized the exact information-theoretic threshold for graph matching in correlated Erd\H{o}s-R\'enyi graphs. However, very little is known about the existence of efficient algorithms to achieve graph matching without seeds. In this work we identify a region in which a straightforward $O(n^2\log n)$-time canonical labeling algorithm, initially introduced in the context of graph isomorphism, succeeds in matching correlated Erd\H{o}s-R\'enyi graphs. The algorithm has two steps. In the first step, all vertices are labeled by their degrees, and a trivial minimum-distance matching (i.e., simply sorting vertices according to their degrees) matches a fixed number of highest-degree vertices in the two graphs. Once this subset of vertices is identified, the remaining vertices are matched using a matching algorithm for bipartite graphs.
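The first step can be sketched as follows. This is an illustration only: ties between equal-degree vertices are broken arbitrarily here, and the second (bipartite-matching) step is omitted.

```python
def degree_match(adj1, adj2, k):
    """First step of the canonical labeling, sketched: label vertices by
    their degrees and match the k highest-degree vertices of the two
    graphs by sorted order.

    adj1, adj2: dicts mapping each vertex to its set of neighbors.
    Returns a dict mapping k vertices of graph 1 to vertices of graph 2.
    """
    by_degree_1 = sorted(adj1, key=lambda v: (len(adj1[v]), v), reverse=True)
    by_degree_2 = sorted(adj2, key=lambda v: (len(adj2[v]), v), reverse=True)
    return dict(zip(by_degree_1[:k], by_degree_2[:k]))
```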
Can Who-Edits-What Predict Edit Survival?
Yardım, Ali Batuhan, Kristof, Victor, Maystre, Lucas, Grossglauser, Matthias
The Internet has enabled the emergence of massive online collaborative projects. As the number of contributors to these projects grows, it becomes increasingly important to understand and predict whether the edits that users make will eventually impact the project positively. Existing solutions either rely on a user reputation system or consist of a highly-specialized predictor tailored to a specific peer-production system. In this work, we explore a different point in the solution space, which does not involve any content-based feature of the edits. To this end, we formulate a statistical model of edit outcomes. We view each edit as a game between the editor and the component of the project. We posit that the probability of a positive outcome is a function of the editor's skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. Then, we consider Wikipedia and the Linux kernel, two examples of large-scale collaborative projects, and we seek to understand whether this simple model can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. Furthermore, inspecting the model parameters enables us to discover interesting structure in the data. Our method is simple to implement, computationally inexpensive, and it produces interpretable results; as such, we believe that it is a valuable tool to analyze collaborative systems.
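The model described above can be sketched as follows; the function and argument names are illustrative, not taken from the paper. The probability that an edit survives is a logistic function of the editor's skill, the component's difficulty, and a low-rank user-component interaction term.

```python
import math

def survival_prob(skill, difficulty, user_vec, comp_vec):
    """Sketch of an edit-outcome model: P(edit survives) as a logistic
    function of editor skill, component difficulty, and the inner
    product of user and component embedding vectors (the interaction
    term)."""
    interaction = sum(u * v for u, v in zip(user_vec, comp_vec))
    return 1.0 / (1.0 + math.exp(-(skill - difficulty + interaction)))
```

Note that only who edits what, and whether the edit survives, is needed to fit such a model; no content-based features appear.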
ChoiceRank: Identifying Preferences from Node Traffic in Networks
Maystre, Lucas, Grossglauser, Matthias
Understanding how users navigate in a network is of high interest in many applications. We consider a setting where only aggregate node-level traffic is observed and tackle the task of learning edge transition probabilities. We cast it as a preference learning problem, and we study a model where choices follow Luce's axiom. In this case, the $O(n)$ marginal counts of node visits are a sufficient statistic for the $O(n^2)$ transition probabilities. We show how to make the inference problem well-posed regardless of the network's structure, and we present ChoiceRank, an iterative algorithm that scales to networks that contain billions of nodes and edges. We apply the model to two clickstream datasets and show that it successfully recovers the transition probabilities using only the network structure and marginal (node-level) traffic data. Finally, we also consider an application to mobility networks and apply the model to one year of rides on New York City's bicycle-sharing system.
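Under Luce's axiom, the probability of transitioning from a node to one of its neighbors depends only on the neighbors' latent strengths. A minimal sketch of this parametrization (names are illustrative):

```python
def transition_probs(neighbors, strength):
    """Luce-choice parametrization of edge transition probabilities:
    from node i, neighbor j is chosen with probability proportional to
    its latent strength.

    neighbors: dict mapping each node to a list of its out-neighbors.
    strength:  dict mapping each node to a positive latent strength.
    """
    probs = {}
    for i, nbrs in neighbors.items():
        z = sum(strength[j] for j in nbrs)
        probs[i] = {j: strength[j] / z for j in nbrs}
    return probs
```

This is why $O(n)$ strengths (and hence $O(n)$ marginal visit counts) suffice to determine all $O(n^2)$ transition probabilities.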
Just Sort It! A Simple and Effective Approach to Active Preference Learning
Maystre, Lucas, Grossglauser, Matthias
We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.
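The strategy can be sketched with an ordinary Quicksort driven by a noisy comparison oracle; the Bradley-Terry noise model below is an assumption for illustration, and repeating the sort yields the active learning strategy described above.

```python
import math
import random

def noisy_less(i, j, scores, rng):
    """Comparison oracle under a Bradley-Terry-style noise model
    (illustrative): the item with the higher latent score tends to win,
    and the error rate grows as the scores get closer."""
    p = 1.0 / (1.0 + math.exp(-(scores[j] - scores[i])))
    return rng.random() < p  # True means "i ranks below j"

def quicksort(items, less):
    """Plain Quicksort; with a noisy `less`, each pass yields an
    approximate ranking that improves when the sort is repeated."""
    if len(items) <= 1:
        return list(items)
    pivot, rest = items[0], items[1:]
    lo = [x for x in rest if less(x, pivot)]
    hi = [x for x in rest if not less(x, pivot)]
    return quicksort(lo, less) + [pivot] + quicksort(hi, less)
```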
Fast and Accurate Inference of Plackett–Luce Models
Maystre, Lucas, Grossglauser, Matthias
We show that the maximum-likelihood (ML) estimate of models derived from Luce's choice axiom (e.g., the Plackett-Luce model) can be expressed as the stationary distribution of a Markov chain. This conveys insight into several recently proposed spectral inference algorithms. We take advantage of this perspective and formulate a new spectral algorithm that is significantly more accurate than previous ones for the Plackett-Luce model. With a simple adaptation, this algorithm can be used iteratively, producing a sequence of estimates that converges to the ML estimate. The ML version runs faster than competing approaches on a benchmark of five datasets. Our algorithms are easy to implement, making them relevant for practitioners at large.
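The Markov-chain view can be illustrated with a simplified spectral estimator for pairwise data: a sketch in the spirit of the approach, not the paper's exact algorithm. Mass flows from losers to winners, and scores are read off the chain's stationary distribution.

```python
import numpy as np

def spectral_estimate(wins, n, n_iter=200):
    """Simplified spectral estimator (illustrative): build a Markov
    chain whose transitions flow from losers to winners, then return
    its stationary distribution as the score estimate.

    wins: dict mapping (i, j) to the number of times i beat j.
    """
    rate = np.zeros((n, n))
    for (i, j), count in wins.items():
        rate[j, i] += count  # mass flows from loser j to winner i
    row_sums = rate.sum(axis=1)
    scale = row_sums.max() if row_sums.max() > 0 else 1.0
    P = rate / scale
    # Self-loops absorb the leftover probability in each row.
    P[np.arange(n), np.arange(n)] = 1.0 - P.sum(axis=1)
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):  # power iteration to the stationary distribution
        pi = pi @ P
    return pi / pi.sum()
```

For two items where item 0 wins 9 matches out of 10, the stationary distribution assigns item 0 a score of 0.9, matching the empirical win rate.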