Collaborating Authors

MIT CSAIL's machine learning algorithm helps predict patterns in large data streams


Ever heard of the "Britney Spears problem"? Contrary to what it sounds like, it's got nothing to do with the dalliances of the rich and famous. Rather, it's a computing puzzle related to data tracking: Precisely tailoring a data-rich service, like a search engine or fiber internet connection, to individual users hypothetically requires tracking every packet sent to and from the service provider, which needless to say isn't practical. To get around this, most companies leverage algorithms that make guesses about the frequency of data exchanged by hashing it (i.e., divvying it up into pieces). But this necessarily sacrifices nuance -- telling patterns that emerge naturally in large data volumes fly under the radar.

Finding Heavily-Weighted Features in Data Streams Machine Learning

We introduce a new sub-linear space data structure---the Weight-Median Sketch---that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.

Sketching Techniques for Collaborative Filtering

AAAI Conferences

Recommender systems attempt to highlight items that a target user is likely to find interesting. A common technique is to use collaborative filtering (CF), where multiple users share information so as to provide each with effective recommendations. A key aspect of CF systems is finding users whose tastes accurately reflect the tastes of some target user. Typically, the system looks for other agents who have had experience with many of the items the target user has examined, and whose classification of these items has a strong correlation with the classifications of the target user. Since the universe of items may be enormous and huge data sets are involved, sophisticated methods must be used to quickly locate appropriate other agents. We present a method for quickly determining the proportional intersection between the items that each of two users has examined, by sending and maintaining extremely concise “sketches” of the list of items. These sketches enable the approximation of the proportional intersection within a distance of \epsilon, with a high probability of 1-\delta. Our sketching techniques are based on random minwise independent hash functions, and use very little space and time, so they are well-suited for use in large-scale collaborative filtering systems.

Technical Perspective: The True Cost of Popularity

Communications of the ACM

The notion of popularity is prevalent within society. We have made charts of the most popular music and movies since the early part of the 20th century. Elections and referenda are primarily decided by who gets the most votes. Within computer systems, we monitor followers and endorsements in social networks, and track views, hits, and connection attempts in other networks. Computationally, the problem of determining which items are popular appears at first a straightforward one.

Practical Locally Private Heavy Hitters

Neural Information Processing Systems

We present new practical local differentially private heavy hitters algorithms achieving optimal or near-optimal worst-case error -- TreeHist and Bitstogram. In both algorithms, server running time is $\tilde O(n)$ and user running time is $\tilde O(1)$, hence improving on the prior state-of-the-art result of Bassily and Smith [STOC 2015] requiring $\tilde O(n^{5/2})$ server time and $\tilde O(n^{3/2})$ user time. With a typically large number of participants in local algorithms ($n$ in the millions), this reduction in time complexity, in particular at the user side, is crucial for the use of such algorithms in practice. We implemented Algorithm TreeHist to verify our theoretical analysis and compared its performance with the performance of Google's RAPPOR code.