Ever heard of the "Britney Spears problem"? Contrary to what it sounds like, it's got nothing to do with the dalliances of the rich and famous. Rather, it's a computing puzzle related to data tracking: Precisely tailoring a data-rich service, like a search engine or fiber internet connection, to individual users hypothetically requires tracking every packet sent to and from the service provider, which needless to say isn't practical. To get around this, most companies leverage algorithms that make guesses about the frequency of data exchanged by hashing it (i.e., divvying it up into pieces). But this necessarily sacrifices nuance -- telling patterns that emerge naturally in large data volumes fly under the radar.
We introduce a new sub-linear space data structure---the Weight-Median Sketch---that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.
Recommender systems attempt to highlight items that a target user is likely to find interesting. A common technique is to use collaborative filtering (CF), where multiple users share information so as to provide each with effective recommendations. A key aspect of CF systems is finding users whose tastes accurately reflect the tastes of some target user. Typically, the system looks for other agents who have had experience with many of the items the target user has examined, and whose classification of these items has a strong correlation with the classifications of the target user. Since the universe of items may be enormous and huge data sets are involved, sophisticated methods must be used to quickly locate appropriate other agents. We present a method for quickly determining the proportional intersection between the items that each of two users has examined, by sending and maintaining extremely concise “sketches” of the list of items. These sketches enable the approximation of the proportional intersection within a distance of \epsilon, with a high probability of 1-\delta. Our sketching techniques are based on random minwise independent hash functions, and use very little space and time, so they are well-suited for use in large-scale collaborative filtering systems.
The notion of popularity is prevalent within society. We have made charts of the most popular music and movies since the early part of the 20th century. Elections and referenda are primarily decided by who gets the most votes. Within computer systems, we monitor followers and endorsements in social networks, and track views, hits, and connection attempts in other networks. Computationally, the problem of determining which items are popular appears at first a straightforward one.
A novel system developed by MIT researchers automatically "learns" how to schedule data-processing operations across thousands of servers--a task traditionally reserved for imprecise, human-designed algorithms. Doing so could help today's power-hungry data centers run far more efficiently. Data centers can contain tens of thousands of servers, which constantly run data-processing tasks from developers and users. Cluster scheduling algorithms allocate the incoming tasks across the servers, in real-time, to efficiently utilize all available computing resources and get jobs done fast. Traditionally, however, humans fine-tune those scheduling algorithms, based on some basic guidelines ("policies") and various tradeoffs.