Ever heard of the "Britney Spears problem"? Contrary to what it sounds like, it's got nothing to do with the dalliances of the rich and famous. Rather, it's a computing puzzle related to data tracking: Precisely tailoring a data-rich service, like a search engine or fiber internet connection, to individual users hypothetically requires tracking every packet sent to and from the service provider, which needless to say isn't practical. To get around this, most companies leverage algorithms that make guesses about the frequency of data exchanged by hashing it (i.e., divvying it up into pieces). But this necessarily sacrifices nuance -- telling patterns that emerge naturally in large data volumes fly under the radar.
We introduce a new sub-linear space data structure---the Weight-Median Sketch---that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory trade-offs over alternatives, including count-based sketches and feature hashing.
Recommender systems attempt to highlight items that a target user is likely to find interesting. A common technique is to use collaborative filtering (CF), where multiple users share information so as to provide each with effective recommendations. A key aspect of CF systems is finding users whose tastes accurately reflect the tastes of some target user. Typically, the system looks for other agents who have had experience with many of the items the target user has examined, and whose classification of these items has a strong correlation with the classifications of the target user. Since the universe of items may be enormous and huge data sets are involved, sophisticated methods must be used to quickly locate appropriate other agents. We present a method for quickly determining the proportional intersection between the items that each of two users has examined, by sending and maintaining extremely concise “sketches” of the list of items. These sketches enable the approximation of the proportional intersection within a distance of \epsilon, with a high probability of 1-\delta. Our sketching techniques are based on random minwise independent hash functions, and use very little space and time, so they are well-suited for use in large-scale collaborative filtering systems.
Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating throughout modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, helps with the computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features while using only O(log^2(p)) working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.
Learning to code involves recognizing how to structure a program, and how to fill in every last detail correctly. No wonder it can be so frustrating. A new program-writing AI, SketchAdapt, offers a way out. Trained on tens of thousands of program examples, SketchAdapt learns how to compose short, high-level programs, while letting a second set of algorithms find the right sub-programs to fill in the details. Unlike similar approaches for automated program-writing, SketchAdapt knows when to switch from statistical pattern-matching to a less efficient, but more versatile, symbolic reasoning mode to fill in the gaps.