Privacy-preserving Prediction Machine Learning

Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead, users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all differentially private prediction algorithms.
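
The baseline described in the abstract (train several models on disjoint subsets, then aggregate their predictions privately) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the 1-D threshold learner, the function names, the choice of the exponential mechanism as the "standard private aggregation technique", and the parameter values. The privacy argument is the usual one: changing one training example changes at most one of the k models, so each vote count changes by at most 1.

```python
import math
import random

def train_threshold(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x >= t."""
    positives = [x for x, y in sample if y == 1]
    # Use the smallest positive example; with no positives, never predict 1.
    return min(positives) if positives else math.inf

def train_ensemble(data, k):
    """Split the data into k disjoint chunks and train one model per chunk."""
    return [train_threshold(data[i::k]) for i in range(k)]

def private_predict(thresholds, x, epsilon, rng=random):
    """Answer a single query via the exponential mechanism over the votes."""
    votes_for_1 = sum(1 for t in thresholds if x >= t)
    votes_for_0 = len(thresholds) - votes_for_1
    # Each label is sampled with probability proportional to
    # exp(epsilon * votes / 2); vote counts have sensitivity 1.
    z = epsilon * (votes_for_1 - votes_for_0) / 2.0
    if z >= 0:
        p_one = 1.0 / (1.0 + math.exp(-z))
    else:
        p_one = math.exp(z) / (1.0 + math.exp(z))
    return 1 if rng.random() < p_one else 0

# Hypothetical toy data: the true threshold is 0.5.
data = [(i / 1000, int(i >= 500)) for i in range(1000)]
models = train_ensemble(data, k=20)
print(private_predict(models, 0.9, epsilon=1.0, rng=random.Random(0)))
```

This sketch protects one prediction at a time; answering many queries would consume additional privacy budget, which is part of the overhead the abstract refers to.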

Prediction Algorithms in One Picture


This infographic was produced by Dataiku. Click here to find the original image, along with the article describing the various concepts.

Crime Prediction Algorithms Aren't Very Good At Predicting Crimes

International Business Times

Some courts in the U.S., in states from California to New Jersey, use crime-predicting algorithms to estimate whether a defendant is likely to commit another crime. While the software helps judges decide who gets bail, who goes to jail and who can walk away free, the technology appears to be unreliable and opens the door to a less fair justice system.



Prediction-Defined Aggregates (a.k.a. Adaptive Cache 2.0)

Based on insights from a number of production deployments, we have developed a series of algorithms that anticipate certain query patterns a priori and create aggregates to support them before a single end-user query is executed against an AtScale virtual cube. At the core of Prediction-Defined Aggregates is a statistics system that continuously evaluates statistics (row counts, attribute cardinality, join quality) of the underlying data sets. Once statistics are available, they are fed into a series of algorithms that predict the potential value of creating aggregate tables to satisfy anticipated query patterns.
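
To make the idea of scoring candidate aggregates from table statistics concrete, here is a small hypothetical sketch. It is not AtScale's algorithm, which the text does not describe. The class name, function names, threshold, and statistics below are all invented for illustration. The only fact it relies on is that an aggregate grouped by a set of attributes has at most as many rows as the product of those attributes' cardinalities, and never more rows than the underlying table.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class TableStats:
    """Hypothetical statistics for one fact table."""
    row_count: int
    cardinality: dict  # attribute name -> number of distinct values

def estimated_aggregate_rows(stats, group_by):
    """Upper bound on the rows of an aggregate grouped by `group_by`."""
    combinations = prod(stats.cardinality[attr] for attr in group_by)
    return min(stats.row_count, combinations)

def reduction_factor(stats, group_by):
    """How many times smaller the aggregate is than the fact table."""
    return stats.row_count / estimated_aggregate_rows(stats, group_by)

def rank_candidates(stats, candidates, min_reduction=10.0):
    """Keep candidates that shrink the data enough, best first."""
    scored = [(reduction_factor(stats, g), g) for g in candidates]
    return sorted((s for s in scored if s[0] >= min_reduction), reverse=True)

# Invented example statistics.
stats = TableStats(
    row_count=1_000_000,
    cardinality={"region": 10, "product": 500, "customer": 200_000},
)
candidates = [("region",), ("region", "product"), ("customer", "product")]
for score, group_by in rank_candidates(stats, candidates):
    print(group_by, score)
```

A real system would presumably also weigh join quality and expected query frequency, which this sketch leaves out.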