Dasgupta, Sanjoy
Framework for Evaluating Faithfulness of Local Explanations
Dasgupta, Sanjoy, Frost, Nave, Moshkovitz, Michal
Machine learning is an integral part of many human-facing computer systems and is increasingly a key component of decisions that have profound effects on people's lives. There are many dangers that come with this. For instance, statistical models can easily be error-prone in regions of the input space that are not well-reflected in training data but that end up arising in practice. Or they can be excessively complicated in ways that impact their generalization ability. Or they might implicitly make their decisions based on criteria that would not be considered acceptable by society. For all these reasons, and many others, it is crucial to have models that are understandable or can explain their predictions to humans [19]. Explanations of a classification system can take many forms, but should accurately reflect the classifier's inner workings.
A Non-Parametric Test to Detect Data-Copying in Generative Models
Meehan, Casey, Chaudhuri, Kamalika, Dasgupta, Sanjoy
Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call data-copying -- where the generative model memorizes and outputs training samples or small variations thereof. We provide a three-sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets. For code & examples, visit https://github.com/casey-meehan/data-copying
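As a rough illustration of the idea (not the paper's exact statistic), the sketch below compares nearest-neighbor distances to the training set for a held-out sample versus a generated sample using a one-sided rank test; the function name and the use of a plain Mann-Whitney test over the whole space are our simplifications.

```python
# A minimal sketch: if generated samples sit systematically closer to their
# nearest training points than genuinely new data does, that is evidence of
# data-copying.
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.neighbors import NearestNeighbors

def data_copying_score(train, held_out, generated):
    """Small p-value => generated data hugs the training set suspiciously closely."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    d_held, _ = nn.kneighbors(held_out)   # distances from real held-out points to train
    d_gen, _ = nn.kneighbors(generated)   # distances from generated points to train
    # One-sided test: are generated-to-train distances stochastically smaller?
    stat, p_value = mannwhitneyu(d_gen.ravel(), d_held.ravel(), alternative="less")
    return stat, p_value

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2))
held_out = rng.normal(size=(200, 2))
copied = train[rng.integers(0, 500, size=200)] + 0.01 * rng.normal(size=(200, 2))
print(data_copying_score(train, held_out, copied))  # tiny p-value: copying detected
```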
Explainable $k$-Means and $k$-Medians Clustering
Dasgupta, Sanjoy, Frost, Nave, Moshkovitz, Michal, Rashtchian, Cyrus
Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
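As a concrete illustration of the $k = 2$ case described above, the sketch below finds the cheapest single axis-aligned threshold cut under the $k$-means objective by brute force; the function names and the exhaustive search are ours, not the paper's algorithm.

```python
# A minimal sketch of explainable 2-means: partition the data with one
# axis-aligned threshold cut, chosen by exhaustive search over coordinates
# and candidate thresholds.
import numpy as np

def two_means_cost(X, mask):
    """Sum of squared distances to the mean on each side of the cut."""
    cost = 0.0
    for side in (X[mask], X[~mask]):
        if len(side):
            cost += ((side - side.mean(axis=0)) ** 2).sum()
    return cost

def best_threshold_cut(X):
    """Return (feature, threshold, cost) of the cheapest single-cut 2-means clustering."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # cut between consecutive values
            cost = two_means_cost(X, X[:, j] <= t)
            if cost < best[2]:
                best = (j, t, cost)
    return best

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(5, 1, (100, 3))])
print(best_threshold_cut(X))  # a single cut that cleanly separates the two bumps
```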
Optimal rates for k-NN density and mode estimation
Dasgupta, Sanjoy, Kpotufe, Samory
We present two related contributions of independent interest: (1) high-probability finite sample rates for $k$-NN density estimation, and (2) practical mode estimators -- based on $k$-NN -- which attain minimax-optimal rates under surprisingly general distributional conditions.
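For concreteness, here is a minimal sketch of the two objects in the abstract: the classical $k$-NN density estimate and a mode estimator that returns the sample point of highest estimated density. The function names are ours, and the sketch ignores refinements such as excluding each point from its own neighbor set.

```python
# k-NN density estimate and the induced mode estimate.
import numpy as np
from scipy.special import gamma
from sklearn.neighbors import NearestNeighbors

def knn_density(X, queries, k):
    """f_hat(x) = k / (n * v_d * r_k(x)^d), where r_k is the distance to the k-th nearest neighbor."""
    n, d = X.shape
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)          # volume of the unit ball in R^d
    r_k = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(queries)[0][:, -1]
    return k / (n * v_d * r_k ** d)

def knn_mode(X, k):
    """Return the sample point with the highest k-NN density estimate."""
    return X[np.argmax(knn_density(X, X, k))]

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, scale=1.0, size=(2000, 2))
print(knn_mode(X, k=50))  # close to the true mode (3, 3)
```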
An algorithm for L1 nearest neighbor search via monotonic embedding
Wang, Xinan, Dasgupta, Sanjoy
Fast algorithms for nearest neighbor (NN) search have in large part focused on L2 distance. Here we develop an approach for L1 distance that begins with an explicit and exact embedding of the points into L2. We show how this embedding can efficiently be combined with random projection methods for L2 NN search, such as locality-sensitive hashing or random projection trees. We rigorously establish the correctness of the methodology and show by experimentation that it is competitive in practice with available alternatives.
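One classical way to get such an embedding, for integer-valued coordinates, is a unary (thermometer) code: squared Euclidean distance between the codes equals L1 distance between the originals, so Euclidean distance is a monotone transform of L1 and nearest-neighbor order is preserved. The sketch below illustrates this construction; it is not necessarily the paper's exact embedding.

```python
# Thermometer embedding: ||x - y||_1 == ||phi(x) - phi(y)||_2^2 for integer points.
import numpy as np

def thermometer_embed(X, max_val):
    """Map integer points in {0,...,max_val}^d to binary vectors in {0,1}^(d*max_val)."""
    n, d = X.shape
    levels = np.arange(1, max_val + 1)
    return (X[:, :, None] >= levels).reshape(n, d * max_val).astype(float)

rng = np.random.default_rng(3)
X = rng.integers(0, 11, size=(5, 4))          # integer coordinates in {0,...,10}
E = thermometer_embed(X, max_val=10)

l1 = np.abs(X[0] - X[1]).sum()
l2_sq = ((E[0] - E[1]) ** 2).sum()
print(l1, l2_sq)  # identical values
```

After such an embedding, standard L2 machinery such as locality-sensitive hashing or random projection trees can be applied, as described above.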
Incremental Clustering: The Case for Extra Clusters
Ackerman, Margareta, Dasgupta, Sanjoy
The explosion in the amount of data available for analysis often necessitates a transition from batch to incremental clustering methods, which process one element at a time and typically store only a small subset of the data. In this paper, we initiate the formal analysis of incremental clustering methods, focusing on the types of cluster structure that they are able to detect. We find that the incremental setting is strictly weaker than the batch model, proving that a fundamental class of cluster structures that can readily be detected in the batch setting is impossible to identify using any incremental method. Furthermore, we show how the limitations of incremental clustering can be overcome by allowing additional clusters.
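As a purely illustrative sketch (not the paper's construction), the sequential clusterer below processes one point at a time, stores nothing but cluster centers, and opens a new cluster whenever a point is far from every existing center; letting it open more clusters than the target number is the kind of extra-cluster slack the abstract refers to.

```python
# A "leader"-style incremental clusterer: one pass, constant memory per cluster.
import numpy as np

class SequentialClusterer:
    def __init__(self, radius):
        self.radius = radius
        self.centers = []          # the only state retained between points

    def observe(self, x):
        """Assign x to the nearest existing center, or open a new cluster."""
        if self.centers:
            dists = [np.linalg.norm(x - c) for c in self.centers]
            i = int(np.argmin(dists))
            if dists[i] <= self.radius:
                return i
        self.centers.append(np.asarray(x, dtype=float))
        return len(self.centers) - 1

rng = np.random.default_rng(4)
stream = rng.normal(size=(300, 2)) + rng.integers(0, 3, size=(300, 1)) * 6
clusterer = SequentialClusterer(radius=4.0)
labels = [clusterer.observe(x) for x in stream]
print(len(clusterer.centers), "clusters opened for 3 true groups")
```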
Interactive Topic Modeling with Anchor Words
Dasgupta, Sanjoy, Poulis, Stefanos, Tosh, Christopher
The formalism of anchor words has enabled the development of fast topic modeling algorithms with provable guarantees. In this paper, we introduce a protocol that allows users to interact with anchor words to build customized and interpretable topic models. Experimental evidence validating the usefulness of our approach is also presented.
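To make the anchor-word formalism concrete, the sketch below performs the standard recovery step in the style of Arora et al.: each word's normalized co-occurrence profile is expressed as a nonnegative combination of the profiles of user-chosen anchor words. This illustrates the object a user would be editing; it is not the paper's interactive protocol, and the matrix, anchors, and function name are illustrative.

```python
# Anchor-word recovery step: per-word topic weights from a co-occurrence matrix.
import numpy as np
from scipy.optimize import nnls

def recover_word_topic_weights(cooccur, anchors):
    """Rows of the result approximate p(topic | word) up to normalization."""
    Q = cooccur / cooccur.sum(axis=1, keepdims=True)      # row-normalize co-occurrences
    A = Q[anchors].T                                       # anchor profiles as columns
    W = np.array([nnls(A, q)[0] for q in Q])               # nonnegative least squares per word
    return W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)

rng = np.random.default_rng(5)
cooccur = rng.random((20, 20)) + np.eye(20)                # toy co-occurrence counts
anchors = [0, 7, 13]                                       # e.g. words a user picked as anchors
print(recover_word_topic_weights(cooccur, anchors).shape)  # (20, 3): one topic weight vector per word
```

Re-running the recovery with a different set of anchors is, roughly, what user interaction would change.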
An adaptive nearest neighbor rule for classification
Balsubramani, Akshay, Dasgupta, Sanjoy, Freund, Yoav, Moran, Shay
We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. The choice of $k$ depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger $k$ for predicting the labels of points in noisy regions.) We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, $k$-NN with an optimal choice of $k$. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the `advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
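A minimal sketch of the adaptive idea for binary labels: for each query, grow $k$ until one class's share among the $k$ nearest neighbors clears 1/2 by a confidence margin. The specific $\sqrt{\log(n)/k}$ margin below is our illustrative choice, not necessarily the advantage-based rule analyzed in the paper.

```python
# Adaptive k: stop growing the neighborhood once one class wins by a clear margin.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_knn_predict(X, y, query, max_k=None):
    n = len(X)
    max_k = max_k or n
    order = NearestNeighbors(n_neighbors=max_k).fit(X).kneighbors([query])[1][0]
    for k in range(1, max_k + 1):
        frac = y[order[:k]].mean()                 # fraction of label 1 among k nearest
        margin = np.sqrt(np.log(n) / k)
        if frac >= 0.5 + margin:
            return 1, k
        if frac <= 0.5 - margin:
            return 0, k
    return int(y[order[:max_k]].mean() >= 0.5), max_k   # fall back to plain majority vote

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(1000, 2))
y = (X[:, 0] > 0).astype(int)                      # clean regions left/right of x = 0
print(adaptive_knn_predict(X, y, query=np.array([0.8, 0.0])))   # deep inside a region: small k
print(adaptive_knn_predict(X, y, query=np.array([0.01, 0.0])))  # near the boundary: larger k
```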
What relations are reliably embeddable in Euclidean space?
Bhattacharjee, Robi, Dasgupta, Sanjoy
We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.
Learning from discriminative feature feedback
Dasgupta, Sanjoy, Dey, Akansha, Roberts, Nicholas, Sabato, Sivan
We consider the problem of learning a multi-class classifier from labels as well as simple explanations that we call discriminative features. We show that such explanations can be provided whenever the target concept is a decision tree, or can be expressed as a particular type of multi-class DNF formula. We present an efficient online algorithm for learning from such feedback and we give tight bounds on the number of mistakes made during the learning process. These bounds depend only on the representation size of the target concept and not on the overall number of available features, which could be infinite. We also demonstrate the learning procedure experimentally.
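To see why decision-tree targets admit such feedback, note that two examples reaching different leaves must disagree on the feature tested where their root-to-leaf paths split, so a teacher can always name a discriminative feature. The sketch below makes this concrete; the tree encoding and function names are illustrative.

```python
# Examples are sets of Boolean features. A node is either a label (leaf) or a
# tuple (feature, subtree_if_absent, subtree_if_present).
tree = ("f1",
        ("f2", 0, 1),       # f1 absent:  label depends on f2
        ("f3", 1, 0))       # f1 present: label depends on f3

def classify(node, x):
    while isinstance(node, tuple):
        feature, if_absent, if_present = node
        node = if_present if feature in x else if_absent
    return node

def discriminative_feature(node, x, x_prime):
    """First feature on which the two examples' paths through the tree diverge."""
    while isinstance(node, tuple):
        feature, if_absent, if_present = node
        if (feature in x) != (feature in x_prime):
            return feature
        node = if_present if feature in x else if_absent
    return None   # same leaf: the examples belong to the same cluster

x, x_prime = {"f1", "f3"}, {"f2"}
print(classify(tree, x), classify(tree, x_prime))        # 0 1
print(discriminative_feature(tree, x, x_prime))          # 'f1'
```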