Inductive Learning
Ensuring Actionable Recourse via Adversarial Training
Ross, Alexis, Lakkaraju, Himabindu, Bastani, Osbert
As machine learning models are increasingly deployed in high-stakes domains such as legal and financial decision-making, there has been growing interest in post-hoc methods for generating counterfactual explanations. Such explanations provide individuals adversely impacted by predicted outcomes (e.g., an applicant denied a loan) with "recourse", i.e., a description of how they can change their features to obtain a positive outcome. We propose a novel algorithm that leverages adversarial training and PAC confidence sets to learn models that theoretically guarantee recourse to affected individuals with high probability without sacrificing accuracy. To the best of our knowledge, our approach is the first to learn models for which recourses are guaranteed with high probability. Extensive experimentation with real-world datasets spanning various applications, including recidivism prediction, bail outcomes, and lending, demonstrates the efficacy of the proposed framework.
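To make the notion of recourse concrete, here is a minimal sketch for a linear classifier. It is not the algorithm proposed in the paper (which relies on adversarial training and PAC confidence sets); it only shows what a recourse looks like. The function name `linear_recourse`, the `margin` parameter, and the toy weights are assumptions made for illustration, and real recourse methods would also restrict changes to actionable features.

```python
import numpy as np

def linear_recourse(x, w, b, margin=1e-3):
    """Smallest L2 change to x that makes the linear score w @ x + b positive.

    Illustrative only: assumes every feature can be changed freely.
    """
    score = w @ x + b
    if score > 0:
        return np.zeros_like(x)  # already a positive outcome, no change needed
    # Move along w just far enough to land `margin` above the decision boundary.
    return (margin - score) / (w @ w) * w

# Hypothetical loan model with two features.
w = np.array([1.0, -2.0])
b = -1.0
applicant = np.array([0.0, 1.0])          # score is -3, so the loan is denied
change = linear_recourse(applicant, w, b)
new_score = w @ (applicant + change) + b  # positive after applying the recourse
```

The returned vector is the "description of how to change the features": adding it to the applicant's features moves the prediction to the positive side.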
Doing the impossible? Machine learning with less than one example - KDnuggets
"Less-than-one-shot learning" enables machine learning algorithms to classify N labels with less than N training examples. If I told you to imagine something between a horse and a bird--say, a flying horse--would you need to see a concrete example? Such a creature does not exist, but nothing prevents us from using our imagination to create one: the Pegasus. The human mind has all kinds of mechanisms to create new concepts by combining abstract and concrete knowledge it has of the real world. We can imagine existing things that we might have never seen (a horse with a long neck--a giraffe), as well as things that do not exist in real life (a winged serpent that breathes fire--a dragon).
Discrete solution pools and noise-contrastive estimation for predict-and-optimize
Mulamba, Maxime, Mandi, Jayanta, Diligenti, Michelangelo, Lombardi, Michele, Bucarey, Victor, Guns, Tias
Numerous real-life decision-making processes involve solving a combinatorial optimization problem with uncertain input that can be estimated from historic data. There is a growing interest in decision-focused learning methods, where the loss function used for learning to predict the uncertain input uses the outcome of solving the combinatorial problem over a set of predictions. Different surrogate loss functions have been identified, often using a continuous approximation of the combinatorial problem. However, a key bottleneck is that to compute the loss, one has to solve the combinatorial optimisation problem for each training instance in each epoch, which is computationally expensive even in the case of continuous approximations. We propose a different solver-agnostic method for decision-focused learning, namely by considering a pool of feasible solutions as a discrete approximation of the full combinatorial problem. Solving is now trivial through a single pass over the solution pool. We design several variants of a noise-contrastive loss over the solution pool, which we substantiate theoretically and empirically. Furthermore, we show that by dynamically re-solving only a fraction of the training instances each epoch, our method performs on par with the state of the art, whilst drastically reducing the time spent solving, hence increasing the feasibility of predict-and-optimize for larger problems.
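The solution-pool idea can be sketched as follows, assuming a minimization problem with a linear objective so that a solution's value is a dot product with the cost vector. The function names and the specific loss form (mean gap between the true optimum and each pooled solution under the predicted costs) are illustrative assumptions, not necessarily one of the paper's exact loss variants.

```python
import numpy as np

def best_in_pool(pool, cost):
    """'Solve' by a single pass over the pool: return the cheapest pooled solution."""
    values = pool @ cost  # objective value of every pooled solution
    return pool[int(np.argmin(values))]

def pool_contrastive_loss(pool, predicted_cost, true_solution):
    """Mean gap, under the predicted costs, between the true optimum and each pooled solution.

    The loss is low when the predicted costs make the true optimum look
    cheaper than the other solutions in the pool.
    """
    return float(np.mean(true_solution @ predicted_cost - pool @ predicted_cost))

# Toy problem: four feasible solutions, each a 0/1 selection of two items.
pool = np.array([[1, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 1]], dtype=float)
true_cost = np.array([1.0, 2.0, 3.0, 4.0])
true_solution = best_in_pool(pool, true_cost)

loss_good = pool_contrastive_loss(pool, true_cost, true_solution)        # accurate prediction
loss_bad = pool_contrastive_loss(pool, true_cost[::-1], true_solution)   # reversed costs
```

No solver is called inside the loss; a solver is only needed to grow the pool, which is what allows re-solving just a fraction of the instances per epoch.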
Understanding Self-supervised Learning with Dual Deep Networks
Tian, Yuandong, Yu, Lantao, Chen, Xinlei, Ganguli, Surya
We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR with various loss functions (simple contrastive loss, soft Triplet loss and InfoNCE loss), the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averages over data augmentations. We show this leads to the emergence of hierarchical features, if the input data are generated from a hierarchical latent tree model. With the same framework, we also show analytically that in BYOL, the combination of Batch-Norm and a predictor network creates an implicit contrastive term, acting as an approximate covariance operator. Additionally, for linear architectures we derive exact solutions for BYOL that provide conceptual insights into how BYOL can learn useful non-collapsed representations without any contrastive terms that separate negative pairs. Extensive ablation studies justify our theoretical findings. Unlike supervised learning (SL) that deals with labeled data, self-supervised learning (SSL) learns meaningful structures from randomly initialized networks without human-provided labels. In this paper, we propose a systematic theoretical analysis of SSL with deep ReLU networks. Our analysis imposes no parametric assumptions on the input data distribution and is applicable to state-of-the-art SSL methods that typically involve two parallel (or dual) deep ReLU networks during training (e.g., SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), etc.). We do so by developing an analogy between SSL and a theoretical framework for analyzing supervised learning, namely the student-teacher setting (Tian, 2020; Allen-Zhu and Li, 2020; Lampinen and Ganguli, 2018; Saad and Solla, 1996), which also employs a pair of dual networks.
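For readers unfamiliar with the InfoNCE loss mentioned above, the following is a simplified, one-directional NumPy sketch. The function name, the temperature value, and the use of cosine similarity are assumptions for illustration; SimCLR's actual objective differs in detail (for example, it treats both views symmetrically), and the paper's analysis is not reproduced here.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a and row i of z_b are two views of one sample.

    Every other row of z_b serves as a negative for row i of z_a.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                    # pairwise cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # views of the same samples
mismatched = info_nce(z, z[::-1])                            # views paired with wrong samples
```

The loss is small when each embedding is most similar to its own second view and large when positives are indistinguishable from negatives.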
A Theory of Universal Learning
Bousquet, Olivier, Hanneke, Steve, Moran, Shay, van Handel, Ramon, Yehudayoff, Amir
How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its "learning curve", that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy. In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curve of any given concept class decays at either an exponential, a linear, or an arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
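The difference between the uniform (PAC) guarantee and the universal one can be written out as below. The notation is an assumption chosen for illustration: \hat{h}_n is the hypothesis learned from n examples, \mathrm{er}_P its error under distribution P, and R(n) a rate function.

```latex
% PAC (uniform) guarantee: a single bound must hold for the worst distribution at each n.
\sup_{P} \; \mathbb{E}\big[\mathrm{er}_P(\hat{h}_n)\big] \le R(n)

% Universal guarantee: the rate R is shared, but the constants may depend on P.
\forall P \;\; \exists\, C_P, c_P > 0 : \quad
\mathbb{E}\big[\mathrm{er}_P(\hat{h}_n)\big] \le C_P \, R(c_P\, n) \quad \text{for all } n

% Trichotomy: the best achievable R(n) is e^{-n}, 1/n, or arbitrarily slow.
```

Moving the quantifier over P outside the bound is what allows rates faster than the worst-case envelope.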
Towards Domain-Agnostic Contrastive Learning
Verma, Vikas, Luong, Minh-Thang, Kawaguchi, Kenji, Pham, Hieu, Le, Quoc V.
Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome this limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus data augmentation techniques, are not readily available. Key to our approach is the use of Mixup noise to create similar and dissimilar examples by mixing data samples differently either at the input or hidden-state levels. To demonstrate the effectiveness of DACL, we conduct experiments across various domains such as tabular data, images, and graphs. Our results show that DACL not only outperforms other domain-agnostic noising methods, such as Gaussian noise, but also combines well with domain-specific methods, such as SimCLR, to improve self-supervised visual representation learning. Finally, we theoretically analyze our method and show advantages over the Gaussian-noise-based contrastive learning approach.
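A minimal sketch of input-level Mixup noise follows: a positive example is formed by mixing a sample with a small amount of another randomly chosen sample. The function name, the `alpha` lower bound, and the uniform sampling of the mixing weight are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np

def mixup_positive(x, batch, rng, alpha=0.9):
    """Create a positive for `x` by mixing in a small amount of a random sample from `batch`."""
    lam = rng.uniform(alpha, 1.0)             # mixing weight kept close to 1 (assumed range)
    other = batch[rng.integers(len(batch))]   # random sample that acts as the noise
    return lam * x + (1.0 - lam) * other

rng = np.random.default_rng(0)
batch = rng.normal(size=(32, 10))   # e.g. 32 rows of tabular data with 10 features
anchor = batch[0]
positive = mixup_positive(anchor, batch, rng)

# The positive stays within 10% of the way towards any other sample.
max_dist = max(np.linalg.norm(b - anchor) for b in batch)
shift = np.linalg.norm(positive - anchor)
```

Because the perturbation is built from other data points rather than from domain knowledge, the same recipe applies to tabular data, images, or hidden states.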
Explaining Neural Matrix Factorization with Gradient Rollback
Lawrence, Carolin, Sztyler, Timo, Niepert, Mathias
Explaining the predictions of neural black-box models is an important problem, especially when such models are used in applications where user trust is crucial. Estimating the influence of training examples on a learned neural model's behavior allows us to identify training examples most responsible for a given prediction and, therefore, to faithfully explain the output of a black-box model. The most generally applicable existing method is based on influence functions, which scale poorly for larger sample sizes and models. We propose gradient rollback, a general approach for influence estimation, applicable to neural models where each parameter update step during gradient descent touches a smaller number of parameters, even if the overall number of parameters is large. Neural matrix factorization models trained with gradient descent are part of this model class. These models are popular and have found a wide range of applications in industry. Knowledge graph embedding methods in particular, which belong to this class, are used extensively. We show that gradient rollback is highly efficient at both training and test time. Moreover, we show theoretically that the difference between gradient rollback's influence approximation and the true influence on a model's behavior is smaller than known bounds on the stability of stochastic gradient descent. This establishes that gradient rollback robustly estimates example influence. We also conduct experiments which show that gradient rollback provides faithful explanations for knowledge base completion and recommender datasets.
US coronavirus cases set record, deaths rising -- with crisis central to Trump-Biden election battle
New confirmed cases of the coronavirus in the U.S. have climbed to an all-time high of more than 86,000 per day on average, in a glimpse of the worsening crisis that lies ahead for the winner of the presidential election. Cases and hospitalizations are setting records all around the country just as the holidays and winter approach, demonstrating the challenge that either President Donald Trump or former Vice President Joe Biden will face in the coming months. Daily new confirmed coronavirus cases in the U.S. have surged 45% over the past two weeks, to a record 7-day average of 86,352, according to data compiled by Johns Hopkins University. Deaths are also on the rise, up 15% to an average of 846 deaths every day. The total U.S. death toll is already more than 232,000, and total confirmed U.S. cases have surpassed 9 million.
There is no trade-off: enforcing fairness can improve accuracy
Maity, Subha, Mukherjee, Debarghya, Yurochkin, Mikhail, Sun, Yuekai
One of the main barriers to the broader adoption of algorithmic fairness in machine learning is the trade-off between fairness and performance of ML models: many practitioners are unwilling to sacrifice the performance of their ML model for fairness. In this paper, we show that this trade-off may not be necessary. If the algorithmic biases in an ML model are due to sampling biases in the training data, then enforcing algorithmic fairness may improve the performance of the ML model on unbiased test data. We study conditions under which enforcing algorithmic fairness helps practitioners learn the Bayes decision rule for (unbiased) test data from biased training data. We also demonstrate the practical implications of our theoretical results in real-world ML tasks.