
 Chidambaram, Muthu


Humanity's Last Exam

arXiv.org Artificial Intelligence

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 3,000 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.


What does guidance do? A fine-grained analysis in a simple setting

arXiv.org Machine Learning

The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment.
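
The tilted-distribution premise discussed above can be made concrete in the Gaussian-mixture case the paper analyzes. Below is a minimal one-dimensional sketch of the guidance-modified score using the standard classifier-free-guidance combination; the means, mixture weights, and guidance values are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.stats import norm

# Two classes, each a unit-variance Gaussian; the data distribution is their 50/50 mixture.
MU = {0: -2.0, 1: 2.0}

def cond_score(x, c):
    # Score of p(x | c) for a unit-variance Gaussian: d/dx log N(x; mu_c, 1) = mu_c - x.
    return MU[c] - x

def uncond_score(x):
    # Score of the mixture p(x) = 0.5 * N(x; mu_0, 1) + 0.5 * N(x; mu_1, 1).
    p0, p1 = norm.pdf(x, MU[0], 1.0), norm.pdf(x, MU[1], 1.0)
    return (p0 * (MU[0] - x) + p1 * (MU[1] - x)) / (p0 + p1)

def guided_score(x, c, w):
    # Guidance-modified score: the unconditional score pushed toward the conditional one.
    # At the clean-data level this equals the score of p(x) * p(c | x)^w, i.e. the data
    # distribution tilted by the conditional likelihood raised to the power w; the paper's
    # point is that running diffusion with this score does NOT sample that tilted law.
    return uncond_score(x) + w * (cond_score(x, c) - uncond_score(x))

for w in (1.0, 3.0, 10.0):
    print(f"w = {w:4.1f} -> guided score at x = 0.5: {guided_score(0.5, c=1, w=w):7.3f}")
```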


Reassessing How to Compare and Improve the Calibration of Machine Learning Models

arXiv.org Machine Learning

Standard machine learning models are trained to predict probability distributions over a set of possible actions or outcomes. Model-based decision-making is then typically done by using the action or outcome associated with the highest probability, and ideally one would like to interpret the model-predicted probability as a notion of confidence in the predicted action/outcome. In order for this confidence interpretation to be valid, it is crucial that the predicted probabilities are calibrated (Lichtenstein et al., 1982; Dawid, 1982; DeGroot & Fienberg, 1983), or accurately reflect the true frequencies of the outcome conditional on the prediction. As an informal (classic) example, a calibrated weather prediction model would satisfy the property that we observe rain 80% of the time on days for which our model predicted a 0.8 probability of rain. As the applications of machine learning models - particularly deep learning models - continue to expand to include high-stakes areas such as medical image diagnoses (Mehrtash et al., 2019; Elmarakeby et al., 2021; Nogales et al., 2021) and self-driving cars (Hu et al., 2023), so too does the importance of having calibrated model probabilities. Unfortunately, the seminal empirical investigation of Guo et al. (2017) demonstrated that deep learning models can be poorly calibrated, largely due to overconfidence. This observation has led to a number of follow-up works intended to improve model calibration using both training-time (Thulasidasan et al., 2019; Müller et al., 2020; Wang et al., 2021) and post-training methods (Joy et al., 2022; Gupta & Ramdas, 2022). Comparing these proposed improvements, however, is non-trivial due to the fact that the measurement of calibration in practice is itself an active area of research (Nixon et al., 2019).
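
The weather example above can be checked mechanically: group predictions by their stated probability and compare against observed frequencies. A minimal simulation sketch follows; the forecast values and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated forecasts: each day gets a predicted rain probability, and (because the
# simulated forecaster is calibrated) rain then occurs with exactly that probability.
pred = rng.choice([0.1, 0.3, 0.5, 0.8], size=10_000)
rain = rng.random(pred.size) < pred

# Calibration check: among days with a given predicted probability, how often did it rain?
for p in np.unique(pred):
    days = pred == p
    print(f"predicted {p:.1f} -> observed rain frequency {rain[days].mean():.3f}")
```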


How Flawed is ECE? An Analysis via Logit Smoothing

arXiv.org Artificial Intelligence

The prevalence of machine learning across domains has increased drastically over the past few years, spurred by significant breakthroughs in deep learning for computer vision (Ramesh et al., 2022) and language modeling (Brown et al., 2020; OpenAI, 2023; Touvron et al., 2023). Consequently, the underlying deep learning models are increasingly being evaluated for critical use cases such as predicting medical diagnoses (Elmarakeby et al., 2021; Nogales et al., 2021) and self-driving (Hu et al., 2023). In these latter cases, due to the risk associated with incorrect decision-making, it is crucial not only that the models be accurate, but also that they have proper predictive uncertainty. This desideratum is formalized via the notion of calibration (Dawid, 1982; DeGroot & Fienberg, 1983), which codifies how well the model-predicted probabilities for events reflect their true frequencies conditional on the predictions. For example, in a medical context, a model that yields the correct diagnosis for a patient 95% of the time when it predicts a probability of 0.95 for that diagnosis can be considered to be calibrated. The analysis of whether modern deep learning models are calibrated can be traced back to the influential work of Guo et al. (2017), which showed that recent models exhibit calibration issues not present in earlier models; in particular, they are overconfident when they are incorrect.
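
For context on why measuring calibration is itself contentious, the widely used binned expected calibration error (ECE) depends on an arbitrary binning choice. The sketch below illustrates that generic sensitivity on synthetic predictions; it is not the specific logit-smoothing analysis carried out in the paper.

```python
import numpy as np

def binned_ece(conf, correct, n_bins):
    # Equal-width binned ECE: weighted average of |accuracy - mean confidence| per bin.
    bins = np.digitize(conf, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

rng = np.random.default_rng(1)
conf = rng.beta(5, 1, size=2000)            # mostly high-confidence predictions
correct = rng.random(2000) < conf * 0.9     # a mildly overconfident model
for n_bins in (5, 15, 50, 200):
    print(f"{n_bins:3d} bins -> ECE estimate {binned_ece(conf, correct, n_bins):.4f}")
```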


For Better or For Worse? Learning Minimum Variance Features With Label Augmentation

arXiv.org Artificial Intelligence

The training and fine-tuning procedures for current state-of-the-art (SOTA) computer vision models rely on a number of different data augmentation schemes applied in tandem (Yu et al., 2022; Wortsman et al., 2022; Dehghani et al., 2023). While some of these methods involve only transformations to the input training data - such as random crops and rotations (Cubuk et al., 2019) - a non-trivial subset of them also apply transformations to the input training label. Perhaps the two most widely applied data augmentation methods in this subcategory are label smoothing (Szegedy et al., 2015) and Mixup (Zhang et al., 2018). Label smoothing replaces the one-hot encoded labels in the training data with smoothed out labels that assign non-zero probability to every possible class (see Section 2 for a formal definition). Mixup similarly smooths out the training labels, but does so via introducing random convex combinations of data points and their labels. As a result, Mixup modifies not only the training labels but also the training inputs.
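
Both label-level augmentations described above are easy to state in code. A minimal NumPy sketch follows; the smoothing parameter, Beta parameter, and pairing-by-shuffling are illustrative defaults, not prescriptions from the paper.

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # Label smoothing: move eps of the probability mass uniformly onto all classes.
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes

def mixup(x, y_one_hot, alpha=0.2, rng=None):
    # Mixup: convex combinations of shuffled pairs of inputs AND of their labels.
    rng = np.random.default_rng(0) if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_one_hot + (1 - lam) * y_one_hot[perm]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # 4 toy inputs with 8 features
y = np.eye(3)[[0, 1, 2, 0]]            # one-hot labels over 3 classes
print(smooth_labels(y))
print(mixup(x, y, rng=rng))
```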


On the Limitations of Temperature Scaling for Distributions with Overlaps

arXiv.org Machine Learning

Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to be overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that for empirical risk minimizers for a general set of distributions in which the supports of classes have overlaps, the performance of temperature scaling degrades with the amount of overlap between classes, and asymptotically becomes no better than random when there are a large number of classes. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can in fact lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.
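
Temperature scaling, the post-training procedure referenced above, rescales a trained model's logits by a single scalar T chosen to minimize validation negative log-likelihood. A minimal sketch with toy logits follows; the optimizer bounds and data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels):
    # Post-hoc temperature scaling: choose T > 0 minimizing the validation
    # negative log-likelihood of softmax(logits / T).
    def nll(t):
        probs = softmax(logits / t)
        return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

rng = np.random.default_rng(0)
logits = 5.0 * rng.standard_normal((500, 10))   # deliberately over-sharp toy logits
labels = rng.integers(0, 10, size=500)
print("fitted temperature:", round(fit_temperature(logits, labels), 3))
```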


Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup

arXiv.org Artificial Intelligence

Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have multiple features.
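
The "specific instantiation of Mixup" referenced above is, per the title, Midpoint Mixup, i.e., Mixup with the mixing weight fixed at 1/2. Below is a hedged sketch of how a training batch might be formed; random same-batch pairing is an assumption here, and the paper's exact construction may differ.

```python
import numpy as np

def midpoint_mixup_batch(x, y_one_hot, rng=None):
    # As in standard Mixup, pair each example with a random partner from the batch,
    # but fix the mixing weight at exactly 1/2 instead of sampling it from a Beta.
    rng = np.random.default_rng(0) if rng is None else rng
    perm = rng.permutation(len(x))
    return 0.5 * (x + x[perm]), 0.5 * (y_one_hot + y_one_hot[perm])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = np.eye(2)[[0, 0, 1, 1]]
print(midpoint_mixup_batch(x, y, rng=rng))
```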


Hiding Data Helps: On the Benefits of Masking for Sparse Coding

arXiv.org Artificial Intelligence

Sparse coding, which refers to modeling a signal as sparse linear combinations of the elements of a learned dictionary, has proven to be a successful (and interpretable) approach in applications such as signal processing, computer vision, and medical imaging. While this success has spurred much work on provable guarantees for dictionary recovery when the learned dictionary is the same size as the ground-truth dictionary, work on the setting where the learned dictionary is larger (or over-realized) with respect to the ground truth is comparatively nascent. Existing theoretical results in this setting have been constrained to the case of noiseless data. We show in this work that, in the presence of noise, minimizing the standard dictionary learning objective can fail to recover the elements of the ground-truth dictionary in the over-realized regime, regardless of the magnitude of the signal in the data-generating process. Furthermore, drawing from the growing body of work on self-supervised learning, we propose a novel masking objective for which recovering the ground-truth dictionary is in fact optimal as the signal increases for a large class of data-generating processes. We corroborate our theoretical results with experiments across several parameter regimes showing that our proposed objective also enjoys better empirical performance than the standard reconstruction objective.
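
The contrast between a full-reconstruction objective and a masked objective can be sketched schematically. The code below uses ridge-regularized codes as a stand-in for a proper L1 sparse-coding solver, and the particular coordinate-masking scheme is a generic assumption rather than the masking objective proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 20, 30, 100                  # signal dimension, over-realized dictionary size, samples
D = rng.standard_normal((d, k))        # "learned" dictionary (a random placeholder here)
X = rng.standard_normal((n, d))        # observed (noisy) signals
lam = 0.1

def codes(X, D, lam):
    # Ridge-regularized codes as a stand-in for an L1-regularized sparse-coding solver.
    return X @ D @ np.linalg.inv(D.T @ D + lam * np.eye(D.shape[1]))

def reconstruction_loss(X, D, lam):
    # Standard dictionary learning objective: reconstruct every coordinate from its code.
    return np.mean((X - codes(X, D, lam) @ D.T) ** 2)

def masked_loss(X, D, lam, mask_frac=0.3):
    # Generic masked variant: hide a random subset of coordinates from the encoder
    # and score the reconstruction only on the hidden coordinates.
    hidden = rng.random(X.shape) < mask_frac
    recon = codes(np.where(hidden, 0.0, X), D, lam) @ D.T
    return np.mean(((X - recon) ** 2)[hidden])

print("reconstruction objective:", round(reconstruction_loss(X, D, lam), 4))
print("masked objective:        ", round(masked_loss(X, D, lam), 4))
```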


Towards Understanding the Data Dependency of Mixup-style Training

arXiv.org Artificial Intelligence

In the Mixup training paradigm, a model is trained using convex combinations of data points and their associated labels. Despite seeing very few true data points during training, models trained using Mixup seem to still minimize the original empirical risk and exhibit better generalization and robustness on various tasks when compared to standard training. In this paper, we investigate how these benefits of Mixup training rely on properties of the data in the context of classification. For minimizing the original empirical risk, we compute a closed form for the Mixup-optimal classification, which allows us to construct a simple dataset on which minimizing the Mixup loss can provably lead to learning a classifier that does not minimize the empirical loss on the data. On the other hand, we also give sufficient conditions under which Mixup training minimizes the original empirical risk. For generalization, we characterize the margin of a Mixup classifier, and use this to understand why the decision boundary of a Mixup classifier can adapt better to the full structure of the training data when compared to standard training. In contrast, we also show that, for a large class of linear models and linearly separable datasets, Mixup training leads to learning the same classifier as standard training.
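
The observation that Mixup-trained models "seem to still minimize the original empirical risk" can be probed on toy data: train with the Mixup loss, then evaluate the unmixed empirical risk. The sketch below uses an illustrative linearly separable dataset and plain gradient descent, not the constructions analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable binary data and a logistic model trained with the Mixup loss.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.r_[np.zeros(100), np.ones(100)]
w, b = np.zeros(2), 0.0

def bce(p, t):
    # Binary cross-entropy (the empirical risk being monitored).
    return -(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12)).mean()

for _ in range(2000):                        # plain gradient descent on the Mixup loss
    lam = rng.beta(1.0, 1.0)
    perm = rng.permutation(len(X))
    Xm = lam * X + (1 - lam) * X[perm]       # mixed inputs
    ym = lam * y + (1 - lam) * y[perm]       # mixed (soft) labels
    p = 1.0 / (1.0 + np.exp(-(Xm @ w + b)))
    grad = p - ym                            # gradient of the BCE w.r.t. the logits
    w -= 0.1 * Xm.T @ grad / len(X)
    b -= 0.1 * grad.mean()

# Does the Mixup-trained model also drive down the ORIGINAL (unmixed) empirical risk?
p_orig = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print("original empirical risk:", round(bce(p_orig, y), 4))
```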