 Liu, Jiashuo


Error Slice Discovery via Manifold Compactness

arXiv.org Artificial Intelligence

Despite the great performance of deep learning models in many areas, they still make mistakes and underperform on certain subsets of data, i.e. error slices. Given a trained model, it is important to identify its semantically coherent error slices that are easy to interpret, which is referred to as the error slice discovery problem. However, there is no proper metric of slice coherence that does not rely on extra information like predefined slice labels. Current evaluation of slice coherence requires access to predefined slices formulated from metadata like attributes or subclasses. Its validity relies heavily on the quality and abundance of metadata, so some possible patterns could be ignored. Besides, current algorithms cannot directly incorporate the constraint of coherence into their optimization objective due to the absence of an explicit coherence metric, which could potentially hinder their effectiveness. In this paper, we propose manifold compactness, a coherence metric that does not rely on extra information because it incorporates the geometric properties of the data into its design, and experiments on typical datasets empirically validate the rationality of the metric. Then we develop Manifold Compactness based error Slice Discovery (MCSD), a novel algorithm that directly treats risk and coherence as the optimization objective and can be flexibly applied to models for various tasks. Extensive experiments on the benchmark and case studies on other typical datasets demonstrate the superiority of MCSD.


Towards Human-Guided, Data-Centric LLM Co-Pilots

arXiv.org Machine Learning

Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the disconnect between the needs of domain experts and translating these needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in complex real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.


LLM Embeddings Improve Test-time Adaptation to Tabular $Y|X$-Shifts

arXiv.org Machine Learning

For tabular datasets, changes in the relationship between the label and covariates ($Y|X$-shifts) are common due to missing variables (a.k.a. confounders). Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate $Y|X$-shifts, and propose to leverage the prior world knowledge in LLMs by serializing the tabular data (writing each row down as text) and encoding it. We find that LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be well adapted/finetuned to the target domain even using 32 labeled observations. Our finding is based on a comprehensive and systematic study consisting of 7650 source-target pairs and a benchmark against 261,000 model configurations trained by 22 algorithms. Our observation holds when ablating the size of accessible target data and different adaptation strategies. The code is available at https://github.com/namkoong-lab/LLM-Tabular-Shifts.
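
The serialization step mentioned above can be illustrated with a minimal sketch. The sentence template, the function name `serialize_row`, and the example column names are assumptions made for illustration; the abstract does not specify the template actually used. The resulting string is what would be passed to an LLM encoder to obtain an embedding (the encoder call itself is not shown).

```python
def serialize_row(row, label_name=None):
    """Write one tabular record (a dict) down as a plain-English sentence.

    Hypothetical template: "The <column> is <value>; the <column> is <value>."
    The label column and missing values (None) are left out of the text.
    """
    parts = [
        f"{col.replace('_', ' ')} is {val}"
        for col, val in row.items()
        if col != label_name and val is not None
    ]
    if not parts:
        return ""
    return "The " + "; the ".join(parts) + "."


# Illustrative record with made-up column names.
example = {"age": 42, "occupation": "nurse", "hours_per_week": 36, "income": 1}
text = serialize_row(example, label_name="income")
print(text)
```

A downstream model would then be trained on the embeddings of such strings rather than on the raw columns.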


Topology-Aware Dynamic Reweighting for Distribution Shifts on Graph

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) have been widely used in node classification tasks such as advertising recommendation [15] and social network anomaly detection [34]. However, these GNN models typically assume that the training and test graph data are drawn from the same distribution, which does not always hold in practice. In real-world graph data, sample selection bias [8, 12] as well as graph construction techniques [27, 43] often bring distribution shifts between training nodes and test nodes. For instance, in the WebKB [26] datasets, web pages (nodes) and categories (labels) are heavily affected by the university they originate from, leading to distribution shifts among nodes drawn from different universities. Therefore, in order to enhance the practical validity of GNNs, it is of paramount importance to deal with distribution shifts on graph data. To address the distribution shift problem in node classification, recent works [18, 36, 32, 37, 23] borrow invariant learning methods from the literature on out-of-distribution (OOD) generalization and apply them to graph-structured data. Invariant learning [1, 19] stems from the causal inference literature and has become one of the key approaches to solving OOD problems on graphs. The core concept is to identify invariant features with stable prediction mechanisms across different environments, thereby mitigating performance degradation under distribution shifts. Most of the works in this line directly apply existing invariant learning algorithms to graph-level classification tasks (the majority) [18, 32, 23, 41] and node classification tasks (the minority) [36, 38].


Bridging Multicalibration and Out-of-distribution Generalization Beyond Covariate Shift

arXiv.org Artificial Intelligence

We establish a new model-agnostic optimization framework for out-of-distribution generalization via multicalibration, a criterion that ensures a predictor is calibrated across a family of overlapping groups. Multicalibration is shown to be associated with robustness of statistical inference under covariate shift. We further establish a link between multicalibration and robustness for prediction tasks both under and beyond covariate shift. We accomplish this by extending multicalibration to incorporate grouping functions that consider covariates and labels jointly. This leads to an equivalence between the extended multicalibration and invariance, an objective for robust learning in the presence of concept shift. We show a linear structure of the grouping function class spanned by density ratios, resulting in a unifying framework for robust learning by designing specific grouping functions. We propose MC-Pseudolabel, a post-processing algorithm to achieve both extended multicalibration and out-of-distribution generalization. The algorithm, with lightweight hyperparameters and optimization through a series of supervised regression steps, achieves superior performance on real-world datasets with distribution shift.
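
To make "calibrated across a family of overlapping groups" concrete, here is a minimal sketch of a multicalibration error for finitely many groups defined on covariates only. It is the standard covariate-based notion, not the extended version with joint covariate-label grouping functions, and it is not MC-Pseudolabel. The function name, the binning scheme, and the mass-weighting are assumptions chosen for illustration.

```python
from collections import defaultdict


def multicalibration_error(preds, labels, groups, n_bins=10):
    """Largest mass-weighted mean residual over all (group, prediction-bin) cells.

    preds  : predicted values in [0, 1]
    labels : observed outcomes
    groups : dict mapping group name -> list of 0/1 membership flags
             (groups may overlap)
    """
    n = len(preds)
    worst = 0.0
    for member in groups.values():
        residual_sum = defaultdict(float)
        for p, y, m in zip(preds, labels, member):
            if m:
                b = min(int(p * n_bins), n_bins - 1)  # index of the prediction bin
                residual_sum[b] += y - p
        for s in residual_sum.values():
            # |sum of residuals| / n = P(cell) * |mean residual within the cell|
            worst = max(worst, abs(s) / n)
    return worst


# A predictor that is calibrated overall but not on a subgroup.
preds = [0.5, 0.5]
labels = [0, 1]
overall = multicalibration_error(preds, labels, {"all": [1, 1]})
with_subgroup = multicalibration_error(preds, labels, {"all": [1, 1], "first": [1, 0]})
print(overall, with_subgroup)
```

In the toy example, the predictor has zero error when only the whole population is checked, but a nonzero error once the subgroup is added to the family, which is the gap multicalibration is meant to close.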


Stability Evaluation via Distributional Perturbation Analysis

arXiv.org Machine Learning

The performance of learning models often deteriorates when deployed in out-of-sample environments. To ensure reliable deployment, we propose a stability evaluation criterion based on distributional perturbations. Conceptually, our stability evaluation criterion is defined as the minimal perturbation required on our observed dataset to induce a prescribed deterioration in risk evaluation. In this paper, we utilize the optimal transport (OT) discrepancy with moment constraints on the (sample, density) space to quantify this perturbation. Therefore, our stability evaluation criterion can address both data corruptions and sub-population shifts -- the two most common types of distribution shifts in real-world scenarios. To further realize practical benefits, we present a series of tractable convex formulations and computational methods tailored to different classes of loss functions. The key technical tool to achieve this is the strong duality theorem provided in this paper. Empirically, we validate the practical utility of our stability evaluation criterion across a host of real-world applications. These empirical studies showcase the criterion's ability not only to compare the stability of different learning models and features but also to provide valuable guidelines and strategies to further improve models.
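
The verbal definition above (the minimal perturbation of the observed data that induces a prescribed deterioration in risk) might be written schematically as the optimization problem below. The notation is assumed for illustration and is not taken from the abstract: $\hat{P}$ is the empirical distribution of the observed dataset, $\ell(\theta; Z)$ is the loss of a model with parameters $\theta$ on a sample $Z$, $r$ is the prescribed risk level, and $D$ stands for the chosen discrepancy (in the paper, an OT discrepancy with moment constraints).

```latex
\[
  \mathfrak{R}(\theta, r)
  \;=\;
  \inf_{Q}
  \Big\{\, D(Q, \hat{P}) \;:\; \mathbb{E}_{Q}\big[\ell(\theta; Z)\big] \ge r \,\Big\}
\]
```

Under this reading, a larger value of $\mathfrak{R}(\theta, r)$ means a larger perturbation is needed before the risk reaches $r$, so the model is more stable.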


Towards Robust Out-of-Distribution Generalization Bounds via Sharpness

arXiv.org Artificial Intelligence

Generalizing to out-of-distribution (OOD) data or unseen domains, termed OOD generalization, still lacks appropriate theoretical guarantees. Canonical OOD bounds focus on different distance measurements between source and target domains but fail to consider the optimization property of the learned model. As empirically shown in recent work, the sharpness of learned minima influences OOD generalization. To bridge this gap between optimization and OOD generalization, we study the effect of sharpness on how a model tolerates data change in domain shift, which is usually captured by "robustness" in generalization. In this paper, we give a rigorous connection between sharpness and robustness, which gives better OOD guarantees for robust algorithms. It also provides a theoretical backing for "flat minima leads to better OOD generalization". Overall, we propose a sharpness-based OOD generalization bound by taking robustness into consideration, resulting in a tighter bound than non-robust guarantees. Our findings are supported by experiments on a ridge regression model, as well as experiments on deep learning classification tasks.


A Survey on Evaluation of Out-of-Distribution Generalization

arXiv.org Artificial Intelligence

Machine learning models, while progressively advanced, rely heavily on the IID assumption, which is often unfulfilled in practice due to inevitable distribution shifts. This renders them susceptible and untrustworthy for deployment in risk-sensitive applications. Such a significant problem has consequently spawned various branches of works dedicated to developing algorithms capable of Out-of-Distribution (OOD) generalization. Despite these efforts, much less attention has been paid to the evaluation of OOD generalization, which is also a complex and fundamental problem. Its goal is not only to assess whether a model's OOD generalization capability is strong or not, but also to evaluate where a model generalizes well or poorly. This entails characterizing the types of distribution shifts that a model can effectively address, and identifying the safe and risky input regions given a model. This paper serves as the first effort to conduct a comprehensive review of OOD evaluation. We categorize existing research into three paradigms: OOD performance testing, OOD performance prediction, and OOD intrinsic property characterization, according to the availability of test data. Additionally, we briefly discuss OOD evaluation in the context of pretrained models. In closing, we propose several promising directions for future research in OOD evaluation.


Enhancing Distributional Stability among Sub-populations

arXiv.org Artificial Intelligence

Enhancing the stability of machine learning algorithms under distributional shifts is at the heart of the Out-of-Distribution (OOD) Generalization problem. Derived from causal learning, recent works on invariant learning pursue strict invariance with multiple training environments. Although intuitively reasonable, learning the strict invariance property requires strong assumptions on the availability and quality of environments. In this work, we propose the notion of "distributional stability" to mitigate such limitations. It quantifies the stability of prediction mechanisms among sub-populations down to a prescribed scale. Based on this, we propose the learnability assumption and derive the generalization error bound under distribution shifts. Inspired by theoretical analyses, we propose our novel stable risk minimization (SRM) algorithm to enhance the model's stability w.r.t. shifts in prediction mechanisms ($Y|X$-shifts). Experimental results are consistent with our intuition and validate the effectiveness of our algorithm. The code can be found at https://github.com/LJSthu/SRM.


Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications

arXiv.org Artificial Intelligence

Machine learning algorithms minimizing average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimations as well as poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates noisy sample effects. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.
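
The over-pessimism described above can be seen in a minimal sketch of standard KL-penalized DRO. This is not GCDRO, which additionally uses data geometry in calibration terms. For a temperature `tau`, the worst-case risk under a KL penalty has the closed form `tau * log(mean(exp(loss / tau)))`, a log-partition expression of the kind that appears in free-energy formulas, and the worst-case weights are proportional to `exp(loss / tau)`. The function names and the toy losses are assumptions for illustration.

```python
import math


def kl_dro_risk(losses, tau):
    """Worst-case risk under a KL penalty: tau * log(mean(exp(loss / tau))).

    Computed with the max subtracted first for numerical stability.
    """
    m = max(losses)
    z = sum(math.exp((l - m) / tau) for l in losses) / len(losses)
    return m + tau * math.log(z)


def worst_case_weights(losses, tau):
    """Sample weights of the worst-case distribution: proportional to exp(loss / tau)."""
    m = max(losses)
    w = [math.exp((l - m) / tau) for l in losses]
    total = sum(w)
    return [x / total for x in w]


# Two well-fit samples and one high-loss (possibly noisy) sample.
losses = [0.1, 0.2, 5.0]
average_risk = sum(losses) / len(losses)
robust_risk = kl_dro_risk(losses, tau=1.0)
weights = worst_case_weights(losses, tau=1.0)
print(average_risk, robust_risk, weights)
```

With these toy losses, almost all of the worst-case weight falls on the single high-loss sample, which is the "excessive focus on noisy samples" the abstract identifies; as `tau` grows, the robust risk approaches the average risk.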