RADAR: Benchmarking Language Models on Imperfect Tabular Data

Neural Information Processing Systems

Language models (LMs) are increasingly being deployed to perform autonomous data analyses. However, their data awareness (the ability to recognize, reason over, and appropriately handle data artifacts such as missing values, outliers, and logical inconsistencies) remains underexplored. These artifacts are especially common in real-world tabular data and, if mishandled, can significantly compromise the validity of analytical conclusions. To address this gap, we present RADAR, a benchmark for systematically evaluating data-aware reasoning on tabular data. We develop a framework to simulate data artifacts via programmatic perturbations to enable targeted evaluation of model behavior. RADAR comprises 2,980 table-query pairs, grounded in real-world data spanning 9 domains and 5 data artifact types. In addition to evaluating artifact handling, RADAR systematically varies table size to study how reasoning performance holds up as table size increases. Our evaluation reveals that, despite decent performance on tables without data artifacts, frontier models degrade significantly when data artifacts are introduced, exposing critical gaps in their capacity for robust, data-aware analysis. Designed to be flexible and extensible, RADAR supports diverse perturbation types and controllable table sizes, offering a valuable resource for advancing tabular reasoning.
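As a rough illustration of what "programmatic perturbations" could look like, the sketch below injects two of the artifact types named in the abstract (missing values and outliers) into a toy table. The function names, the table layout, and the values are assumptions made for illustration; they are not taken from the RADAR codebase.

```python
import copy


def inject_missing(rows, column, row_indices):
    """Return a copy of the table with `column` set to None in the given rows."""
    perturbed = copy.deepcopy(rows)
    for i in row_indices:
        perturbed[i][column] = None
    return perturbed


def inject_outlier(rows, column, row_index, factor=100.0):
    """Return a copy of the table with one value scaled by `factor` (hypothetical outlier model)."""
    perturbed = copy.deepcopy(rows)
    perturbed[row_index][column] = perturbed[row_index][column] * factor
    return perturbed


def mean_ignoring_missing(rows, column):
    """A data-aware aggregate: skip missing cells instead of failing or treating them as 0."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values)


# Toy table (made-up values, not from the benchmark).
table = [
    {"patient": "a", "heart_rate": 60.0},
    {"patient": "b", "heart_rate": 70.0},
    {"patient": "c", "heart_rate": 80.0},
]

with_missing = inject_missing(table, "heart_rate", [1])
with_outlier = inject_outlier(table, "heart_rate", 2)
```

Because each perturbation returns a modified copy, the clean table stays available as a reference, so the same query can be evaluated on both the clean and the perturbed version.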


Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

arXiv.org Machine Learning

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), a method that dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.
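The abstract does not give the formula for Behavioral Divergence or the batch-size schedule, so the following is only a minimal sketch of the idea under stated assumptions: divergence is taken to be the fraction of probe states whose greedy action changed between two consecutive updates, and the batch size is interpolated linearly between hypothetical bounds `min_batch` and `max_batch`, decreasing as divergence grows.

```python
def behavioral_divergence(prev_actions, new_actions):
    """Fraction of probe states whose greedy action changed (assumed definition)."""
    if not prev_actions or len(prev_actions) != len(new_actions):
        raise ValueError("action lists must be non-empty and of equal length")
    changed = sum(1 for a, b in zip(prev_actions, new_actions) if a != b)
    return changed / len(prev_actions)


def adaptive_batch_size(divergence, min_batch=32, max_batch=1024):
    """Map divergence in [0, 1] to a batch size: volatile policy -> small batch.

    Linear interpolation is an illustrative choice, not the paper's schedule.
    """
    d = min(max(divergence, 0.0), 1.0)  # clamp to [0, 1]
    return int(round(max_batch - d * (max_batch - min_batch)))


# Greedy actions on four probe states before and after one update (toy values).
div = behavioral_divergence([0, 1, 2, 3], [0, 1, 2, 0])
batch = adaptive_batch_size(div)
```

Under this sketch, a policy whose actions no longer change between updates receives the largest batch, while a policy that flips every action receives the smallest.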


Learning stochastic multiscale models through normalizing flows

arXiv.org Machine Learning

Many systems in physics, engineering, and biology exhibit multiscale stochastic dynamics, where low-dimensional slow variables evolve under the influence of high-dimensional fast processes. In practice, observations are often limited to a single trajectory of the slow component, while the fast dynamics remain unobserved, making statistical learning challenging. Approaches based on partial differential equations (PDEs), such as Fokker-Planck formulations, aim to characterize the evolution of probability densities, typically requiring dense space-time data or grid-based solvers. In contrast, we adopt a trajectory-based perspective and develop a data-driven framework for learning effective stochastic dynamics from a single observed path. We model the dynamics by coupled multiscale stochastic differential equations (SDEs) and first obtain a principled model reduction through stochastic averaging. Unlike generic model reduction techniques such as PCA, this respects the dynamical structure of the original system and explicitly incorporates the interaction between slow and fast scales. A central challenge, however, is that the reduced model depends on the invariant distribution of the fast process, which is a solution to an intractable and often unknown PDE. We introduce a novel learning framework that parameterizes the invariant distribution using normalizing flows, enabling expressive density modeling in the latent fast-variable space. The flow is trained end-to-end by optimizing a penalized likelihood objective induced by the reduced stochastic dynamics. Furthermore, we develop a Bayesian variational inference procedure for uncertainty quantification, employing a second normalizing flow to approximate the posterior distribution over model parameters. This yields a scalable approach to capturing epistemic uncertainty in multiscale systems.
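For readers unfamiliar with stochastic averaging, a textbook form of the reduction is sketched below. The notation ($f$, $g$, $\sigma$, $\tau$, $\varepsilon$, $\mu_x$) is assumed here for illustration and may differ from the paper's own setup.

```latex
% Slow-fast system with scale separation \varepsilon \ll 1
% (W and B are independent Brownian motions):
\begin{aligned}
  dX_t &= f(X_t, Y_t)\,dt + \sigma(X_t)\,dW_t, \\
  dY_t &= \tfrac{1}{\varepsilon}\, g(X_t, Y_t)\,dt
          + \tfrac{1}{\sqrt{\varepsilon}}\, \tau(X_t, Y_t)\,dB_t .
\end{aligned}

% Averaged (reduced) dynamics as \varepsilon \to 0, where \mu_x is the
% invariant distribution of the fast process Y with the slow variable frozen at x:
d\bar{X}_t = \bar{f}(\bar{X}_t)\,dt + \sigma(\bar{X}_t)\,dW_t,
\qquad
\bar{f}(x) = \int f(x, y)\,\mu_x(dy).
```

In this form the dependence on $\mu_x$ is explicit: the reduced drift $\bar{f}$ cannot be evaluated without it, which is the quantity the abstract proposes to parameterize with a normalizing flow.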


Theory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

arXiv.org Machine Learning

Large language models (LLMs) have achieved remarkable performance but face critical challenges: hallucinations and high inference costs. Leveraging multiple experts offers a solution: deferring uncertain inputs to more capable experts improves reliability, while routing simpler queries to smaller, distilled models enhances efficiency. This motivates the problem of learning with multiple-expert deferral. This thesis presents a comprehensive study of this problem and the related problem of learning with abstention, supported by strong consistency guarantees. First, for learning with abstention (a special case of deferral), we analyze score-based and predictor-rejector formulations in multi-class classification. We introduce new families of surrogate losses and prove strong non-asymptotic, hypothesis set-specific consistency guarantees, resolving two existing open questions. We analyze both single-stage and practical two-stage settings, with experiments on CIFAR-10, CIFAR-100, and SVHN demonstrating the superior performance of our algorithms. Second, we address general multi-expert deferral in classification. We design new surrogate losses for both single-stage and two-stage scenarios and prove they benefit from strong $H$-consistency bounds. For the two-stage scenario, we show that our surrogate losses are realizable $H$-consistent for constant cost functions, leading to effective new algorithms. Finally, we introduce a novel framework for regression with deferral to address continuous label spaces. Our versatile framework accommodates multiple experts and various cost structures, supporting both single-stage and two-stage methods. It subsumes recent work on regression with abstention. We propose new surrogate losses with proven $H$-consistency and demonstrate the empirical effectiveness of the resulting algorithms.
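As background, the predictor-rejector formulation of abstention is commonly written with the target loss below. This is the standard form from the abstention literature, given here as an assumption about the setting rather than a quotation from the thesis; $h$ is the predictor, $r$ the rejector, and $c \in (0, 1)$ an abstention cost.

```latex
% Abstain when r(x) \le 0 and pay cost c; otherwise pay the 0-1 loss of h.
L_{\mathrm{abs}}(h, r, x, y)
  = \mathbb{1}_{h(x) \neq y}\,\mathbb{1}_{r(x) > 0}
  + c\,\mathbb{1}_{r(x) \le 0}.
```

Because this loss is discontinuous in $h$ and $r$, it is hard to optimize directly. That is why the surrogate losses and their $H$-consistency guarantees matter.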


An Additive Manufacturing Part Qualification Framework: Transferring Knowledge of Stress-strain Behaviors from Additively Manufactured Polymers to Metals

arXiv.org Artificial Intelligence

Part qualification is crucial in additive manufacturing (AM) because it ensures that additively manufactured parts can be consistently produced and reliably used in critical applications. Part qualification aims at verifying that an additively manufactured part meets performance requirements; therefore, predicting the complex stress-strain behaviors of additively manufactured parts is critical. We develop a dynamic time warping (DTW)-transfer learning (TL) framework for additive manufacturing part qualification by transferring knowledge of the stress-strain behaviors of additively manufactured low-cost polymers to metals. Specifically, the framework employs DTW to select a polymer dataset as the source domain that is the most relevant to the target metal dataset. We demonstrate the effectiveness of the DTW-TL framework with a long short-term memory (LSTM) model, using four source polymers (i.e., Nylon, PLA, CF-ABS, and Resin) and three target metals (i.e., AlSi10Mg, Ti6Al4V, and carbon steel) fabricated by different AM techniques. Experimental results show that the DTW-TL framework identifies the closest match between polymers and metals and selects a single polymer dataset as the source domain. With the three metals as target domains, the DTW-TL model achieves the lowest mean absolute percentage error (12.41%) and the highest coefficient of determination (0.96), outperforming both the vanilla LSTM model without TL and the TL model pre-trained on all four polymer datasets as the source domain.
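The source-selection step can be sketched with the classic dynamic-programming form of DTW: compute the DTW distance between the target curve and each candidate source curve, then keep the closest source. The abstract does not specify the local cost, any normalization, or how curves are represented, so the absolute-difference cost, the function names, and the toy curves below are all assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


def select_source(target, sources):
    """Return the name of the source curve with the smallest DTW distance to the target."""
    return min(sources, key=lambda name: dtw_distance(sources[name], target))


# Toy stress values sampled along strain (made-up numbers, not from the paper).
target_metal = [0.0, 1.0, 2.0, 3.0]
candidate_polymers = {
    "polymer_a": [0.0, 1.0, 2.0, 2.0, 3.0],  # same shape, sampled differently
    "polymer_b": [0.0, 5.0, 10.0, 15.0],     # different scale
}
best = select_source(target_metal, candidate_polymers)
```

DTW is useful here because curves measured at different sampling rates or lengths can still be compared by shape: `polymer_a` aligns with the target at zero cost despite having an extra sample.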