AITopics

2605.28767

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

arXiv.org Machine Learning, May-1-2026

Optimized Deferral for Imbalanced Settings

Cortes, Corinna, Mao, Anqi, Mohri, Mehryar, Zhong, Yutao

Learning algorithms can be significantly improved by routing complex or uncertain inputs to specialized experts, balancing accuracy with computational cost. This approach, known as learning to defer, is essential in domains like natural language generation, medical diagnosis, and computer vision, where an effective deferral can reduce errors at low extra resource consumption. However, the two-stage learning to defer setting, which leverages existing predictors such as a collection of LLMs or other classifiers, often faces challenges due to an expert imbalance problem. This imbalance can lead to suboptimal performance, with deferral algorithms favoring the majority expert. We present a comprehensive study of two-stage learning to defer in expert imbalance settings. We cast the deferral loss optimization as a novel cost-sensitive learning problem over the input-expert domain. We derive new margin-based loss functions and guarantees tailored to this setting, and develop novel algorithms for cost-sensitive learning. Leveraging these results, we design principled deferral algorithms, MILD (Margin-based Imbalanced Learning to Defer), specifically suited for expert imbalance settings. Extensive experiments demonstrate the effectiveness of our approach, showing clear improvements over existing baselines on both image classification and real-world Large Language Model (LLM) routing tasks.
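To make the deferral objective concrete, here is a minimal Python sketch of how a two-stage deferral loss could be evaluated. It assumes each expert has a 0/1 error on every input and a fixed per-query cost. The function name `deferral_loss`, the cost values, and the toy data are illustrative assumptions, not taken from the paper, and MILD's margin-based surrogate is not reproduced here.

```python
from typing import Sequence


def deferral_loss(
    routes: Sequence[int],
    expert_errors: Sequence[Sequence[int]],
    expert_costs: Sequence[float],
) -> float:
    """Average deferral loss: error of the chosen expert plus its cost.

    routes[i]            index of the expert chosen for input i
    expert_errors[i][j]  1 if expert j is wrong on input i, else 0
    expert_costs[j]      assumed fixed cost of querying expert j
    """
    total = 0.0
    for i, j in enumerate(routes):
        total += expert_errors[i][j] + expert_costs[j]
    return total / len(routes)


# Toy imbalanced setting (made-up data): expert 0 is correct on three of the
# four inputs, expert 1 is correct only on the last one and costs more.
errors = [[0, 1], [0, 1], [0, 1], [1, 0]]
costs = [0.0, 0.1]

always_majority = deferral_loss([0, 0, 0, 0], errors, costs)
oracle_routing = deferral_loss([0, 0, 0, 1], errors, costs)
print(always_majority, oracle_routing)
```

In this toy case, a router that always picks the majority expert pays for the error on the single minority input, which is the failure mode the abstract attributes to expert imbalance.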

large language model, machine learning, natural language, (16 more...)

2604.27723

Country: North America (0.46)

Genre: Research Report (0.40)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Mohri, Mehryar, Zhong, Yutao

Mind the Gap: Structure-Aware Consistency in Preference Learning

arXiv.org Machine Learning, May-1-2026

Abstract: Preference learning has become the foundation of aligning Large Language Models (LLMs) with human intent. Popular methods, such as Direct Preference Optimization (DPO), minimize surrogate [...] However, we demonstrate that for the equicontinuous hypothesis sets typical of neural networks, these standard surrogates are theoretically inconsistent, yielding vacuous generalization [...] To resolve this, we formulate LLM alignment within a margin-shifted ranking framework. [...] bounds that depend on enforcing a separation margin γ. Crucially, we extend this to Structure-Aware H-consistency, introducing a novel objective (SA-DPO) that adapts the margin based on the semantic distance between responses to handle synonyms and hard pairs. Finally, we analyze the trade-off between consistency and [...] the Polynomial Hinge family) offer superior consistency guarantees for capacity-bounded models compared to the standard logistic loss used in DPO.

1. Introduction: The alignment of Large Language Models (LLMs) has shifted from explicit Reward Modeling (Stiennon et al., [...] surrogate loss (e.g., the logistic loss) as a proxy for the true objective: the non-convex, discontinuous 0-1 ranking loss. This reliance raises a fundamental theoretical question that remains largely unanswered for deep networks: Does minimizing these surrogate losses actually guarantee the minimization of the true ranking error? In this work, we investigate this question through the lens of H-consistency (Mao, Mohri, and Zhong, 2023e). We formulate LLM preference learning as a pairwise ranking problem and derive a series of results that bridge the gap between learning theory and practical fine-tuning. [...] we identify a fundamental theoretical deficiency in standard [...] We demonstrate that for equicontinuous hypothesis sets, a property satisfied by neural networks, standard surrogate minimization yields vacuous consistency guarantees. Specifically, without explicit constraints, a model can achieve arbitrarily low surrogate risk while maintaining a high ranking error, effectively "cheating" the objective by shrinking score differences rather than learning the correct ordering. We prove that enforcing a confidence gap γ is not merely a heuristic, but a strict requirement for H-consistency in the deep learning regime. However, while a uniform margin restores consistency, it is a blunt instrument. We show that demanding a large, fixed margin on semantically identical pairs (synonyms) forces the model to hallucinate differences where none exist, introducing bias and instability. To address this, we propose Structure-Aware H-consistency and a corresponding objective, Structure-Aware DPO (SA-DPO).
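As a rough illustration of the margin idea, the Python sketch below shifts a logistic loss on the preferred-minus-rejected score difference by a margin γ and lets that margin depend on a semantic distance between the two responses. Everything here is an assumption for illustration: the shifted form `loss(diff - margin)`, the linear margin schedule, and the names `logistic_loss` and `structure_aware_margin` are hypothetical, and the excerpt does not give the actual SA-DPO objective.

```python
import math


def logistic_loss(score_diff: float, margin: float = 0.0) -> float:
    """Logistic loss on the score difference, shifted by a margin (assumed form)."""
    return math.log1p(math.exp(-(score_diff - margin)))


def structure_aware_margin(semantic_distance: float, max_margin: float = 1.0) -> float:
    """Hypothetical schedule: margin grows linearly with distance clipped to [0, 1]."""
    clipped = min(max(semantic_distance, 0.0), 1.0)
    return max_margin * clipped


score_diff = 0.2  # made-up score gap between preferred and rejected responses

# Near-synonyms (distance 0): no margin is demanded, so the loss is unshifted.
synonym_loss = logistic_loss(score_diff, structure_aware_margin(0.0))

# Clearly different responses (distance 1): the full margin is demanded.
distinct_loss = logistic_loss(score_diff, structure_aware_margin(1.0))
print(synonym_loss, distinct_loss)
```

Under these assumptions, the same small score gap is penalized more heavily for semantically distant responses than for near-synonyms, which is the qualitative behavior the text describes for a distance-adapted margin.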

large language model, machine learning, natural language, (16 more...)

2604.27733

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Mohri, Mehryar, Zhong, Yutao

Linear-Core Surrogates: Smooth Loss Functions with Linear Rates for Classification and Structured Prediction

arXiv.org Machine Learning, May-1-2026

The choice of loss function in classification involves a fundamental trade-off: smooth losses (like Cross-Entropy) enable fast optimization rates but yield slow square-root consistency bounds, while piecewise-linear losses (like Hinge) offer fast linear consistency rates but suffer from non-differentiability. We propose Linear-Core (LC) Surrogates, a new family of convex loss functions that resolve this tension by stitching a linear core to a smooth tail. We prove that these surrogates are differentiable everywhere while retaining strict linear $H$-consistency bounds, effectively combining the optimization benefits of smoothness with the statistical efficiency of margin-based losses. In the structured prediction setting, we show that this smoothness unlocks a massive computational and energy advantage: it allows for an unbiased stochastic gradient estimator that bypasses the quadratic complexity $O(|\mathscr{Y}|^2)$ of exact inference (e.g., Viterbi). Empirically, our method achieves a 23$\times$ speedup over Structured SVMs on large-vocabulary sequence tagging tasks and demonstrates superior robustness to instance-dependent label noise, outperforming Cross-Entropy by 2.6% on corrupted CIFAR-10.
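To visualize what "stitching a linear core to a smooth tail" could look like, here is a minimal Python sketch of one hypothetical margin loss built that way: linear for non-positive margins and exponentially decaying for positive ones. This specific function is an assumption chosen for illustration, not the paper's definition of an LC surrogate, and no claim is made here about its $H$-consistency rate.

```python
import math


def linear_core_sketch(t: float) -> float:
    """Hypothetical stitched loss: linear piece for t <= 0, smooth tail for t > 0."""
    if t <= 0.0:
        return 1.0 - t       # linear piece, slope -1
    return math.exp(-t)      # smooth tail, value 1 and slope -1 at t = 0


def linear_core_sketch_grad(t: float) -> float:
    """Derivative of the sketch; the two pieces agree at t = 0."""
    if t <= 0.0:
        return -1.0
    return -math.exp(-t)


# The value and the slope match at the stitching point, so the function is
# differentiable everywhere; the slope is non-decreasing, so it is convex.
for t in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(t, linear_core_sketch(t), linear_core_sketch_grad(t))
```

Unlike the hinge loss, this sketch has no kink, so plain gradient methods apply at every point; how the paper's actual family balances the two pieces is not specified in the abstract.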

artificial intelligence, inductive learning, machine learning, (12 more...)

2604.27742

Country: North America (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Neural Information Processing Systems, Apr-24-2026, 20:17:35 GMT

0e441913d4fa486c3eec967d79750b13-Supplemental-Conference.pdf

artificial intelligence, cstnd, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Apr-24-2026, 20:17:31 GMT

H-Consistency Bounds: Characterization and Extensions

A series of recent publications by Awasthi, Mao, Mohri, and Zhong [2022b] have introduced the key notion of H-consistency bounds for surrogate loss functions. These are upper bounds on the zero-one estimation error of any predictor in a hypothesis set, expressed in terms of its surrogate loss estimation error. They are both non-asymptotic and hypothesis set-specific and thus stronger and more informative than Bayes-consistency. However, determining if they hold and deriving these bounds have required a specific proof and analysis for each surrogate loss. Can we derive more general tools and characterizations?
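In symbols, a bound of the kind described above can be sketched as follows. The notation is assumed for illustration: $\ell$ is the surrogate loss, $\ell_{0-1}$ the zero-one loss, $\mathcal{R}_{\ell}(h)$ the expected loss of a predictor $h$, $\mathcal{R}^*_{\ell}(\mathcal{H})$ the best-in-class expected loss over the hypothesis set $\mathcal{H}$, and $f$ a non-decreasing function. The paper's exact statements may carry additional terms.

```latex
\[
  \mathcal{R}_{\ell_{0-1}}(h) - \mathcal{R}^*_{\ell_{0-1}}(\mathcal{H})
  \;\le\;
  f\bigl(\mathcal{R}_{\ell}(h) - \mathcal{R}^*_{\ell}(\mathcal{H})\bigr)
  \qquad \text{for all } h \in \mathcal{H}.
\]
```

Both sides are estimation errors relative to the best predictor in $\mathcal{H}$ rather than to the Bayes-optimal predictor, which is what makes the bound hypothesis set-specific, and it holds for every $h$ rather than only in a limit, which is what makes it non-asymptotic.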

artificial intelligence, h-consistency, machine learning, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Feb-16-2026, 09:02:37 GMT

Realizable H-Consistent and Bayes-Consistent Loss Functions for Learning to Defer

We present a comprehensive study of surrogate loss functions for learning to defer.

artificial intelligence, machine learning, natural language, (15 more...)

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Natural Language (0.68)

Neural Information Processing Systems, Feb-15-2026, 07:29:17 GMT

Efficient Gradient Computation for Structured Output Learning with Rational and Tropical Losses

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, Dmitry Storcheus, Scott Yang

Many of these algorithms have been successfully used with specific loss functions such as the Hamming loss. Their use has also been extended to multivariate performance measures such as Precision/Recall or F1-score (Joachims, 2005), which depend on predictions on all training points.

artificial intelligence, machine learning, natural language, (19 more...)