AITopics

2601.19055

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceOct-6-2025

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Shi, Ruizhe, Song, Minhak, Zhou, Runlong, Zhang, Zihan, Fazel, Maryam, Du, Simon S.

We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model -- highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

artificial intelligence, machine learning, py 1, (19 more...)

2505.1977

Country: North America > United States (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Akyildiz, O. Deniz, del Moral, Pierre, Miguez, Joaquin

On the contraction properties of Sinkhorn semigroups

arXiv.org Machine LearningMar-12-2025

We develop a novel semigroup contraction analysis based on Lyapunov techniques to prove the exponential convergence of Sinkhorn equations on weighted Banach spaces. This operator-theoretic framework yields exponential decays of Sinkhorn iterates towards Schr\"odinger bridges with respect to general classes of $\phi$-divergences as well as in weighted Banach spaces. To the best of our knowledge, these are the first results of this type in the literature on entropic transport and the Sinkhorn algorithm. We also illustrate the impact of these results in the context of multivariate linear Gaussian models as well as statistical finite mixture models including Gaussian-kernel density estimation of complex data distributions arising in generative models.

contraction coefficient, mq 1, py 1, (15 more...)

2503.09887

Country:

North America > United States > New York (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
(3 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.66)

Zhou, Runlong, Fazel, Maryam, Du, Simon S.

Extragradient Preference Optimization (EGPO): Beyond Last-Iterate Convergence for Nash Learning from Human Feedback

arXiv.org Artificial IntelligenceMar-11-2025

Reinforcement learning from human feedback (RLHF) has become essential for improving language model capabilities, but traditional approaches rely on the assumption that human preferences follow a transitive Bradley-Terry model. This assumption fails to capture the non-transitive nature of populational human preferences. Nash learning from human feedback (NLHF), targeting non-transitive preferences, is a problem of computing the Nash equilibrium (NE) of the two-player constant-sum game defined by the human preference. We introduce Extragradient preference optimization (EGPO), a novel algorithm for NLHF achieving last-iterate linear convergence to the NE of KL-regularized games and polynomial convergence to the NE of original games, while being robust to noise. Unlike previous approaches that rely on nested optimization, we derive an equivalent implementation using gradients of an online variant of the identity preference optimization (IPO) loss, enabling more faithful implementation for neural networks. Our empirical evaluations demonstrate EGPO's superior performance over baseline methods when training for the same number of epochs, as measured by pairwise win-rates using the ground truth preference.

dual gap dual gap, online ipo 1, online ipo 2, (12 more...)

2503.08942

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Leisure & Entertainment (0.67)
Law > Statutes (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Shi, Ruizhe, Zhou, Runlong, Du, Simon S.

The Crucial Role of Samplers in Online Direct Preference Optimization

arXiv.org Artificial IntelligenceOct-5-2024

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the $\textit{optimization}$ properties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO's $\textit{convergence rates}$ with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textit{linear}$ convergence, while our proposed online sampler achieves $\textit{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and $\textit{logit mixing}$, demonstrating significant improvements over previous approaches. On Safe-RLHF dataset, our method exhibits a $4.5$% improvement over vanilla DPO and a $3.0$% improvement over on-policy DPO; on Iterative-Prompt, our approach outperforms vanilla DPO, on-policy DPO, and Hybrid GSHF by over $4.2$%. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.

dpo-mix-p, dpo-mix-r, sampler, (14 more...)

2409.19605

Country: North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

arXiv.org Machine LearningNov-13-2023

Towards Bounding Causal Effects under Markov Equivalence

Bellot, Alexis

Predicting the effect of unseen interventions is a fundamental research question across the data sciences. It is well established that, in general, such questions cannot be answered definitively from observational data, e.g., as a consequence of unobserved confounding. A generalization of this task is to determine non-trivial bounds on causal effects induced by the data, also known as the task of partial causal identification. In the literature, several algorithms have been developed for solving this problem. Most, however, require a known parametric form or a fully specified causal diagram as input, which is usually not available in practical applications. In this paper, we assume as input a less informative structure known as a Partial Ancestral Graph, which represents a Markov equivalence class of causal diagrams and is learnable from observational data. In this more "data-driven" setting, we provide a systematic algorithm to derive bounds on causal effects that can be computed analytically.

artificial intelligence, causal diagram, machine learning, (18 more...)

2311.07259

Country:

North America > United States > California (0.04)
Europe > United Kingdom > England > Greater London > London (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.94)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Kivva, Yaroslav, Etesami, Jalal, Kiyavash, Negar

On Identifiability of Conditional Causal Effects

arXiv.org Artificial IntelligenceJun-19-2023

We address the problem of identifiability of an arbitrary conditional causal effect given both the causal graph and a set of any observational and/or interventional distributions of the form $Q[S]:=P(S|do(V\setminus S))$, where $V$ denotes the set of all observed variables and $S\subseteq V$. We call this problem conditional generalized identifiability (c-gID in short) and prove the completeness of Pearl's $do$-calculus for the c-gID problem by providing sound and complete algorithm for the c-gID problem. This work revisited the c-gID problem in Lee et al. [2020], Correa et al. [2021] by adding explicitly the positivity assumption which is crucial for identifiability. It extends the results of [Lee et al., 2019, Kivva et al., 2022] on general identifiability (gID) which studied the problem for unconditional causal effects and Shpitser and Pearl [2006b] on identifiability of conditional causal effects given merely the observational distribution $P(\mathbf{V})$ as our algorithm generalizes the algorithms proposed in [Kivva et al., 2022] and [Shpitser and Pearl, 2006b].

artificial intelligence, machine learning, realization, (17 more...)

2306.11755

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
Information Technology > Artificial Intelligence > Machine Learning (0.67)

arXiv.org Machine LearningNov-2-2019

Estimating Certain Integral Probability Metric (IPM) is as Hard as Estimating under the IPM

Liang, Tengyuan

In this note, we study the minimax optimal rates for estimati ng the Integral Probability Metrics (IPMs) between probability measures based on samples. IPMs are widely used in both statistics and machine learning, with applications in nonparametric t wo-sample tests [ 23; 10 ], inferring the transportation cost (the Wasserstein-1 metric) from one se t of samples to another [ 22; 20 ], and with more recent appearances in rigorous investigations on the g enerative adversarial networks (GANs) [ 1; 17; 14; 21; 25; 3 ].

arxiv preprint arxiv, ph jk py 1, py 1, (11 more...)

1911.0073

Country: North America > United States > Illinois > Cook County > Chicago (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.88)

Imaizumi, Masaaki, Fukumizu, Kenji

Deep Neural Networks Learn Non-Smooth Functions Effectively

arXiv.org Machine LearningFeb-13-2018

We theoretically discuss why deep neural networks (DNNs) performs better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding its mechanism is still a challenging problem. From an aspect of the statistical theory, it is known many standard methods attain optimal convergence rates, and thus it has been difficult to find theoretical advantages of DNNs. This paper fills this gap by considering learning of a certain class of non-smooth functions, which was not covered by the previous theory. We derive convergence rates of estimators by DNNs with a ReLU activation, and show that the estimators by DNNs are almost optimal to estimate the non-smooth functions, while some of the popular models do not attain the optimal rate. In addition, our theoretical result provides guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments to support the theoretical results.

artificial intelligence, estimator, machine learning, (16 more...)

1802.04474

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)