AITopics | Duan, Yaqi

Collaborating Authors

Duan, Yaqi

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

PILAF: Optimal Human Preference Sampling for Reward Modeling

Feng, Yunzhen, Kwiatkowski, Ariel, Zheng, Kunhao, Kempe, Julia, Duan, Yaqi

arXiv.org Machine LearningFeb-6-2025

As large language models increasingly drive real-world applications, aligning them with human values becomes paramount. Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique, translating preference data into reward models when oracle human values remain inaccessible. In practice, RLHF mostly relies on approximate reward models, which may not consistently guide the policy toward maximizing the underlying human values. We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling that explicitly aligns preference learning with maximizing the underlying oracle reward. PILAF is theoretically grounded, demonstrating optimality from both an optimization and a statistical perspective. The method is straightforward to implement and demonstrates strong performance in iterative and online RLHF settings where feedback curation is critical.

large language model, machine learning, natural language, (19 more...)

arXiv.org Machine Learning

2502.0427

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Localized exploration in contextual dynamic pricing achieves dimension-free regret

Chai, Jinhang, Duan, Yaqi, Fan, Jianqing, Wang, Kaizheng

arXiv.org Machine LearningDec-26-2024

We study the problem of contextual dynamic pricing with a linear demand model. We propose a novel localized exploration-then-commit (LetC) algorithm which starts with a pure exploration stage, followed by a refinement stage that explores near the learned optimal pricing policy, and finally enters a pure exploitation stage. The algorithm is shown to achieve a minimax optimal, dimension-free regret bound when the time horizon exceeds a polynomial of the covariate dimension. Furthermore, we provide a general theoretical framework that encompasses the entire time spectrum, demonstrating how to balance exploration and exploitation when the horizon is limited. The analysis is powered by a novel critical inequality that depicts the exploration-exploitation trade-off in dynamic pricing, mirroring its existing counterpart for the bias-variance trade-off in regularized regression. Our theoretical results are validated by extensive experiments on synthetic and real-world data.

artificial intelligence, inequality, machine learning, (19 more...)

arXiv.org Machine Learning

2412.19252

Genre: Research Report (0.81)

Industry: Energy > Oil & Gas > Upstream (0.86)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

Add feedback

Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces

Duan, Yaqi, Wainwright, Martin J.

arXiv.org Machine LearningJan-10-2024

We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2401.05233

Country: North America > United States (0.27)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

Adaptive and Robust Multi-task Learning

Duan, Yaqi, Wang, Kaizheng

arXiv.org Machine LearningFeb-10-2022

Multi-task learning (MTL) solves a number of learning tasks simultaneously. It has become increasingly popular in modern applications with data generated by multiple sources. When the tasks share certain common structures, a properly chosen MTL algorithm can leverage that to improve the performance. However, task relatedness is usually unknown and hard to quantify in practice; heterogeneity can even make multi-task approaches perform worse than single-task learning, which trains models separately on their individual datasets. In this paper, we study MTL from a statistical perspective and develop a family of reliable approaches that adapt to the unknown task relatedness and are robust against outlier tasks with possibly contaminated data.

artificial intelligence, machine learning, survey article, (19 more...)

arXiv.org Machine Learning

2202.0525

Country:

North America > United States > New York (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.67)

Add feedback

Optimal policy evaluation using kernel-based temporal difference methods

Duan, Yaqi, Wang, Mengdi, Wainwright, Martin J.

arXiv.org Machine LearningSep-24-2021

We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We study a regularized form of the kernel least-squares temporal difference (LSTD) estimate; in the population limit of infinite data, it corresponds to the fixed point of a projected Bellman operator defined by the associated reproducing kernel Hilbert space. The estimator itself is obtained by computing the projected fixed point induced by a regularized version of the empirical operator; due to the underlying kernel structure, this reduces to solving a linear system involving kernel matrices. We analyze the error of this estimate in the $L^2(\mu)$-norm, where $\mu$ denotes the stationary distribution of the underlying Markov chain. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and population-level kernel LSTD solutions. We use empirical process theory techniques to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which shows that our rate is optimal in terms of the sample size $n$ and the effective horizon $H = (1 - \gamma)^{-1}$. Whereas existing worst-case theory predicts cubic scaling ($H^3$) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2109.12002

Country: North America > United States > Massachusetts > Middlesex County (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.55)

Add feedback

Learning Good State and Action Representations via Tensor Decomposition

Ni, Chengzhuo, Zhang, Anru, Duan, Yaqi, Wang, Mengdi

arXiv.org Machine LearningMay-3-2021

The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP's tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding.

action space, artificial intelligence, machine learning, (17 more...)

arXiv.org Machine Learning

2105.01136

Country: North America > United States > Wisconsin > Dane County > Madison (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.66)

Add feedback

Risk Bounds and Rademacher Complexity in Batch Reinforcement Learning

Duan, Yaqi, Jin, Chi, Li, Zhiyuan

arXiv.org Artificial IntelligenceMar-25-2021

This paper considers batch Reinforcement Learning (RL) with general value function approximation. Our study investigates the minimal assumptions to reliably estimate/minimize Bellman error, and characterizes the generalization performance by (local) Rademacher complexities of general function classes, which makes initial steps in bridging the gap between statistical learning theory and batch RL. Concretely, we view the Bellman error as a surrogate loss for the optimality gap, and prove the followings: (1) In double sampling regime, the excess risk of Empirical Risk Minimizer (ERM) is bounded by the Rademacher complexity of the function class. (2) In the single sampling regime, sample-efficient risk minimization is not possible without further assumptions, regardless of algorithms. However, with completeness assumptions, the excess risk of FQI and a minimax style algorithm can be again bounded by the Rademacher complexity of the corresponding function classes. (3) Fast statistical rates can be achieved by using tools of local Rademacher complexity. Our analysis covers a wide range of function classes, including finite classes, linear spaces, kernel spaces, sparse linear features, etc.

artificial intelligence, rademacher complexity, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2103.13883

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)

Add feedback

Bootstrapping Statistical Inference for Off-Policy Evaluation

Hao, Botao, Ji, Xiang, Duan, Yaqi, Lu, Hao, Szepesvári, Csaba, Wang, Mengdi

arXiv.org Machine LearningFeb-9-2021

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical property is less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computation limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootrapping method in classical RL environments for confidence interval estimation, estimating the variance of off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

artificial intelligence, bootstrapping statistical inference, health & medicine, (15 more...)

arXiv.org Machine Learning

2102.03607

Country: North America > Canada > Alberta (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient

Hao, Botao, Duan, Yaqi, Lattimore, Tor, Szepesvári, Csaba, Wang, Mengdi

arXiv.org Machine LearningNov-8-2020

This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient. We first consider the off-policy policy evaluation problem. To evaluate a new target policy, we analyze a Lasso fitted Q-evaluation method and establish a finite-sample error bound that has no polynomial dependence on the ambient dimension. To reduce the Lasso bias, we further propose a post model-selection estimator that applies fitted Q-evaluation to the features selected via group Lasso. Under an additional signal strength assumption, we derive a sharper instance-dependent error bound that depends on a divergence function measuring the distribution mismatch between the data distribution and occupancy measure of the target policy. Further, we study the Lasso fitted Q-iteration for batch policy optimization and establish a finite-sample error bound depending on the ratio between the number of relevant features and restricted minimal eigenvalue of the data's covariance. In the end, we complement the results with minimax lower bounds for batch-data policy evaluation/optimization that nearly match our upper bounds. The results suggest that having well-conditioned data is crucial for sparse batch policy learning.

artificial intelligence, probability, reinforcement learning, (13 more...)

arXiv.org Machine Learning

2011.04019

Country:

North America (0.28)
Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Learning low-dimensional state embeddings and metastable clusters from time series data

Sun, Yifan, Duan, Yaqi, Gong, Hao, Wang, Mengdi

Neural Information Processing SystemsMar-18-2020, 22:17:15 GMT

This paper studies how to find compact state embeddings from high-dimensional Markov state trajectories, where the transition kernel has a small intrinsic rank. In the spirit of diffusion map, we propose an efficient method for learning a low-dimensional state embedding and capturing the process's dynamics. This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. State embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rate are proved.

artificial intelligence, low-dimensional state, machine learning, (4 more...)

Neural Information Processing Systems

Genre: Research Report (0.45)

Industry: Leisure & Entertainment > Games (0.45)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.77)

Add feedback