AITopics

2502.08788

Country:

Asia (0.67)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Artificial IntelligenceFeb-12-2025

Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

Qi, Han, Yang, Haochen, Zhang, Qiaosheng, Yang, Zhuoran

We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term that encourages exploration of the unknown environment (which quantifies the information gained about the environment through observed human feedback data). To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(\epsilon)) T})$, where $H$ is the episode length, $T$ is the number of episode and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular settings, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2502.05434

Country: Asia > China (0.14)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceFeb-6-2025

Online Preference Alignment for Language Models via Count-based Exploration

Bai, Chenjia, Zhang, Yang, Qiu, Shuang, Zhang, Qiaosheng, Xu, Kang, Li, Xuelong

Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage, and the resulting reward model is hard to generalize in out-of-distribution responses. Thus, online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e. \emph{how to explore} for LLM. We give a theoretical motivation in linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the UCB-term can be converted to a count-based exploration bonus. We further propose a practical algorithm, named \emph{Count-based Online Preference Optimization (COPO)}, which leverages a simple coin-flip counting module to estimate the pseudo-count of a prompt-response pair in previously collected data. COPO encourages LLMs to balance exploration and preference optimization in an iterative manner, which enlarges the exploration space and the entire data coverage of iterative LLM policies. We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

2501.12735

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Machine LearningJan-27-2025

Community Detection for Contextual-LSBM: Theoretical Limitations of Misclassification Rate and Efficient Algorithms

Jin, Dian, Zhang, Yuqian, Zhang, Qiaosheng

The integration of network information and node attribute information has recently gained significant attention in the community detection literature. In this work, we consider community detection in the Contextual Labeled Stochastic Block Model (CLSBM), where the network follows an LSBM and node attributes follow a Gaussian Mixture Model (GMM). Our primary focus is the misclassification rate, which measures the expected number of nodes misclassified by community detection algorithms. We first establish a lower bound on the optimal misclassification rate that holds for any algorithm. When we specialize our setting to the LSBM (which preserves only network information) or the GMM (which preserves only node attribute information), our lower bound recovers prior results. Moreover, we present an efficient spectral-based algorithm tailored for the CLSBM and derive an upper bound on its misclassification rate. Although the algorithm does not attain the lower bound, it serves as a reliable starting point for designing more accurate community detection algorithms (as many algorithms use spectral method as an initial step, followed by refinement procedures to enhance accuracy).

data mining, machine learning, node, (17 more...)

2501.11139

Genre: Research Report (0.50)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

arXiv.org Artificial IntelligenceJan-24-2025

Graph Feedback Bandits on Similar Arms: With and Without Graph Structures

Qi, Han, Guo, Fei, Zhu, Li, Zhang, Qiaosheng, Li, Xuelong

In this paper, we study the stochastic multi-armed bandit problem with graph feedback. Motivated by applications in clinical trials and recommendation systems, we assume that two arms are connected if and only if they are similar (i.e., their means are close to each other). We establish a regret lower bound for this problem under the novel feedback structure and introduce two upper confidence bound (UCB)-based algorithms: Double-UCB, which has problem-independent regret upper bounds, and Conservative-UCB, which has problem-dependent upper bounds. Leveraging the similarity structure, we also explore a scenario where the number of arms increases over time (referred to as the \emph{ballooning setting}). Practical applications of this scenario include Q\&A platforms (e.g., Reddit, Stack Overflow, Quora) and product reviews on platforms like Amazon and Flipkart, where answers (or reviews) continuously appear, and the goal is to display the best ones at the top. We extend these two UCB-based algorithms to the ballooning setting. Under mild assumptions, we provide regret upper bounds for both algorithms and discuss their sub-linearity. Furthermore, we propose a new version of the corresponding algorithms that do not rely on prior knowledge of the graph's structural information and provide regret upper bounds. Finally, we conduct experiments to validate the theoretical results.

artificial intelligence, data mining, machine learning, (20 more...)

2501.14314

Country: Asia > China (0.46)

Genre:

Research Report > New Finding (0.48)
Research Report > Experimental Study (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)

arXiv.org Machine LearningDec-19-2024

Understanding When and Why Graph Attention Mechanisms Work via Node Classification

Ma, Zhongtian, Zhang, Qiaosheng, Zhou, Bocheng, Zhang, Yexin, Hu, Shuyue, Wang, Zhen

Despite the growing popularity of graph attention mechanisms, their theoretical understanding remains limited. This paper aims to explore the conditions under which these mechanisms are effective in node classification tasks through the lens of Contextual Stochastic Block Models (CSBMs). Our theoretical analysis reveals that incorporating graph attention mechanisms is \emph{not universally beneficial}. Specifically, by appropriately defining \emph{structure noise} and \emph{feature noise} in graphs, we show that graph attention mechanisms can enhance classification performance when structure noise exceeds feature noise. Conversely, when feature noise predominates, simpler graph convolution operations are more effective. Furthermore, we examine the over-smoothing phenomenon and show that, in the high signal-to-noise ratio (SNR) regime, graph convolutional networks suffer from over-smoothing, whereas graph attention mechanisms can effectively resolve this issue. Building on these insights, we propose a novel multi-layer Graph Attention Network (GAT) architecture that significantly outperforms single-layer GATs in achieving \emph{perfect node classification} in CSBMs, relaxing the SNR requirement from $ \omega(\sqrt{\log n}) $ to $ \omega(\sqrt{\log n} / \sqrt[3]{n}) $. To our knowledge, this is the first study to delineate the conditions for perfect node classification using multi-layer GATs. Our theoretical contributions are corroborated by extensive experiments on both synthetic and real-world datasets, highlighting the practical implications of our findings.

artificial intelligence, data mining, machine learning, (18 more...)

2412.15496

Country:

North America > United States (0.45)
Europe (0.27)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

arXiv.org Machine LearningJun-3-2024

A Theory of Learnability for Offline Decision Making

Mao, Chenjie, Zhang, Qiaosheng

We study the problem of offline decision making, which focuses on learning decisions from datasets only partially correlated with the learning objective. While previous research has extensively studied specific offline decision making problems like offline reinforcement learning (RL) and off-policy evaluation (OPE), a unified framework and theory remain absent. To address this gap, we introduce a unified framework termed Decision Making with Offline Feedback (DMOF), which captures a wide range of offline decision making problems including offline RL, OPE, and offline partially observable Markov decision processes (POMDPs). For the DMOF framework, we introduce a hardness measure called the Offline Estimation Coefficient (OEC), which measures the learnability of offline decision making problems and is also reflected in the derived minimax lower bounds. Additionally, we introduce an algorithm called Empirical Decision with Divergence (EDD), for which we establish both an instance-dependent upper bound and a minimax upper bound. The minimax upper bound almost matches the lower bound determined by the OEC. Finally, we show that EDD achieves a fast convergence rate (i.e., a rate scaling as $1/N$, where $N$ is the sample size) for specific settings such as supervised learning and Markovian sequential problems~(e.g., MDPs) with partial coverage.

artificial intelligence, assumption, machine learning, (14 more...)

2406.01378

Country: North America (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

arXiv.org Artificial IntelligenceMay-24-2024

Constrained Ensemble Exploration for Unsupervised Skill Discovery

Bai, Chenjia, Yang, Rushuai, Zhang, Qiaosheng, Xu, Kang, Chen, Yi, Xiao, Ting, Li, Xuelong

Unsupervised Reinforcement Learning (RL) provides a promising paradigm for learning useful behaviors via reward-free per-training. Existing methods for unsupervised RL mainly conduct empowerment-driven skill discovery or entropy-based exploration. However, empowerment often leads to static skills, and pure exploration only maximizes the state coverage rather than learning useful behaviors. In this paper, we propose a novel unsupervised RL framework via an ensemble of skills, where each skill performs partition exploration based on the state prototypes. Thus, each skill can explore the clustered area locally, and the ensemble skills maximize the overall state coverage. We adopt state-distribution constraints for the skill occupancy and the desired cluster for learning distinguishable skills. Theoretical analysis is provided for the state entropy and the resulting skill distributions. Based on extensive experiments on several challenging tasks, we find our method learns well-explored ensemble skills and achieves superior performance in various downstream tasks compared to previous methods.

machine learning, natural language, reinforcement learning, (15 more...)

2405.1603

Country:

Asia > China (0.28)
Europe > Austria > Vienna (0.14)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(2 more...)

arXiv.org Artificial IntelligenceMay-12-2024

Ensemble Successor Representations for Task Generalization in Offline-to-Online Reinforcement Learning

Wang, Changhong, Yu, Xudong, Bai, Chenjia, Zhang, Qiaosheng, Wang, Zhen

In Reinforcement Learning (RL), training a policy from scratch with online experiences can be inefficient because of the difficulties in exploration. Recently, offline RL provides a promising solution by giving an initialized offline policy, which can be refined through online interactions. However, existing approaches primarily perform offline and online learning in the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common that we only have an offline dataset from a specific task while aiming for fast online-adaptation for several tasks. To address this problem, our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning. We demonstrate that the conventional paradigm using successor features cannot effectively utilize offline data and improve the performance for the new task by online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to acquire an ensemble of successor representations and subsequently constructs ensemble Q functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaption of Q functions towards new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence showcasing the superior performance of our method in generalizing to diverse or even unseen tasks.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2405.07223

Country:

Asia > China (0.70)
North America > United States (0.46)
Europe > Austria > Vienna (0.14)

Genre:

Research Report > New Finding (0.67)
Research Report > Promising Solution (0.66)
Instructional Material > Online (0.41)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningApr-30-2024

Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning

Zhang, Qiaosheng, Bai, Chenjia, Hu, Shuyue, Wang, Zhen, Li, Xuelong

This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS). These algorithms draw inspiration from foundational concepts in information theory, and are proven to be sample efficient in MARL settings such as two-player zero-sum Markov games (MGs) and multi-player general-sum MGs. For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium. The basic algorithm, referred to as MAIDS, employs an asymmetric learning structure where the max-player first solves a minimax optimization problem based on the joint information ratio of the joint policy, and the min-player then minimizes the marginal information ratio with the max-player's policy fixed. Theoretical analyses show that it achieves a Bayesian regret of tilde{O}(sqrt{K}) for K episodes. To reduce the computational load of MAIDS, we develop an improved algorithm called Reg-MAIDS, which has the same Bayesian regret bound while enjoying less computational complexity. Moreover, by leveraging the flexibility of IDS principle in choosing the learning target, we propose two methods for constructing compressed environments based on rate-distortion theory, upon which we develop an algorithm Compressed-MAIDS wherein the learning target is a compressed environment. Finally, we extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2404.19292

Country: Asia > China (0.14)

Genre: Research Report (0.81)

Industry: Energy > Oil & Gas > Upstream (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)