AITopics

2503.12946

Country:

Europe (0.93)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report > New Finding (0.66)

Industry: Semiconductors & Electronics (0.89)

Technology:

Information Technology > Software (0.91)
Information Technology > Artificial Intelligence > Machine Learning (0.46)

arXiv.org Machine LearningMar-1-2025

Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update

Wang, Jing, Zhang, Yu-Jie, Zhao, Peng, Zhou, Zhi-Hua

We study the stochastic linear bandits with heavy-tailed noise. Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits. Nonetheless, these methods rely on specific noise assumptions or bandit structures, limiting their applicability to general settings. The recent work [Huang et al.2024] develops a soft truncation method via the adaptive Huber regression to address these limitations. However, their method suffers undesired computational cost: it requires storing all historical data and performing a full pass over these data at each round. In this paper, we propose a \emph{one-pass} algorithm based on the online mirror descent framework. Our method updates using only current data at each round, reducing the per-round computational cost from $\widetilde{\mathcal{O}}(t \log T)$ to $\widetilde{\mathcal{O}}(1)$ with respect to current round $t$ and the time horizon $T$, and achieves a near-optimal and variance-aware regret of order $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\epsilon}{2(1+\epsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\epsilon}{2(1+\epsilon)}}\big)$ where $d$ is the dimension and $\nu_t^{1+\epsilon}$ is the $(1+\epsilon)$-th central moment of reward at round $t$.

artificial intelligence, data mining, machine learning, (18 more...)

2503.00419

Country: Asia (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

arXiv.org Artificial IntelligenceFeb-18-2025

A Smooth Transition Between Induction and Deduction: Fast Abductive Learning Based on Probabilistic Symbol Perception

Jia, Lin-Han, Han, Si-Yu, Guo, Lan-Zhe, Zhou, Zhi, Li, Zhao-Long, Li, Yu-Feng, Zhou, Zhi-Hua

Abductive learning (ABL) that integrates strengths of machine learning and logical reasoning to improve the learning generalization, has been recently shown effective. However, its efficiency is affected by the transition between numerical induction and symbolical deduction, leading to high computational costs in the worst-case scenario. Efforts on this issue remain to be limited. In this paper, we identified three reasons why previous optimization algorithms for ABL were not effective: insufficient utilization of prediction, symbol relationships, and accumulated experience in successful abductive processes, resulting in redundant calculations to the knowledge base. To address these challenges, we introduce an optimization algorithm named as Probabilistic Symbol Perception (PSP), which makes a smooth transition between induction and deduction and keeps the correctness of ABL unchanged. We leverage probability as a bridge and present an efficient data structure, achieving the transfer from a continuous probability sequence to discrete Boolean sequences with low computational complexity. Experiments demonstrate the promising results.

artificial intelligence, machine learning, sequence, (16 more...)

2502.12919

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

arXiv.org Machine LearningFeb-10-2025

Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits

Li, Long-Fei, Qian, Yu-Yang, Zhao, Peng, Zhou, Zhi-Hua

Reinforcement Learning from Human Feedback is a key technique for training large language models using human feedback [Ouyang et al., 2022, Bai et al., 2022]. The RLHF process involves collecting data, each consisting of a prompt, a pair of responses, and a human preference label indicating the preferred response. Then, a reward model is trained to predict human preferences, and the LLM is fine-tuned based on the reward model by RL algorithms, such as PPO [Schulman et al., 2017]. Given the notable success of RLHF, recent efforts have been devoted to developing a deeper theoretical understanding of this approach. Zhu et al. [2023] investigated the standard offline setting, where the learner is provided with a fixed dataset and aims to learn a policy that maximizes the expected reward. In this setting, since the learner has no control over the data collection process, the quality of the dataset becomes crucial to the performance of the learned policy and the resulting policy often performs poorly when faced with out-of-distribution data [Burns et al., 2024]. In practice, the Claude [Bai et al., 2022] and LLaMA-2 [Touvron et al., 2023] projects have demonstrated that iterative RLHF can significantly enhance model performance.

large language model, machine learning, natural language, (18 more...)

2502.07193

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceDec-11-2024

Efficient Rectification of Neuro-Symbolic Reasoning Inconsistencies by Abductive Reflection

Hu, Wen-Chao, Dai, Wang-Zhou, Jiang, Yuan, Zhou, Zhi-Hua

Neuro-Symbolic (NeSy) AI could be regarded as an analogy to human dual-process cognition, modeling the intuitive System 1 with neural networks and the algorithmic System 2 with symbolic reasoning. However, for complex learning targets, NeSy systems often generate outputs inconsistent with domain knowledge and it is challenging to rectify them. Inspired by the human Cognitive Reflection, which promptly detects errors in our intuitive response and revises them by invoking the System 2 reasoning, we propose to improve NeSy systems by introducing Abductive Reflection (ABL-Refl) based on the Abductive Learning (ABL) framework. ABL-Refl leverages domain knowledge to abduce a reflection vector during training, which can then flag potential errors in the neural network outputs and invoke abduction to rectify them and generate consistent outputs during inference. ABL-Refl is highly efficient in contrast to previous ABL implementations. Experiments show that ABL-Refl outperforms state-of-the-art NeSy methods, achieving excellent accuracy with fewer training resources and enhanced efficiency.

artificial intelligence, logic & formal reasoning, machine learning, (17 more...)

2412.08457

Genre: Research Report (0.50)

Industry: Leisure & Entertainment (0.32)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.93)

arXiv.org Artificial IntelligenceNov-8-2024

WHALE: Towards Generalizable and Scalable World Models for Embodied Decision-making

Zhang, Zhilong, Chen, Ruifeng, Ye, Junyin, Sun, Yihao, Wang, Pengyuan, Pang, Jingcheng, Li, Kaiyuan, Liu, Tianshuo, Lin, Haoxin, Yu, Yang, Zhou, Zhi-Hua

World models play a crucial role in decision-making within embodied environments, enabling cost-free explorations that would otherwise be expensive in the real world. To facilitate effective decision-making, world models must be equipped with strong generalizability to support faithful imagination in out-of-distribution (OOD) regions and provide reliable uncertainty estimation to assess the credibility of the simulated experiences, both of which present significant challenges for prior scalable approaches. This paper introduces WHALE, a framework for learning generalizable world models, consisting of two key techniques: behavior-conditioning and retracing-rollout. Behavior-conditioning addresses the policy distribution shift, one of the primary sources of the world model generalization error, while retracing-rollout enables efficient uncertainty estimation without the necessity of model ensembles. These techniques are universal and can be combined with any neural network architecture for world model learning. Incorporating these two techniques, we present Whale-ST, a scalable spatial-temporal transformer-based world model with enhanced generalizability. We demonstrate the superiority of Whale-ST in simulation tasks by evaluating both value estimation accuracy and video generation fidelity. Additionally, we examine the effectiveness of our uncertainty estimation technique, which enhances model-based policy optimization in fully offline scenarios. Furthermore, we propose Whale-X, a 414M parameter world model trained on 970K trajectories from Open X-Embodiment datasets. We show that Whale-X exhibits promising scalability and strong generalizability in real-world manipulation scenarios using minimal demonstrations.

large language model, machine learning, reinforcement learning, (20 more...)

2411.05619

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

arXiv.org Machine LearningNov-5-2024

Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs

Li, Long-Fei, Zhao, Peng, Zhou, Zhi-Hua

We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback, employing dynamic regret as the performance measure. We start with in-depth analyses of the strengths and limitations of the two most popular methods: occupancy-measure-based and policy-based methods. We observe that while the occupancy-measure-based method is effective in addressing non-stationary environments, it encounters difficulties with the unknown transition. In contrast, the policy-based method can deal with the unknown transition effectively but faces challenges in handling non-stationary environments. Building on this, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to tackle the unknown transition. We bridge these two parts by a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, $\bar{P}_K$ is the non-stationarity measure. We show it is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work that achieves near-optimal dynamic regret for adversarial linear mixture MDPs with the unknown transition without prior knowledge of the non-stationarity measure.

dynamic regret, machine learning, reinforcement learning, (16 more...)

2411.03107

Country: Asia (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)

arXiv.org Artificial IntelligenceMay-27-2024

Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

Li, Long-Fei, Zhang, Yu-Jie, Zhao, Peng, Zhou, Zhi-Hua

We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its benefits, introducing non-linear function approximation raises significant challenges in both computational and statistical efficiency. The best-known method of Hwang and Oh [2023] has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature space dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as the linear cases, the method requires storing all historical data and suffers from an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small, leading to a significant gap for the regret compared to the linear cases. In this work, we first address the computational concerns by proposing an online algorithm that achieves the same regret with only $\mathcal{O}(1)$ computation cost. Then, we design two algorithms that leverage local information to enhance statistical efficiency. They not only maintain an $\mathcal{O}(1)$ computation cost per episode but achieve improved regrets of $\widetilde{\mathcal{O}}(\kappa^{-1/2}dH^2\sqrt{K})$ and $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$ respectively. Finally, we establish a lower bound, justifying the optimality of our results in $d$ and $K$. To the best of our knowledge, this is the first work that achieves almost the same computational and statistical efficiency as linear function approximation while employing non-linear function approximation for reinforcement learning.

artificial intelligence, fuzzy logic, machine learning, (19 more...)

2405.17061

Country: Asia > Japan > Honshū > Kantō (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Machine LearningMar-7-2024

Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Li, Long-Fei, Zhao, Peng, Zhou, Zhi-Hua

We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, we focus on linear mixture MDPs whose transition kernel is a linear mixture model. We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability, where $d$ is the dimension of feature mappings, $S$ is the size of state space, $A$ is the size of action space, $H$ is the episode length and $K$ is the number of episodes. Our result strictly improves the previous best-known $\widetilde{O}(dS^2 \sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \leq S$ holds by the layered MDP structure. Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noises, originally proposed in the dynamic assortment area and firstly applied in reinforcement learning to handle correlations between different states.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2403.04568

Country:

Asia (0.14)
Europe (0.14)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)

arXiv.org Artificial IntelligenceJan-24-2024

Beimingwu: A Learnware Dock System

Tan, Zhi-Hao, Liu, Jian-Dong, Bi, Xiao-Dong, Tan, Peng, Zheng, Qin-Cheng, Liu, Hai-Tian, Xie, Yi, Zou, Xiao-Chuan, Yu, Yang, Zhou, Zhi-Hua

The learnware paradigm proposed by Zhou [2016] aims to enable users to reuse numerous existing well-trained models instead of building machine learning models from scratch, with the hope of solving new user tasks even beyond models' original purposes. In this paradigm, developers worldwide can submit their high-performing models spontaneously to the learnware dock system (formerly known as learnware market) without revealing their training data. Once the dock system accepts the model, it assigns a specification and accommodates the model. This specification allows the model to be adequately identified and assembled to reuse according to future users' needs, even if they have no prior knowledge of the model. This paradigm greatly differs from the current big model direction and it is expected that a learnware dock system housing millions or more high-performing models could offer excellent capabilities for both planned tasks where big models are applicable; and unplanned, specialized, data-sensitive scenarios where big models are not present or applicable. This paper describes Beimingwu, the first open-source learnware dock system providing foundational support for future research of learnware paradigm.The system significantly streamlines the model development for new user tasks, thanks to its integrated architecture and engine design, extensive engineering implementations and optimizations, and the integration of various algorithms for learnware identification and reuse. Notably, this is possible even for users with limited data and minimal expertise in machine learning, without compromising the raw data's security. Beimingwu supports the entire process of learnware paradigm. The system lays the foundation for future research in learnware-related algorithms and systems, and prepares the ground for hosting a vast array of learnwares and establishing a learnware ecosystem.

artificial intelligence, learnware, machine learning, (17 more...)

2401.14427

Country: Asia (0.14)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)