AITopics

2511.03724

Country: North America > United States > Texas (0.24)

Genre: Research Report (0.64)

Industry:

Banking & Finance > Trading (1.00)
Leisure & Entertainment > Games > Poker (0.48)
Leisure & Entertainment > Games > Chess (0.46)
Leisure & Entertainment > Games > Go (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

arXiv.org Artificial IntelligenceNov-10-2025

L2T-Tune:LLM-Guided Hybrid Database Tuning with LHS and TD3

Yang, Xinyue, Zheng, Chen, Hou, Yaoyang, Zhang, Renhao, Zhang, Yinyan, Wu, Yanjun, Zhang, Heng

Configuration tuning is critical for database performance. Although recent advancements in database tuning have shown promising results in throughput and latency improvement, challenges remain. First, the vast knob space makes direct optimization unstable and slow to converge. Second, reinforcement learning pipelines often lack effective warm-start guidance and require long offline training. Third, transferability is limited: when hardware or workloads change, existing models typically require substantial retraining to recover performance. To address these limitations, we propose L2T-Tune, a new LLM-guided hybrid database tuning framework that features a three-stage pipeline: Stage one performs a warm start that simultaneously generates uniform samples across the knob space and logs them into a shared pool; Stage two leverages a large language model to mine and prioritize tuning hints from manuals and community documents for rapid convergence. Stage three uses the warm-start sample pool to reduce the dimensionality of knobs and state features, then fine-tunes the configuration with the Twin Delayed Deep Deterministic Policy Gradient algorithm. We conduct experiments on L2T-Tune and the state-of-the-art models. Compared with the best-performing alternative, our approach improves performance by an average of 37.1% across all workloads, and by up to 73% on TPC-C. Compared with models trained with reinforcement learning, it achieves rapid convergence in the offline tuning stage on a single server. Moreover, during the online tuning stage, it only takes 30 steps to achieve best results.

large language model, machine learning, reinforcement learning, (19 more...)

2511.01602

Country: Asia > China (0.48)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Zu, Lipeng, Zhou, Hansong, Zhang, Xiaonan

From Static to Dynamic: Enhancing Offline-to-Online Reinforcement Learning via Energy-Guided Diffusion Stratification

Transitioning from offline to online reinforcement learning (RL) poses critical challenges due to distributional shifts between the fixed behavior policy in the offline dataset and the evolving policy during online learning. Although this issue is widely recognized, few methods attempt to explicitly assess or utilize the distributional structure of the offline data itself, leaving a research gap in adapting learning strategies to different types of samples. To address this challenge, we propose an innovative method, Energy-Guided Diffusion Stratification (StratDiff), which facilitates smoother transitions in offline-to-online RL. StratDiff deploys a diffusion model to learn prior knowledge from the offline dataset. It then refines this knowledge through energy-based functions to improve policy imitation and generate offline-like actions during online fine-tuning. The KL divergence between the generated action and the corresponding sampled action is computed for each sample and used to stratify the training batch into offline-like and online-like subsets. Offline-like samples are updated using offline objectives, while online-like samples follow online learning strategies. We demonstrate the effectiveness of StratDiff by integrating it with off-the-shelf methods Cal-QL and IQL. Extensive empirical evaluations on D4RL benchmarks show that StratDiff significantly outperforms existing methods, achieving enhanced adaptability and more stable performance across diverse RL settings.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2511.03828

Genre: Research Report (0.83)

Industry: Education > Educational Setting > Online (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Zhiyuan, Zeng, Liu, Jiashuo, Yin, Zhangyue, Zhang, Ge, Huang, Wenhao, Qiu, Xipeng

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.

machine learning, reinforcement learning, rloop, (15 more...)

2511.04285

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

Liu, Anjie, Wang, Jianhong, Kaski, Samuel, Wang, Jun, Yang, Mengyue

Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing external mechanisms (e.g., intrinsic rewards and human feedback) to coordinate agents mostly relies on empirical studies, lacking a easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce the concept of MARL interaction paradigms (orthogonal to MARL learning paradigms), using MAIDs to analyze and visualize both unguided self-organization and global guidance mechanisms in MARL. Then, we design a new MARL interaction paradigm, referred to as the targeted intervention paradigm that is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an MARL interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2510.17697

Country: Europe > Austria (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (0.92)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.49)

Åström, Hampus, Topp, Elin Anna, Malec, Jacek

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

In this paper we study how transforming regular reinforcement learning environments into goal-conditioned environments can let agents learn to solve tasks autonomously and reward-free. We show that an agent can learn to solve tasks by selecting its own goals in an environment-agnostic way, at training times comparable to externally guided reinforcement learning. Our method is independent of the underlying off-policy learning algorithm. Since our method is environment-agnostic, the agent does not value any goals higher than others, leading to instability in performance for individual goals. However, in our experiments, we show that the average goal success rate improves and stabilizes. An agent trained with this method can be instructed to seek any observations made in the environment, enabling generic training of agents prior to specific use cases.

machine learning, reinforcement learning, selection, (15 more...)

2511.04598

Genre: Research Report > New Finding (0.34)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)

Temporal Action Selection for Action Chunking

Weng, Yueyang, Zhang, Xiaopeng, Mu, Yongjin, Zhu, Yingcong, Li, Yanjie, Liu, Qi

Action chunking is a widely adopted approach in Learning from Demonstration (LfD). By modeling multi-step action chunks rather than single-step actions, action chunking significantly enhances modeling capabilities for human expert policies. However, the reduced decision frequency restricts the utilization of recent observations, degrading reactivity - particularly evident in the inadequate adaptation to sensor noise and dynamic environmental changes. Existing efforts to address this issue have primarily resorted to trading off reactivity against decision consistency, without achieving both. To address this limitation, we propose a novel algorithm, Temporal Action Selector (TAS), which caches predicted action chunks from multiple timesteps and dynamically selects the optimal action through a lightweight selector network. TAS achieves balanced optimization across three critical dimensions: reactivity, decision consistency, and motion coherence. Experiments across multiple tasks with diverse base policies show that TAS significantly improves success rates - yielding an absolute gain of up to 73.3%. Furthermore, integrating TAS as a base policy with residual reinforcement learning (RL) substantially enhances training efficiency and elevates the performance plateau. Experiments in both simulation and physical robots confirm the method's efficacy.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2511.04421

Country: Asia > China (0.28)

Genre: Research Report (0.64)

Industry: Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Iannotta, Marco, Yang, Yuxuan, Stork, Johannes A., Schaffernicht, Erik, Stoyanov, Todor

Can Context Bridge the Reality Gap? Sim-to-Real Transfer of Context-Aware Policies

Sim-to-real transfer remains a major challenge in reinforcement learning (RL) for robotics, as policies trained in simulation often fail to generalize to the real world due to discrepancies in environment dynamics. While standard approaches typically train policies agnostic to these variations, we investigate whether sim-to-real transfer can be improved by conditioning the policy on an estimate of the dynamics parameters -- referred to as context. To this end, we integrate a context estimation module into a DR-based RL framework and systematically compare SOTA supervision strategies. We evaluate the resulting context-aware policies in both a canonical control benchmark and a real-world pushing task using a Franka Emika Panda robot. Results show that context-aware policies outperform the context-agnostic baseline across all settings, although the best supervision strategy depends on the task. Introduction Reinforcement learning (RL) has achieved significant success in developing robot controllers capable of solving complex tasks [1]. To address these limitations, physics simulation engines are widely used as a safer and more efficient alternative for policy training. Once a policy has been trained in simulation, it is transferred to the physical robot--a process known as sim-to-real transfer [2, 1, 3]. Although promising, this paradigm is hindered by the reality or sim-to-real gap, which refers to the discrepancy between the simulated and real-world environments [4, 5].

machine learning, reinforcement learning, supervision strategy, (17 more...)

2511.04249

Country: Europe (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)

Yang, Claire, Cakmak, Maya, Kleiman-Weiner, Max

When Empowerment Disempowers

Empowerment, a measure of an agent's ability to control its environment, has been proposed as a universal goal-agnostic objective for motivating assistive behavior in AI agents. While multi-human settings like homes and hospitals are promising for AI assistance, prior work on empowerment-based assistance assumes that the agent assists one human in isolation. We introduce an open source multi-human gridworld test suite Disempower-Grid. Using Disempower-Grid, we empirically show that assistive RL agents optimizing for one human's empowerment can significantly reduce another human's environmental influence and rewards - a phenomenon we formalize as disempowerment. We characterize when disempowerment occurs in these environments and show that joint empowerment mitigates disempowerment at the cost of the user's reward. Our work reveals a broader challenge for the AI alignment community: goal-agnostic objectives that seem aligned in single-agent settings can become misaligned in multi-agent contexts.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2511.04177

Country: North America > United States (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning

Zhang, Jiaming, Yang, Yujie, Wang, Haoning, Zhang, Liping, Li, Shengbo Eben

Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints with violations exceeding the predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, strategies trained via EPO achieve performance comparable to optimal solutions with global constraint violations strictly remaining within a prescribed bound.

constraint, machine learning, reinforcement learning, (16 more...)

2511.04147

Country: Asia > China (0.15)

Genre:

Research Report (1.00)
Overview (1.00)

Industry:

Food & Agriculture > Agriculture (0.47)
Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)