AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

REALISM: A Regulatory Framework for Coordinated Scheduling in Multi-Operator Shared Micromobility Services

Tan, Heng, Yan, Hua, Yuan, Yukun, Wang, Guang, Yang, Yu

arXiv.org Artificial Intelligence, Aug-15-2025

Shared micromobility (e.g., shared bikes and electric scooters) is an emerging mode of urban transportation that has become increasingly popular worldwide. However, the proliferation of shared micromobility vehicles brings social problems to cities (e.g., overloaded vehicles on roads and inequitable vehicle deployment), which deviate from the city regulator's expectations for the service. In addition, having multiple operators in one city complicates the problem because the operators pursue non-cooperative, self-interested goals. Existing regulatory frameworks for multi-operator vehicle rebalancing generally assume intrusive control over the rebalancing of all operators' vehicles, which is not practical in the real world. To address this limitation, we design REALISM, a regulatory framework for coordinated scheduling in multi-operator shared micromobility services. It incorporates the city regulator's regulations by assigning each operator a score according to the achievement of the city goal and the operator's individual contribution to achieving it, measured by the Shapley value. To realize fairness-aware score assignment, we measure the fairness of the assigned scores and use it as one of the components for optimizing the score assignment model. To optimize the whole framework, we develop an alternating procedure in which the operators and the city regulator interact with each other until convergence. We evaluate our framework on real-world e-scooter usage data from Chicago. Our experimental results show that our method achieves a performance gain of at least 39.93% in the equity of vehicle usage and 1.82% in the average demand satisfaction of the whole city.
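
The abstract above scores operators by their Shapley value. As a rough illustration of that concept only (not the REALISM method itself), the sketch below computes exact Shapley values for a three-player cooperative game by averaging each player's marginal contribution over every ordering. The operator names "A", "B", "C" and the coalition values are hypothetical numbers chosen for the example.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to the worth of that coalition.
    Cost grows factorially with the number of players, so this only suits
    small games.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical "city goal achievement" for each coalition of operators.
COALITION_VALUE = {
    frozenset(): 0.0,
    frozenset({"A"}): 10.0,
    frozenset({"B"}): 20.0,
    frozenset({"C"}): 30.0,
    frozenset({"A", "B"}): 40.0,
    frozenset({"A", "C"}): 50.0,
    frozenset({"B", "C"}): 60.0,
    frozenset({"A", "B", "C"}): 90.0,
}

scores = shapley_values(["A", "B", "C"], COALITION_VALUE.__getitem__)
print(scores)
```

With these made-up values the scores sum to the worth of the full coalition, which is the efficiency property that makes the Shapley value a natural way to split a shared outcome among contributors.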

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2508.10166

Country:

North America > United States > Illinois > Cook County > Chicago (0.26)
North America > United States > Minnesota > Hennepin County (0.16)

Genre: Research Report > New Finding (0.66)

Industry:

Transportation > Passenger (1.00)
Transportation > Ground > Road (1.00)
Transportation > Electric Vehicle (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)

A Personalized Exercise Assistant using Reinforcement Learning (PEARL): Results from a four-arm Randomized-controlled Trial

Lee, Amy Armento, Hegde, Narayan, Deliu, Nina, Rosenzweig, Emily, Suggala, Arun, Lakshminarasimhan, Sriram, He, Qian, Hernandez, John, Seneviratne, Martin, Singh, Rahul, Kalkar, Pradnesh, Shanmugam, Karthikeyan, Raghuveer, Aravindan, Singh, Abhimanyu, Nguyen, My, Taylor, James, Alla, Jatin, Villar, Sofia S., Emir-Farinas, Hulya

arXiv.org Artificial Intelligence, Aug-15-2025

Consistent physical inactivity poses a major global health challenge. Mobile health (mHealth) interventions, particularly Just-in-Time Adaptive Interventions (JITAIs), offer a promising avenue for scalable, personalized physical activity (PA) promotion. However, developing and evaluating such interventions at scale, while integrating robust behavioral science, presents methodological hurdles. The PEARL study was the first large-scale, four-arm randomized controlled trial to assess a reinforcement learning (RL) algorithm, informed by health behavior change theory, to personalize the content and timing of PA nudges via a Fitbit app. We enrolled and randomized 13,463 Fitbit users into four study arms: control, random, fixed, and RL. The control arm received no nudges. The other three arms received nudges from a bank of 155 nudges based on behavioral science principles. The random arm received nudges selected at random. The fixed arm received nudges based on a pre-set logic from survey responses about PA barriers. The RL group received nudges selected by an adaptive RL algorithm. We included 7,711 participants in primary analyses (mean age 42.1, 86.3% female, baseline steps 5,618.2). We observed an increase in PA for the RL group compared to all other groups from baseline to 1 and 2 months. The RL group had significantly increased average daily step count at 1 month compared to all other groups: control (+296 steps, p=0.0002), random (+218 steps, p=0.005), and fixed (+238 steps, p=0.002). At 2 months, the RL group sustained a significant increase compared to the control group (+210 steps, p=0.0122). Generalized estimating equation models also revealed a sustained increase in daily steps in the RL group vs. control (+208 steps, p=0.002). These findings demonstrate the potential of a scalable, behaviorally-informed RL approach to personalize digital health interventions for PA.

machine learning, nudge, reinforcement learning, (15 more...)

2508.1006

Country:

Europe (1.00)
North America > United States (0.93)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.93)
Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Compass-Thinker-7B Technical Report

Zeng, Anxiang, Zhang, Haibo, Mo, Kaixiang, Zhang, Long, Liu, Shuman, Huang, Yanhui, Liu, Yawen, Sheng, Yuepeng, Huang, Yuwei

arXiv.org Artificial Intelligence, Aug-15-2025

Recent R1-Zero-like research further demonstrates that reasoning extension has given large language models (LLMs) unprecedented reasoning capabilities, and that Reinforcement Learning is the core technology for eliciting their complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with fewer computational resources and lower costs, and to provide insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open-source model through a specially designed Reinforcement Learning Pipeline. We curate a dataset of 30k verifiable mathematics problems for the Reinforcement Learning Pipeline. By configuring data and training settings with different difficulty distributions for different stages, the potential of the model is gradually unlocked and training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential and achieves superior performance on mathematics compared to same-sized RL models. In particular, on the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.

large language model, machine learning, reinforcement learning, (18 more...)

2508.08909

Genre: Research Report (0.89)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

69eba34671b3ef1ef38ee85caae6b2a1-Paper.pdf

Neural Information Processing Systems, Aug-14-2025, 23:18:55 GMT

learning, reinforcement learning, representation, (14 more...)

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

57444e14ecd9e2c8f603b4f012ce3811-Paper-Conference.pdf

Neural Information Processing Systems, Aug-14-2025, 23:12:52 GMT

agent, decentralized shield, shield, (17 more...)

Country:

North America > United States > Massachusetts > Suffolk County > Boston (0.04)
North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.46)

68331ff0427b551b68e911eebe35233b-Paper.pdf

Neural Information Processing Systems, Aug-14-2025, 22:54:17 GMT

learning, prediction, successor feature, (13 more...)

Country:

North America > Canada > Alberta (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California (0.04)
(2 more...)

Industry: Education (0.68)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Cross-Modal Domain Adaptation for Cost-Efficient Visual Reinforcement Learning

Chen, Xiong-Hui

Neural Information Processing Systems, Aug-14-2025, 22:52:26 GMT

Previous methods focus on same-modal domain adaptation.

mapping function, source domain, target domain, (14 more...)

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)

Variational Bayesian Optimistic Sampling

Neural Information Processing Systems, Aug-14-2025, 22:39:22 GMT

We consider online sequential decision problems where an agent must balance exploration and exploitation. We derive a set of Bayesian 'optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm producing policies in the optimistic set enjoys $\tilde{O}(\sqrt{AT})$ Bayesian regret for a problem with $A$ actions after $T$ rounds. We extend the regret analysis for optimistic policies to bilinear saddle-point problems, which include zero-sum matrix games and constrained bandits as special cases. In this case we show that Thompson sampling can produce policies outside of the optimistic set and suffer linear regret in some instances. Finding a policy inside the optimistic set amounts to solving a convex optimization problem, and we call the resulting algorithm 'variational Bayesian optimistic sampling' (VBOS). The procedure works for any posteriors, i.e., it does not require the posterior to have any special properties, such as log-concavity, unimodality, or smoothness. The variational view of the problem has many useful properties, including the ability to tune the exploration-exploitation tradeoff, add regularization, incorporate constraints, and linearly parameterize the policy.
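
For readers unfamiliar with the Thompson sampling policy mentioned in the abstract above, here is a minimal sketch of it for a Bernoulli multi-armed bandit with independent Beta(1, 1) priors. It illustrates plain Thompson sampling only, not VBOS; the arm probabilities, round count, and function name are assumptions made for the example.

```python
import random


def thompson_sampling(true_probs, rounds, seed=0):
    """Run Thompson sampling on a Bernoulli bandit with Beta(1, 1) priors.

    `true_probs` holds each arm's success probability, which is hidden from
    the agent and used only to simulate rewards.
    Returns (pulls per arm, successes per arm).
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Draw one plausible mean reward per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, successes


# Hypothetical three-armed problem; the last arm is the best one.
pulls, successes = thompson_sampling([0.1, 0.5, 0.9], rounds=2000, seed=0)
print(pulls, successes)
```

Sampling from the posterior rather than using its mean is what drives exploration here: an arm with little data has a wide posterior, so it is occasionally sampled high and tried again.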

data mining, machine learning, reinforcement learning, (20 more...)

Country: Europe (0.14)

Industry: Energy > Oil & Gas > Upstream (0.54)

Technology: