AITopics | Reinforcement Learning

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.
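
As a rough illustration of the trial-and-error idea in the quotation, the following minimal sketch has a learner discover the best of three actions purely from the rewards it receives. Everything here is an assumption made for illustration and is not taken from Sutton and Barto's text: the function name `run_greedy_learner`, the fixed (noise-free) rewards, and the use of optimistic initial estimates to force each action to be tried at least once.

```python
def run_greedy_learner(true_rewards, steps=20, optimistic_start=1.0):
    """Greedy learner with optimistic initial estimates (illustrative only).

    true_rewards: hypothetical fixed reward per action, hidden from the learner.
    optimistic_start: assumed to exceed every true reward, so each action
    gets tried once before the learner settles on the best one.
    """
    n = len(true_rewards)
    estimates = [optimistic_start] * n  # learner's current guess per action
    counts = [0] * n                    # how often each action was tried
    total = 0.0
    for _ in range(steps):
        # Pick the action that currently looks best (first index wins ties).
        action = max(range(n), key=lambda a: estimates[a])
        reward = true_rewards[action]   # feedback from the environment
        counts[action] += 1
        # Incremental sample-average update of the estimate.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, counts, total


estimates, counts, total = run_greedy_learner([0.1, 0.9, 0.5])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
```

The learner is never told that the second action is best; it tries each action once, sees the rewards, and then keeps choosing the one with the highest estimate. With noisy rewards, some form of continued exploration would be needed, which this sketch deliberately omits.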

Grounded Reinforcement Learning: Learning to Win the Game under Human Commands Supplementary Materials

Neural Information Processing Systems Feb-8-2026, 05:06:43 GMT

In this section, we describe the details of MiniRTS Environment and human dataset. The data do not contain any personally identifiable information or offensive content. Figure 1: MiniRTS [2] implements the rock-paper-scissors attack graph, each army type has some units it is effective against and vulnerable to. "swordman", "spearman" and "cavalry" all are effective against "archer". Figure 2: Building units can produce different army units using resources. Resource Units: Resource units are stationary and neutral.
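
One way to picture the attack graph mentioned in the excerpt is as a directed graph stored in a dictionary. The sketch below is only an illustration: the names `EFFECTIVE_AGAINST`, `is_effective`, and `vulnerable_to` are hypothetical, and the only edges included are the three the excerpt states (swordman, spearman, and cavalry are effective against archer). The remaining edges of the MiniRTS graph are not given in the excerpt and are left out rather than guessed.

```python
# Directed "effective against" edges. Only relations stated in the excerpt
# are listed; the full MiniRTS attack graph has more edges than shown here.
EFFECTIVE_AGAINST = {
    "swordman": {"archer"},
    "spearman": {"archer"},
    "cavalry": {"archer"},
    "archer": set(),  # not stated in the excerpt, so left empty
}


def is_effective(attacker, defender):
    """Return True if the graph records attacker as effective against defender."""
    return defender in EFFECTIVE_AGAINST.get(attacker, set())


def vulnerable_to(unit):
    """Return the sorted list of unit types recorded as effective against unit."""
    return sorted(a for a, targets in EFFECTIVE_AGAINST.items() if unit in targets)


archer_threats = vulnerable_to("archer")
```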

catapult, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Industry: Government > Military > Army (0.36)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)

Grounded Reinforcement Learning: Learning to Win the Game under Human Commands

Neural Information Processing Systems Feb-8-2026, 05:06:40 GMT

From the RL perspective, it is extremely challenging to derive a precise reward function for human preferences since the commands are abstract and the valid behaviors are highly complicated and multi-modal.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country:

Europe > Czechia > Prague (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.46)

Industry: Leisure & Entertainment > Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

34f98c7c5d7063181da890ea8d25265a-Supplemental.pdf

Neural Information Processing Systems Feb-8-2026, 05:04:54 GMT

approximation, function approximation, max 2, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

34f98c7c5d7063181da890ea8d25265a-Paper.pdf

Neural Information Processing Systems Feb-8-2026, 05:04:50 GMT

approximation, assumption, function approximation, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

A Learning and Sampling

Neural Information Processing Systems Feb-8-2026, 04:53:34 GMT

A.1 Deep generative modelling. A complete trajectory is denoted by ζ = {s ... The log-likelihood function is: L(θ) = Σ ... Applying this simple identity, we also have: 0 = E ... On the other hand, it discourages action samples directly sampled from the prior. To ensure the transition model's validity, it needs to be grounded in real-world dynamics when jointly learned with the policy. Otherwise, the agent would be purely hallucinating based on the demonstrations. It would not be a problem if the action space is quantized. Intuitively, action samples at each step are updated with the energy of all subsequent actions and a single-step forward by back-propagation. To train the policy, Eq. (8) can now be rewritten as δ ... Eq. (5) is an empirical estimate of E ... We first prove the construction above is valid at optimality.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

3c56fe2f24038c4d22b9eb0aca78f590-Paper.pdf

Neural Information Processing Systems Feb-8-2026, 03:56:18 GMT

algorithm, oracle, oracle policy, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.95)

2f060912eacace9ce61ef339205ec54c-Supplemental-Conference.pdf

Neural Information Processing Systems Feb-8-2026, 03:55:05 GMT

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(6 more...)

Genre: Research Report (0.46)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

2f060912eacace9ce61ef339205ec54c-Paper-Conference.pdf

Neural Information Processing Systems Feb-8-2026, 03:55:02 GMT

convergence, nash policy, stochastic game, (11 more...)

Neural Information Processing Systems

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.04)
(8 more...)

Genre: Research Report (0.46)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.95)

124256ed80af5d4bf4c4de17b66c4298-Paper-Conference.pdf

Neural Information Processing Systems Feb-8-2026, 03:46:22 GMT

gdpo, graph generation, trajectory, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > Singapore (0.04)
Asia > Middle East > Jordan (0.04)
Africa > Zambia > Southern Province > Choma (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.67)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

The Value-Equivalence Principle for Model-Based Reinforcement Learning Supplementary Material

Neural Information Processing Systems Feb-8-2026, 03:45:55 GMT

Moreover, we include an additional result which illustrates a situation in which approximate VE models can outperform the MLE model. For each $(i, j)$ pair, the above expression is suggestive of a dot-product between two $nm$-dimensional vectors: a combination of $a_i$ and $c_j$, and a "flattened" version of $B$. Define the former combination of vectors as $d_{ij} = [a_{i1}c_{j1}, a_{i1}c_{j2}, \ldots, a_{in}c_{jm}]^\top \in \mathbb{R}^{nm \times 1}$, and stack them as rows as: $D = [d_{11}, d_{12}, \ldots, d_{nm}]^\top \in \mathbb{R}^{k\ell \times nm}$. To flatten $B$, simply define $b = [B_{11}, B_{12}, \ldots, B_{nm}]^\top$. Finally notice that the construction of $d_{ij}$ can be thought of as vertically stacking $n$ copies of $c_j$, each scaled by a different entry in $a_i$. This means that scaled copies of both $a_i$ and $c_j$ can be found by selecting specific groups of indices in $d_{ij}$. It follows that if $a_1, \ldots, a_n$ are linearly independent then so are $d_{1j}, \ldots, d_{nj}$ for any $j$.
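
The flattening construction in this excerpt can be checked numerically on a small case. The sketch below assumes that "the above expression" is the bilinear form a_i^T B c_j; the excerpt does not show that expression, so this is an inference. The function names `bilinear`, `combine`, `flatten`, and `dot`, and the example numbers, are hypothetical. Under that assumption, the combined vector d (n stacked copies of c, each scaled by one entry of a) dotted with the row-major flattening b of B should equal the bilinear form.

```python
def bilinear(a, B, c):
    """Compute a^T B c directly as a double sum."""
    return sum(a[p] * B[p][q] * c[q]
               for p in range(len(a)) for q in range(len(c)))


def combine(a, c):
    """Build d = [a_1 c_1, a_1 c_2, ..., a_n c_m]:
    n stacked copies of c, each scaled by a different entry of a."""
    return [ap * cq for ap in a for cq in c]


def flatten(B):
    """Build b = [B_11, B_12, ..., B_nm] by reading B row by row."""
    return [entry for row in B for entry in row]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


a = [1.0, 2.0]              # n = 2
c = [3.0, 4.0, 5.0]         # m = 3
B = [[1.0, 0.0, 2.0],
     [0.5, 1.0, 0.0]]       # n x m

d = combine(a, c)
lhs = bilinear(a, B, c)     # a^T B c
rhs = dot(d, flatten(B))    # d^T b
```

Here `d` is `[3.0, 4.0, 5.0, 6.0, 8.0, 10.0]`: the first three entries are `c` scaled by `a[0]` and the last three are `c` scaled by `a[1]`, which is the "scaled copies" structure the excerpt relies on for its linear-independence argument.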

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)
