Goto

Collaborating Authors

 sokoban


Contrastive Representations for Temporal Reasoning

Neural Information Processing Systems

In classical AI, perception relies on learning state-based representations, while planning -- temporal reasoning over action sequences -- is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Contrastive Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik's Cube. In particular, for the Rubik's Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS -- though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.



W(leaf,i) r+ γ V(s0) s env.RESET() solution [ ].List of actions N(leaf,i) 1 for 1 Lp do Q(leaf,i) W(leaf,i) actions PLANNER(s) function UPDATE(path, leaf)

Neural Information Processing Systems

A.1 MCTS-kSubS algorithm In Algorithm 4 we present a general MCTS solver based on AlphaZero. Solver repeatedly queries the planner for a list of actions and executes them one by one. Baseline planner returns only a single action at a time, whereas MCTS-kSubS gives around kactions - to reach the desired subgoal (number of actions depends on a subgoal distance, which not always equals k in practice). MCTS-kSubS operates on a high-level subgoal graph: nodes are subgoals proposed by the generator (see Algorithm 3) and edges - lists of actions informing how to move from one subgoal to another (computed by the low-level conditional policy in Algorithm 2). The graph structure is represented by treevariable. For every subgoal, it keeps up to C3 best nearby subgoals (according to generator scores) along with a mentioned list of actions and sum of rewards to obtain while moving from the parent to the child subgoal. Most of MCTS implementation is shared between MCTS-kSubS and AlphaZero baseline, as we can treat the behavioral-cloning policy as a subgoal generator with k = 1. MCTS-kSubS and the baseline are encapsulated in GEN_CHILDREN function (Algorithms 5 and 6).



A Potential Negative Societal Impacts

Neural Information Processing Systems

We have not trained our models with sensitive or private data, and we emphasize that our model's direct L( n) other than the constant one as long as g (n) and l ( n) are positively correlated. The results for the baselines AdaSubS, kSubS, BC, CQL, DT, and HIPS with learned models were copied from [18]. The total number of GPU hours used on this work was approximately 7,500. We used 6 CPU workers (AMD Trento) per GPU. In the latter case, completeness cannot be guaranteed.





Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

arXiv.org Artificial Intelligence

Developing robust world model reasoning is crucial for large language model (LLM) agents to plan and interact in complex environments. While multi-turn interaction offers a superior understanding of environmental dynamics via authentic feedback, current approaches often impose a rigid reasoning process, which constrains the model's active learning, ultimately hindering efficient world model reasoning. T o address these issues, we explore world-model internalization through efficient interaction and active reasoning (WMAct), which liberates the model from structured reasoning--allowing the model to shape thinking directly through its doing--and achieves effective and efficient world model reasoning with two key mechanisms: (1) a reward rescaling mechanism adjusting outcome reward based on action efficacy to incentivize redundancy reduction and purposeful interaction; (2) an interaction frequency annealing strategy to progressively reduce the maximum allowed interaction turns, which compels the model to condense its learning and internalize environmental dynamics rather than over-relying on environmental cues. Our experiments on Sokoban, Maze, and T axi show that WMAct yields effective world model reasoning capable of resolving tasks in a single turn that previously required multiple interactions and fosters strong transferability to complex environments, improving performance on a suite of reasoning benchmarks.