We tested our method on Humanoid-v2 and confirmed that it works.

Neural Information Processing Systems

We thank the reviewers for their meaningful insights and constructive feedback. The result was reversed in Hopper, where RL actors contributed 200.86 while EA actors contributed 363.53. Therefore, all performance scores are measured at a fixed interaction step. R2: Ablation study is missing. We presented the effect of the variance update rule in Appendix C.3 by comparing the results. We then provided all combinations of our proposed mean and variance update rules in Table 2. We will add a section so that these ablations can be seen at a glance.



SUPPLEMENTARY MATERIAL Deep Reinforcement Learning with Stacked Hierarchical Attention for Text-based Games

Neural Information Processing Systems

Figure 1 shows an example of the raw interface of the game "ztuu", where raw textual observations are displayed. In this section, we show the first 15 interaction steps of two games: "zork1" and "ztuu".
1. Chosen action and reward. Action: west | Reward: 0 | Score: 0
===== Step 2 =====
1. Chosen action and reward. Action: south | Reward: 0 | Score: 0
===== Step 3 =====
1. Chosen action and reward. Action: south | Reward: 0 | Score: 0
===== Step 4 =====
1. Chosen action and reward. Action: west | Reward: 0 | Score: 0
===== Step 5 =====


Inference-Time Personalized Alignment with a Few User Preference Queries

Pădurean, Victor-Alexandru, Kamalaruban, Parameswaran, Kotalwar, Nachiket, Gotovos, Alkis, Singla, Adish

arXiv.org Artificial Intelligence

We study the problem of aligning a generative model's response with a user's preferences. Recent works have proposed several different formulations for personalized alignment; however, they either require a large number of user preference queries or require that the preference be explicitly specified as a text input. In this paper, we propose a novel inference-time personalized alignment method, UserAlign, that elicits the user's preferences with a few pairwise response-comparison queries. In particular, UserAlign builds on the theoretical framework of best-arm identification in logistic bandits and selects a personalized response from a fixed pool of the model's generated responses. The key idea is to treat the user's feedback as consistent and noise-free, and to incorporate it into the theoretical framework to identify the best response quickly. Experimental results across several tasks, involving personalized text and image generation, showcase the effectiveness of UserAlign in achieving personalized alignment.
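The consistent, noise-free feedback assumption is what makes a handful of pairwise queries sufficient: under it, a single elimination pass over the response pool identifies the best response exactly. The sketch below illustrates only that elimination idea; UserAlign itself additionally uses the logistic-bandit machinery to choose informative query pairs, and all names here are illustrative, not the paper's implementation.

```python
def find_best_response(responses, prefer):
    """Select the user's preferred response from a fixed pool via
    pairwise comparison queries.

    `prefer(a, b)` stands in for asking the user "do you prefer a
    over b?". If feedback is consistent and noise-free, a single
    elimination pass needs exactly len(responses) - 1 queries.
    """
    best = responses[0]
    n_queries = 0
    for candidate in responses[1:]:
        n_queries += 1
        if prefer(candidate, best):
            best = candidate
    return best, n_queries

# Toy user whose hidden preference is "longer is better" (illustration only).
pool = ["ok", "a longer reply", "the most detailed reply of all", "short"]
best, n_queries = find_best_response(pool, lambda a, b: len(a) > len(b))
```

With noisy feedback this linear scan would no longer be exact, which is why the bandit formulation matters in the general case.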




Information Seeking for Robust Decision Making under Partial Observability

Fang, Djengo Cyun-Jyun, Ke, Tsung-Wei

arXiv.org Artificial Intelligence

Explicit information seeking is essential to human problem-solving in practical environments characterized by incomplete information and noisy dynamics. When the true environmental state is not directly observable, humans seek information to update their internal dynamics and inform future decision-making. Although existing Large Language Model (LLM) planning agents have addressed observational uncertainty, they often overlook discrepancies between their internal dynamics and the actual environment. We introduce Information Seeking Decision Planner (InfoSeeker), an LLM decision-making framework that integrates task-oriented planning with information seeking to align internal dynamics and make optimal decisions under uncertainty in both agent observations and environmental dynamics. InfoSeeker prompts an LLM to actively gather information by planning actions to validate its understanding, detect environmental changes, or test hypotheses before generating or revising task-oriented plans. To evaluate InfoSeeker, we introduce a novel benchmark suite featuring partially observable environments with incomplete observations and uncertain dynamics. Experiments demonstrate that InfoSeeker achieves a 74% absolute performance gain over prior methods without sacrificing sample efficiency. Moreover, InfoSeeker generalizes across LLMs and outperforms baselines on established benchmarks such as robotic manipulation and web navigation. These findings underscore the importance of tightly integrating planning and information seeking for robust behavior in partially observable environments. The project page is available at https://infoseekerllm.github.io
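The probe-before-commit loop described above can be caricatured in a few lines: while the agent's internal model disagrees with the actual environment, it spends steps on information-seeking actions to realign, and only then executes the task plan. This is a toy sketch under made-up names; the real framework prompts an LLM to generate both kinds of actions.

```python
def run_episode(hidden_dynamics, max_steps=10):
    """Toy information-seeking decision loop.

    `hidden_dynamics` is the true (unobserved) environment state; the
    agent's `belief` starts misaligned, so a task action would fail.
    Probing first (an information-seeking action) realigns the belief,
    after which the task plan can succeed.
    """
    belief = None
    for step in range(1, max_steps + 1):
        if belief != hidden_dynamics:
            belief = hidden_dynamics  # probe: observe and update internal model
        else:
            return "success", step    # act: execute task plan under aligned belief
    return "failure", max_steps

outcome, steps_used = run_episode("door_is_locked")
```

The point of the sketch is the control flow, not the trivial belief update: information seeking is an explicit, budgeted action interleaved with task planning rather than a side effect of it.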


CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering

Zhao, Yang, Dai, Chengxiao, Zhuo, Wei, Xiu, Yue, Niyato, Dusit

arXiv.org Artificial Intelligence

Knowledge graphs provide structured context for multi-hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static k-hop expansions and "think-longer" prompting often over-retrieve, inflate context, and yield unpredictable runtime. We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user-specified budgets or prices, allowing per-query adaptation to trade-offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian-Constrained Multi-Agent Proximal Policy Optimization (LC-MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per-query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance-preserving, and deliver predictable performance under deployment constraints.
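A minimal sketch of the Lagrangian-budget idea (illustrative only, not LC-MAPPO itself): each resource gets a multiplier that rises while measured cost exceeds its user-specified budget and is projected back to zero once the constraint holds, which is what lets a trained policy trade accuracy against latency and token cost per query without retraining.

```python
def dual_update(lmbda, measured_cost, budget, lr=0.1):
    """Projected gradient ascent on a Lagrange multiplier: the penalty
    on a resource (interaction steps, selected tokens, edge edits)
    grows while cost exceeds the budget and decays toward zero once
    the constraint is satisfied."""
    return max(0.0, lmbda + lr * (measured_cost - budget))

# Toy trace: per-query interaction-step costs against a budget of 10.
lam = 0.0
for cost in [12.0, 11.0, 9.0, 8.0]:
    lam = dual_update(lam, cost, budget=10.0)
```

Here the multiplier grows over the first two over-budget queries and shrinks back toward zero as later queries come in under budget; the actual algorithm couples one such multiplier per budget with multi-agent PPO updates for the three agents.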

