SAGE: State-Aware Guided End-to-End Policy for Multi-Stage Sequential Tasks via Hidden Markov Decision Process
Wu, BinXu, Zhang, TengFei, Yang, Chen, Wen, JiaHao, Li, HaoCheng, Ma, JingTian, Chen, Zhen, Wang, JingYuan
arXiv.org Artificial Intelligence
Abstract--Multi-stage sequential (MSS) robotic manipulation tasks are prevalent and crucial in robotics. They often involve state ambiguity, where visually similar observations correspond to different actions. We present SAGE, a state-aware guided imitation learning framework that models tasks as a Hidden Markov Decision Process (HMDP) to explicitly capture latent task stages and resolve ambiguity. We instantiate the HMDP with a state transition network that infers hidden states, and a state-aware action policy that conditions on both observations and hidden states to produce actions, thereby enabling disambiguation across task stages. To reduce manual annotation effort, we propose a semi-automatic labeling pipeline combining active learning and soft label interpolation. In real-world experiments across multiple complex MSS tasks with state ambiguity, SAGE achieved 100% task success under the standard evaluation protocol, markedly surpassing the baselines. Ablation studies further show that such performance can be maintained with manual labeling for only about 13% of the states, indicating its strong effectiveness.

ROBOTIC manipulation tasks have attracted significant attention due to their broad applications. Vision-based strategies have been widely adopted [1], and have demonstrated remarkable performance across a variety of real-world scenarios [2], [3], [4], [5], [6]. However, a particular class of tasks--Multi-Stage Sequential (MSS) tasks--introduces distinctive challenges to vision-based policies. MSS tasks are characterized by a sequence of interdependent stages that must be executed in a prescribed temporal order, often requiring the policy to perform long-horizon reasoning, retain contextual information from prior steps, and ensure coherent progression across successive stages. In such cases, visually similar observations may correspond to different actions, resulting in ambiguity during action selection.
An illustrative case is the Push Buttons task shown in Figure 1. The visual observations at stages 1-1, 2-1, and 3-1 are nearly indistinguishable; however, the correct action--pressing the yellow, pink, or blue button--requires knowledge of the current task stage to be correctly determined.
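The role of the hidden state in this example can be illustrated with a toy sketch. It is not the SAGE implementation: in the paper both the state transition network and the action policy are learned, whereas here they are replaced by hand-written rules. The observation labels ("arm_at_rest", "button_pressed"), the action names, and the functions transition, state_aware_policy, and rollout are all hypothetical names introduced only for illustration.

```python
from typing import List, Tuple

# Hypothetical button order for a Push Buttons-like task (assumed, for illustration).
BUTTON_ORDER = ["yellow", "pink", "blue"]


def transition(stage: int, observation: str) -> int:
    """Toy stand-in for a learned state transition network.

    Advances the hidden stage by one whenever the observation shows a
    completed press; otherwise the stage is left unchanged.
    """
    if observation == "button_pressed" and stage < len(BUTTON_ORDER):
        return stage + 1
    return stage


def state_aware_policy(observation: str, stage: int) -> str:
    """Toy stand-in for a state-aware action policy.

    The action depends on both the observation and the hidden stage, so the
    same observation can map to different actions at different stages.
    """
    if stage >= len(BUTTON_ORDER):
        return "done"
    if observation == "arm_at_rest":
        return f"press_{BUTTON_ORDER[stage]}"
    return "retract"


def rollout(observations: List[str]) -> List[Tuple[int, str]]:
    """Update the hidden stage at every step, then act conditioned on it."""
    stage = 0
    trace = []
    for obs in observations:
        stage = transition(stage, obs)
        trace.append((stage, state_aware_policy(obs, stage)))
    return trace


if __name__ == "__main__":
    # The observation "arm_at_rest" repeats three times, yet each occurrence
    # yields a different button because the hidden stage differs.
    for stage, action in rollout(["arm_at_rest", "button_pressed"] * 3):
        print(stage, action)
```

A policy that saw only the observation would have to return the same action for all three "arm_at_rest" steps, which is the ambiguity the hidden stage is meant to remove.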
Sep-25-2025