TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization

Chen, Zengjue, Niu, Runliang, Kong, He, Wang, Qi, Xing, Qianli, Fan, Zipei

arXiv.org Artificial Intelligence 

C. VLA Post-training Framework in Simulation

After improving the computation of relative advantages and deriving the corresponding optimization objective, we integrate these components into a complete online reinforcement learning framework for VLA post-training in simulation. The framework trains a VLA model on a single task using reinforcement learning across multiple environments initialized with identical states. The VLA executes the same task in these parallel environments, sampling actions step by step until either one environment completes the task or all environments reach the maximum number of steps. During sampling, we apply the LLM-designed multi-stage reward function described earlier; each environment's observations provide the object positions and robot state information required for reward computation. Because all trajectories terminate simultaneously, they share the same length, which allows consistent grouping in subsequent processing. The collected trajectories are organized into a trajectory-level group, and relative advantages are computed within this group according to Eq. (3), yielding trajectory-level relative advantages. Similarly, since all trajectories terminate at the same timestep, every step can be grouped: we extract step-level data across trajectories (e.g., rewards and log probabilities of actions) and group the steps that share a timestep to form step-level groups. Step-level relative advantages are then computed using Eq.
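The two grouping schemes described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard GRPO-style normalization (value minus group mean, divided by group standard deviation), assumes a trajectory's return is the sum of its step rewards, and uses hypothetical names (`group_relative`, `trajectory_and_step_advantages`, `eps`). The exact forms of the paper's equations may differ.

```python
import numpy as np


def group_relative(values, axis=0, eps=1e-8):
    """Normalize values against their group's mean and std along `axis`.

    Assumption: standard GRPO-style normalization; `eps` is a hypothetical
    stabilizer that avoids division by zero when a group has no variance.
    """
    values = np.asarray(values, dtype=np.float64)
    mean = values.mean(axis=axis, keepdims=True)
    std = values.std(axis=axis, keepdims=True)
    return (values - mean) / (std + eps)


def trajectory_and_step_advantages(step_rewards):
    """Compute both advantage types from a (G, T) reward array.

    G is the number of parallel trajectories and T their shared length,
    which is guaranteed because all environments stop at the same step.
    """
    step_rewards = np.asarray(step_rewards, dtype=np.float64)
    # Assumption: trajectory return is the plain sum of step rewards.
    traj_returns = step_rewards.sum(axis=1)            # shape (G,)
    # One trajectory-level group containing all G trajectories.
    traj_adv = group_relative(traj_returns, axis=0)    # shape (G,)
    # One step-level group per timestep: normalize each column separately.
    step_adv = group_relative(step_rewards, axis=0)    # shape (G, T)
    return traj_adv, step_adv


# Illustrative rewards for 3 trajectories of 3 steps each (made-up values).
rewards = np.array([
    [0.0, 1.0, 2.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 0.0],
])
traj_adv, step_adv = trajectory_and_step_advantages(rewards)
```

Storing rewards as a single `(G, T)` array is only possible because the trajectories have equal length; with ragged trajectories, later timesteps would contain fewer samples and the step-level groups would need masking.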