Bedi, Amrit
VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences
Singh, Anukriti, Bhaskar, Amisha, Yu, Peihong, Chakraborty, Souradip, Dasyam, Ruthwik, Bedi, Amrit, Tokekar, Pratap
Designing reward functions for continuous-control robotics often leads to subtle misalignments or reward hacking, especially in complex tasks. Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback rather than hand-crafted signals, yet scaling human annotations remains challenging. Recent work uses Vision-Language Models (VLMs) to automate preference labeling, but a single final-state image generally fails to capture the agent's full motion. In this paper, we present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy. First, we overlay trajectory sketches on final observations to reveal the path taken, allowing VLMs to provide more reliable preferences and improving preference accuracy by approximately 15-20% in Meta-World tasks. Second, we regularize reward learning by incorporating the agent's performance, ensuring that the reward model is optimized based on data generated by the current policy; this addition boosts episode returns by 20-30% in locomotion tasks. Empirical studies on Meta-World demonstrate that our method achieves, for instance, a success rate of around 70-80% in all tasks, compared to below 50% for standard approaches. These results underscore the efficacy of combining richer visual representations with agent-aware reward regularization.
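For readers unfamiliar with how preference-based RL learns a reward from comparative feedback, the following is a minimal sketch of the Bradley-Terry preference model that is commonly used for this purpose. It is not taken from the VARP paper: the function names `preference_probability` and `preference_loss` are hypothetical, the per-step rewards are assumed to come from some learned reward model, and the agent-regularization term described in the abstract is not included.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    rewards_a and rewards_b are per-step predicted rewards for two trajectory
    segments (assumed to be outputs of a learned reward model).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable sigmoid of the difference in predicted returns.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss for one labeled pair.

    label is 1.0 if the annotator (human or VLM) preferred A, 0.0 if B.
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))
```

Under this model, a noisy label (for example, a VLM judging from a single final-state image) pushes the reward in the wrong direction for that pair, which is one way to see why label accuracy matters for the learned reward.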
SAIL: Self-Improving Efficient Online Alignment of Large Language Models
Ding, Mucong, Chakraborty, Souradip, Agrawal, Vibhu, Che, Zora, Koppel, Alec, Wang, Mengdi, Bedi, Amrit, Huang, Furong
As artificial intelligence (AI) systems surpass human capabilities in various tasks, ensuring alignment with human values and ethics is crucial. This is especially important for large language models (LLMs), which are trained on diverse datasets that may contain harmful content. Reinforcement Learning from Human Feedback (RLHF) is an effective method for AI alignment, with models like OpenAI's GPT-4, Google's Gemini, and Anthropic's Claude showing safe and aligned behaviors. However, the vast majority of the current research in RLHF (Agarwal et al., 2020; Rafailov et al., 2023; Ouyang et al., 2022; Chakraborty et al., 2024; Swamy et al., 2024) focuses on the offline setting, which uses a fixed dataset of responses generated by the supervised fine-tuned (SFT) model and ranked by human experts. Consequently, these methods rely heavily on the quality of the offline data generated by the SFT model, which has drawbacks such as insufficient coverage of response-query pairs, leading to sub-optimal alignment. To address these shortcomings, recent literature (Guo et al., 2024a; Sharma et al., 2024; Lee et al., 2023; Yuan et al., 2024b) has focused on designing online RLHF algorithms. The setting of online RLHF transcends the constraints of a static offline dataset and aims to address two critical questions: Q1: How should we generate new responses during fine-tuning?
Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning
Yu, Peihong, Mishra, Manav, Koppel, Alec, Busart, Carl, Narayan, Priya, Manocha, Dinesh, Bedi, Amrit, Tokekar, Pratap
Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce a novel concept of personalized expert demonstrations, tailored for each individual agent or, more broadly, each individual type of agent within a heterogeneous team. These demonstrations pertain solely to single-agent behaviors and to how each agent can achieve personal goals, without encompassing any cooperative elements; naively imitating them therefore will not achieve cooperation, because of potential conflicts. To this end, we propose an approach that selectively utilizes personalized expert demonstrations as guidance and allows agents to learn to cooperate, namely personalized expert-guided MARL (PegMARL). This algorithm utilizes two discriminators: the first provides incentives based on the alignment of policy behavior with demonstrations, and the second regulates incentives based on whether the behavior leads to the desired objective. We evaluate PegMARL using personalized demonstrations in both discrete and continuous environments. The results demonstrate that PegMARL learns near-optimal policies even when provided with suboptimal demonstrations, and outperforms state-of-the-art MARL algorithms in solving coordinated tasks. We also showcase PegMARL's capability to leverage joint demonstrations in the StarCraft scenario and converge effectively even with demonstrations from non-co-trained policies.