Wang, Yanwei
Inference-Time Policy Steering through Human Interactions
Wang, Yanwei, Wang, Lirui, Du, Yilun, Sundaralingam, Balakumar, Yang, Xuning, Chao, Yu-Wei, Perez-D'Arpino, Claudia, Fox, Dieter, Shah, Julie
Generative policies trained with human demonstrations can autonomously accomplish multimodal, long-horizon tasks. However, during inference, humans are often removed from the policy execution loop, limiting the ability to guide a pre-trained policy towards a specific sub-goal or trajectory shape among multiple predictions. Naive human intervention may inadvertently exacerbate distribution shift, leading to constraint violations or execution failures. To better align policy output with human intent without inducing out-of-distribution errors, we propose an Inference-Time Policy Steering (ITPS) framework that leverages human interactions to bias the generative sampling process, rather than fine-tuning the policy on interaction data. We evaluate ITPS across three simulated and real-world benchmarks, testing three forms of human interaction and associated alignment distance metrics. Among six sampling strategies, our proposed stochastic sampling with diffusion policy achieves the best trade-off between alignment and distribution shift. Videos are available at https://yanweiw.github.io/itps/.
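The abstract does not spell out the sampling mechanics, but one common way to bias a trained diffusion policy toward a user objective at inference time is to follow the gradient of an alignment cost between denoising steps. The sketch below is illustrative only, assuming hypothetical denoise_step (the policy's learned reverse-diffusion update) and alignment_cost (a differentiable distance from the human input, e.g., a sketched waypoint) callables; it is not necessarily the paper's stochastic sampling strategy.

```python
# Illustrative sketch of steering a diffusion policy at inference time.
# Assumptions: `denoise_step(traj, t)` is the policy's reverse-diffusion update
# and `alignment_cost(traj)` scores distance from the human input; both are
# hypothetical stand-ins, not the paper's API.
import torch

def guided_sample(denoise_step, alignment_cost, num_steps, traj_shape, guide_scale=1.0):
    traj = torch.randn(traj_shape)                 # start from Gaussian noise
    for t in reversed(range(num_steps)):
        traj = denoise_step(traj, t)               # ordinary denoising step
        traj = traj.detach().requires_grad_(True)
        cost = alignment_cost(traj)                # how far from the human's intent
        grad = torch.autograd.grad(cost, traj)[0]
        # Nudge the sample toward the human input; the remaining denoising
        # steps keep pulling it back toward the learned data distribution.
        traj = (traj - guide_scale * grad).detach()
    return traj
```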
Versatile Demonstration Interface: Toward More Flexible Robot Demonstration Collection
Hagenow, Michael, Kontogiorgos, Dimosthenis, Wang, Yanwei, Shah, Julie
Prior work in Learning from Demonstration has leveraged several approaches for a human to teach motions to a robot, including teleoperation, kinesthetic teaching, and natural demonstrations. However, little previous work has explored more general interfaces that allow for multiple demonstration types. Given the varied preferences of human demonstrators and task characteristics, a flexible tool that enables multiple demonstration types could be crucial for broader robot skill training. In this work, we propose the Versatile Demonstration Interface (VDI), an attachment for collaborative robots that simplifies the collection of three common types of demonstrations. Designed for flexible deployment in industrial settings, our tool requires no additional instrumentation of the environment. Our prototype interface captures human demonstrations through a combination of vision, force sensing, and state tracking (e.g., through robot proprioception or AprilTag tracking). In a user study in which we deployed our prototype VDI with manufacturing experts at a local manufacturing innovation center, we demonstrated its efficacy in representative industrial tasks. Interactions from our study exposed a range of industrial use cases for VDI, clear relationships between demonstration preferences and task criteria, and insights for future tool design.

Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
Wang, Yanwei, Wang, Tsun-Hsuan, Mao, Jiayuan, Hagenow, Michael, Shah, Julie
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://yanweiw.github.io/glide
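As a rough illustration of how counterfactual replays can supervise mode grounding without dense labels, the sketch below assumes a hypothetical per-state ModeClassifier and a {0,1} matrix of allowed mode transitions (which in the paper's setting would be informed by the LLM's plan). Chaining per-step transition scores gives a differentiable trajectory-level success score; training that score with a binary cross-entropy loss against the success/failure labels of perturbed replays yields the mode classifier as a by-product. This is a sketch of the idea, not the paper's implementation.

```python
# Hypothetical sketch: ground states in mode families from success/failure labels.
import torch
import torch.nn as nn

class ModeClassifier(nn.Module):
    def __init__(self, state_dim, num_modes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_modes))

    def forward(self, states):                       # (T, state_dim) -> (T, K)
        return self.net(states).softmax(dim=-1)

def trajectory_success_logit(mode_probs, valid_transitions):
    # Expected validity of each consecutive transition, p_t^T M p_{t+1},
    # where M is a {0,1} matrix of allowed mode transitions (assumed given).
    step_scores = (mode_probs[:-1].unsqueeze(2) * mode_probs[1:].unsqueeze(1)
                   * valid_transitions).sum(dim=(1, 2))
    # Log-probability that the whole trajectory follows a valid mode chain;
    # trained to be high for successful replays and low for counterfactuals.
    return step_scores.clamp_min(1e-6).log().sum()
```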
Human-Machine Cooperative Multimodal Learning Method for Cross-subject Olfactory Preference Recognition
Xia, Xiuxin, Guo, Yuchen, Wang, Yanwei, Yang, Yuchao, Shi, Yan, Men, Hong
Odor sensory evaluation has broad applications in food, clothing, cosmetics, and other fields. Traditional human sensory evaluation has poor repeatability, while machine olfaction, represented by the electronic nose (E-nose), struggles to reflect human feelings. Olfactory electroencephalogram (EEG) contains odor and individual features associated with human olfactory preference, giving it unique advantages in odor sensory evaluation. However, the difficulty of cross-subject olfactory EEG recognition greatly limits its application. Notably, the E-nose and olfactory EEG are better suited to representing odor information and individual emotions, respectively. In this paper, an E-nose and olfactory EEG multimodal learning method is proposed for cross-subject olfactory preference recognition. First, the olfactory EEG and E-nose multimodal data acquisition and preprocessing paradigms are established. Second, a complementary multimodal data mining strategy is proposed to effectively mine the common features of the multimodal data that represent odor information and the individual features in the olfactory EEG that represent individual emotional information. Finally, cross-subject olfactory preference recognition is achieved across 24 subjects by fusing the extracted common and individual features, with recognition performance superior to state-of-the-art methods. These results indicate the method's potential for practical odor evaluation applications.
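A minimal sketch of the final fusion step, assuming precomputed modality-common features (odor information shared by the E-nose and EEG) and EEG-specific individual features (subject emotion); the feature extractors, dimensions, and classifier head here are placeholders rather than the paper's architecture.

```python
# Placeholder fusion classifier: concatenate common and individual features.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, common_dim, individual_dim, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(common_dim + individual_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, common_feat, eeg_individual_feat):
        # Common features carry odor information shared by E-nose and EEG;
        # individual features carry subject-specific emotional information.
        return self.head(torch.cat([common_feat, eeg_individual_feat], dim=-1))
```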
Improving Small Language Models on PubMedQA via Generative Data Augmentation
Guo, Zhen, Wang, Peiqi, Wang, Yanwei, Yu, Shangdi
Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing. However, their increasing size poses challenges in terms of computational cost. On the other hand, Small Language Models (SLMs) are known for their efficiency, but they often struggle with limited capacity and training data, especially in specific domains. In this paper, we introduce a novel method aimed at improving SLMs in the medical domain using LLM-based generative data augmentation. The objective of our approach is to develop more efficient and capable models that are specifically tailored for specialized applications. Through experiments conducted on the PubMedQA dataset, we demonstrate the effectiveness of LLMs in refining and diversifying existing question-answer pairs. This refinement process leads to improved performance in a significantly smaller model after fine-tuning. Notably, our best SLM, with under 1.6 billion parameters, outperforms the few-shot GPT-4 on the PubMedQA dataset. Our code and generated data are publicly available to facilitate further exploration [1].
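A minimal sketch of an augmentation loop in this spirit, assuming a hypothetical llm_complete(prompt) wrapper around whatever LLM endpoint is used and illustrative PubMedQA-style field names; the actual prompts, parsing, and filtering in the paper may differ.

```python
# Hypothetical sketch: use an LLM to paraphrase existing QA pairs before
# fine-tuning a small model. `llm_complete` is an assumed text-completion wrapper.
def augment_qa_pairs(qa_pairs, llm_complete, variants_per_item=2):
    augmented = list(qa_pairs)                  # always keep the original pairs
    for item in qa_pairs:                       # item: question / long_answer / final_decision
        for _ in range(variants_per_item):
            prompt = (
                "Paraphrase the following biomedical long answer, preserving all "
                "factual content and the final yes/no/maybe decision.\n"
                f"Question: {item['question']}\n"
                f"Long answer: {item['long_answer']}\n"
                "Paraphrased long answer:"
            )
            variant = dict(item, long_answer=llm_complete(prompt), synthetic=True)
            augmented.append(variant)
    return augmented
```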
Visual Pre-training for Navigation: What Can We Learn from Noise?
Wang, Yanwei, Ko, Ching-Yun, Agrawal, Pulkit
One powerful paradigm in visual navigation is to predict actions directly from observations. Training such an end-to-end system allows representations useful for downstream tasks to emerge automatically. However, the lack of inductive bias makes this system data inefficient. We hypothesize that a sufficient representation of the current view and the goal view for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training such a random crop prediction task in a self-supervised fashion purely on synthetic noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. The code is available at https://yanweiw.github.io/noise2ptz
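A minimal sketch of the crop-prediction pretext task, assuming square image tensors and an illustrative parameterization of the regression target (crop center and relative size); the network architecture and training details are not shown and the specifics here are placeholders, not the paper's exact setup.

```python
# Illustrative pretext-task data generation: sample a crop of the current view,
# treat it as the "goal view", and regress the crop's location and size.
import torch
import torch.nn.functional as F

def make_crop_example(image, min_size=32):
    # image: (C, H, W). Works for noise images or natural images alike.
    _, H, W = image.shape
    size = torch.randint(min_size, min(H, W), (1,)).item()
    x = torch.randint(0, W - size, (1,)).item()
    y = torch.randint(0, H - size, (1,)).item()
    goal = F.interpolate(image[:, y:y + size, x:x + size].unsqueeze(0),
                         size=(H, W), mode="bilinear", align_corners=False)[0]
    # Normalized crop center and relative size serve as the regression target.
    target = torch.tensor([(x + size / 2) / W, (y + size / 2) / H, size / min(H, W)])
    return image, goal, target   # train a network to predict `target` from (image, goal)
```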
Temporal Logic Imitation: Learning Plan-Satisficing Motion Policies from Demonstrations
Wang, Yanwei, Figueroa, Nadia, Li, Shen, Shah, Ankit, Shah, Julie
In prior work, learning from demonstration (LfD) [1, 2] has successfully enabled robots to accomplish multi-step tasks by segmenting demonstrations (primarily of robot end-effector or tool trajectories) into sub-tasks/goals [3, 4, 5, 6, 7, 8], phases [9, 10], keyframes [11, 12], or skills/primitives/options [13, 14, 15, 16]. Most of these abstractions assume that reaching subgoals sequentially will deliver the desired outcomes; however, successful imitation of many manipulation tasks with spatial/temporal constraints cannot be reduced to imitation at the motion level unless the learned motion policy also satisfies these constraints. This becomes highly relevant if we want robots to not only imitate but also generalize, adapt, and be robust to perturbations imposed by humans, who are in the loop of task learning and execution. LfD techniques that learn stable motion policies with convergence guarantees (e.g., Dynamic Movement Primitives (DMP) [17], Dynamical Systems (DS) [18]) are capable of providing such desired properties, but only at the motion level. As shown in Figure 1 (a-b), a robot can successfully replay a soup-scooping task while being robust to physical perturbations with a learned DS. Nevertheless, if the spoon orientation is perturbed to a state where all material is dropped, as seen in Figure 1 (c), the motion policy will still lead the robot to the target, unaware of the task-level failure or how to recover from it.
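For intuition, a toy linear dynamical system illustrates the motion-level convergence guarantee described above: the rollout reaches the target from any perturbed state, yet nothing in the policy detects a task-level failure such as spilled material. This is a simple linear DS for illustration only, not the learned DS formulation from the paper.

```python
# Toy stable DS: x_dot = A (x* - x) with A = gain * I positive definite,
# so the rollout converges to `target` from any initial or perturbed state.
import numpy as np

def ds_velocity(x, target, gain=2.0):
    return gain * (target - x)

def rollout(x0, target, dt=0.01, steps=500):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * ds_velocity(x, target)
    return x   # ends near `target`; a mid-rollout perturbation of x is simply absorbed
```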