visual target
Annotation-Free One-Shot Imitation Learning for Multi-Step Manipulation Tasks
Wichitwechkarn, Vijja, Williams, Emlyn, Fox, Charles, Choudhary, Ruchi
Abstract: Recent advances in one-shot imitation learning have enabled robots to acquire new manipulation skills from a single human demonstration. While existing methods achieve strong performance on single-step tasks, they remain limited in their ability to handle long-horizon, multi-step tasks without additional model training or manual annotation. We propose a method that can be applied to this setting given a single demonstration, without additional model training or manual annotation. We evaluate our method on multi-step and single-step manipulation tasks, where it achieves average success rates of 82.5% and 90%, respectively. Our method matches or exceeds the performance of the baselines in both cases. We also compare the performance and computational efficiency of alternative pre-trained feature extractors within our framework.
I. INTRODUCTION
Recent advances in imitation learning have enabled robots to perform increasingly complex tasks. However, these methods still require hundreds to thousands of demonstrations per task [1], [2], [3], [4], making them impractical for real-world deployment.
Representing Positional Information in Generative World Models for Object Manipulation
Ferraro, Stefano, Mazzaglia, Pietro, Verbelen, Tim, Dhoedt, Bart, Rajeswar, Sai
Object manipulation capabilities are essential skills that set apart embodied agents engaging with the world, especially in the realm of robotics. The ability to predict outcomes of interactions with objects is paramount in this setting. While model-based control methods have started to be employed for tackling manipulation tasks, they have faced challenges in accurately manipulating objects. Analyzing this limitation, we trace the underperformance to the way current world models represent crucial positional information, especially the goal specification for object-positioning tasks. We introduce a general approach that empowers world model-based agents to effectively solve object-positioning tasks. We propose two variants of this approach for generative world models: position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In particular, LCP employs object-centric latent representations that explicitly capture object positional information for goal specification. This naturally leads to the emergence of multimodal capabilities, enabling the specification of goals through spatial coordinates or a visual goal. Our methods are rigorously evaluated across several manipulation environments, showing favorable performance compared to current model-based control approaches.
Learning to Edit Visual Programs with Self-Supervision
Jones, R. Kenny, Zhang, Renhao, Ganeshan, Aditya, Ritchie, Daniel
We design a system that learns how to edit visual programs. Our edit network consumes a complete input program and a visual target. From this input, we task our network with predicting a local edit operation that could be applied to the input program to improve its similarity to the target. To apply this scheme to domains that lack program annotations, we develop a self-supervised learning approach that integrates this edit network into a bootstrapped finetuning loop along with a network that predicts entire programs in one shot. Our joint finetuning scheme, when coupled with an inference procedure that initializes a population from the one-shot model and evolves members of this population with the edit network, helps to infer more accurate visual programs. Over multiple domains, we experimentally compare our method against the alternative of using only the one-shot model, and find that even under equal search-time budgets, our editing-based paradigm provides significant advantages.
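The inference procedure described above (seed a population from a one-shot model, then evolve it with local edits) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: programs are plain integer lists, `execute` is a trivial renderer, and `one_shot_guess` and `propose_edit` are hypothetical hand-written stand-ins for the learned one-shot and edit networks, not the authors' models.

```python
import random

def execute(program):
    # Toy "renderer" (assumption): the visual output is the parameter list itself.
    return list(program)

def distance(program, target):
    # L1 distance between the rendered program and the visual target.
    return sum(abs(a - b) for a, b in zip(execute(program), target))

def one_shot_guess(target, rng, noise=3):
    # Hypothetical stand-in for the one-shot model: a noisy guess at the whole program.
    return [t + rng.randint(-noise, noise) for t in target]

def propose_edit(program, target, rng):
    # Hypothetical stand-in for the edit network: one local edit that
    # nudges a randomly chosen parameter one unit toward the target.
    i = rng.randrange(len(program))
    rendered = execute(program)
    edited = list(program)
    if rendered[i] != target[i]:
        edited[i] += 1 if rendered[i] < target[i] else -1
    return edited

def infer(target, population_size=8, rounds=50, seed=0):
    rng = random.Random(seed)
    # Initialize the population from the one-shot stand-in.
    population = [one_shot_guess(target, rng) for _ in range(population_size)]
    initial_best = min(distance(p, target) for p in population)
    for _ in range(rounds):
        # Each member proposes one edited child; parents stay in the pool,
        # so the best distance can never get worse.
        children = [propose_edit(p, target, rng) for p in population]
        candidates = sorted(population + children, key=lambda p: distance(p, target))
        population = candidates[:population_size]
    best = population[0]
    return best, initial_best, distance(best, target)

if __name__ == "__main__":
    best, before, after = infer([5, 2, 7, 1])
    print("best program:", best, "distance before:", before, "after:", after)
```

Keeping the parents in the candidate pool is a design choice of this sketch: it guarantees the search is monotone, which makes the comparison against the one-shot starting point easy to read.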
Tiny Eye Movements Are Under a Surprising Degree of Cognitive Control - Neuroscience News
Summary: Ocular drift, or tiny eye movements that seem random, can be influenced by prior knowledge of an expected visual target, researchers report. A very subtle and seemingly random type of eye movement called ocular drift can be influenced by prior knowledge of the expected visual target, suggesting a surprising level of cognitive control over the eyes, according to a study led by Weill Cornell Medicine neuroscientists. The discovery, described Apr. 3 in Current Biology, adds to the scientific understanding of how vision, far from being a mere absorption of incoming signals from the retina, is controlled and directed by cognitive processes. "These eye movements are so tiny that we're not even conscious of them, and yet our brains somehow can use the knowledge of the visual task to control them," says study lead author Dr. Yen-Chu Lin, who carried out the work as a Fred Plum Fellow in Systems Neurology and Neuroscience in the Feil Family Brain and Mind Research Institute at Weill Cornell Medicine. Dr. Lin works in the laboratory of study senior author Dr. Jonathan Victor, the Fred Plum Professor of Neurology at Weill Cornell Medicine. The study involved a close collaboration with the laboratory of Dr. Michele Rucci, professor of brain and cognitive sciences and neuroscience at the University of Rochester.
Further Explorations in Visually-Guided Reaching: Making MURPHY Smarter
MURPHY is a vision-based kinematic controller and path planner based on a connectionist architecture, and implemented with a video camera and Rhino XR-series robot arm. Imitative of the layout of sensory and motor maps in cerebral cortex, MURPHY's internal representations consist of four coarse-coded populations of simple units representing both static and dynamic aspects of the sensory-motor environment. In previously reported work [4], MURPHY first learned a direct kinematic model of his camera-arm system during a period of extended practice, and then used this "mental model" to heuristically guide his hand to unobstructed visual targets. MURPHY has since been extended in two ways: First, he now learns the inverse differential-kinematics of his arm in addition to ordinary direct kinematics, which allows him to push his hand directly towards a visual target without the need for search. Secondly, he now deals with the much more difficult problem of reaching in the presence of obstacles.
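To make the idea of inverse differential kinematics concrete, here is a minimal sketch of reaching toward a target by repeatedly mapping the hand-position error through an inverse Jacobian. It is not MURPHY's method: MURPHY learns this mapping with a connectionist network, whereas the sketch assumes a hypothetical planar two-link arm with unit link lengths and uses the analytic Jacobian. The function names, gain, and starting posture are all illustrative assumptions.

```python
import math

LINK1, LINK2 = 1.0, 1.0  # assumed link lengths of a planar two-link arm

def forward(q1, q2):
    # Direct kinematics: joint angles -> hand position.
    x = LINK1 * math.cos(q1) + LINK2 * math.cos(q1 + q2)
    y = LINK1 * math.sin(q1) + LINK2 * math.sin(q1 + q2)
    return x, y

def jacobian(q1, q2):
    # Partial derivatives of hand position with respect to joint angles.
    j11 = -LINK1 * math.sin(q1) - LINK2 * math.sin(q1 + q2)
    j12 = -LINK2 * math.sin(q1 + q2)
    j21 = LINK1 * math.cos(q1) + LINK2 * math.cos(q1 + q2)
    j22 = LINK2 * math.cos(q1 + q2)
    return j11, j12, j21, j22

def reach(target, q=(0.3, 0.8), gain=0.5, steps=200, tol=1e-4):
    q1, q2 = q
    for _ in range(steps):
        x, y = forward(q1, q2)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break  # hand is on the target
        j11, j12, j21, j22 = jacobian(q1, q2)
        det = j11 * j22 - j12 * j21
        if abs(det) < 1e-6:
            break  # near a singular posture; this sketch simply stops
        # Inverse differential kinematics: dq = gain * J^{-1} * error.
        q1 += gain * (j22 * ex - j12 * ey) / det
        q2 += gain * (-j21 * ex + j11 * ey) / det
    return q1, q2

if __name__ == "__main__":
    q = reach((1.2, 0.8))
    print("joint angles:", q, "hand position:", forward(*q))
```

Because each step moves the hand directly along the error direction, no search over candidate postures is needed, which is the benefit the abstract attributes to learning the inverse differential mapping.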
Unifying the Sensory and Motor Components of Sensorimotor Adaptation
Haith, Adrian, Jackson, Carl P., Miall, R. C., Vijayakumar, Sethu
Adaptation of visually guided reaching movements in novel visuomotor environments (e.g., wearing prism goggles) comprises not only motor adaptation but also substantial sensory adaptation, corresponding to shifts in the perceived spatial location of visual and proprioceptive cues. Previous computational models of the sensory component of visuomotor adaptation have assumed that it is driven purely by the discrepancy introduced between visual and proprioceptive estimates of hand position and is independent of any motor component of adaptation. We instead propose a unified model in which sensory and motor adaptation are jointly driven by optimal Bayesian estimation of the sensory and motor contributions to perceived errors. Our model is able to account for patterns of performance errors during visuomotor adaptation as well as the subsequent perceptual aftereffects. This unified model also makes the surprising prediction that force field adaptation will elicit similar perceptual shifts, even though there is never any discrepancy between visual and proprioceptive observations. We confirm this prediction with an experiment.
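The intuition behind jointly estimating sensory and motor contributions can be shown with a much-simplified, single-step example. This is not the authors' model, which the abstract does not specify in detail. The sketch assumes one observed error that is the sum of a sensory disturbance, a motor disturbance, and observation noise, each zero-mean Gaussian and independent; under those assumptions the posterior mean of each disturbance is the error scaled by that disturbance's share of the total variance. All variable names and variances are illustrative.

```python
def attribute_error(error, var_sensory, var_motor, var_noise):
    # Linear-Gaussian conditioning (assumed model):
    #   error = sensory + motor + noise, all independent and zero-mean.
    # The posterior mean of each term is its variance share times the error.
    total = var_sensory + var_motor + var_noise
    sensory_estimate = var_sensory / total * error
    motor_estimate = var_motor / total * error
    return sensory_estimate, motor_estimate

if __name__ == "__main__":
    # A purely motor perturbation still gets partly attributed to the senses.
    sensory, motor = attribute_error(error=1.0, var_sensory=1.0,
                                     var_motor=3.0, var_noise=0.0)
    print("sensory share:", sensory, "motor share:", motor)
```

In this toy setting the estimator cannot tell where an error came from, so part of any error is credited to the sensory term whenever its prior variance is nonzero. That is consistent with the abstract's prediction that a force field, which is a motor disturbance, would still produce perceptual shifts.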