PaLM-E
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Ao, Shuang, Salim, Flora D., Khan, Simon
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it difficult for them to develop better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
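The bidirectional LLM/VLM loop the abstract describes can be sketched as a toy control loop. Everything below (function names, the drawer scenario, the feedback strings) is an illustrative stand-in for exposition, not EMAC+'s actual interfaces; the real system trains both models jointly rather than composing fixed stubs.

```python
def llm_plan(instruction, feedback=None):
    """Stand-in LLM planner: emits a textual step list, revised on feedback."""
    if feedback:  # replan with the VLM's execution feedback folded in
        return ["open drawer", "grasp rice chips", "hand over"]
    return ["grasp rice chips", "hand over"]

def vlm_execute(step, state):
    """Stand-in VLM controller: runs one step against a mock visual state."""
    if step == "grasp rice chips" and not state["drawer_open"]:
        return "failed: target occluded by closed drawer"
    if step == "open drawer":
        state["drawer_open"] = True
    return "ok"

def run(instruction, state):
    """Plan, execute, and refine: the LLM replans when the VLM reports failure."""
    plan, executed = llm_plan(instruction), []
    for step in plan:
        fb = vlm_execute(step, state)
        if fb != "ok":
            # Bidirectional loop: feedback from low-level visual control
            # flows back into the high-level planner.
            plan = llm_plan(instruction, feedback=fb)
            return executed + [s for s in plan
                               if vlm_execute(s, state) == "ok"]
        executed.append(step)
    return executed
```

With the drawer closed, the first grasp fails, the planner inserts the corrective "open drawer" step, and the refined plan succeeds.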
Representing Online Handwriting for Recognition in Large Vision-Language Models
Fadeeva, Anastasiia, Schlattner, Philippe, Maksai, Andrii, Collier, Mark, Kokiopoulou, Efi, Berent, Jesse, Musat, Claudiu
The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes both a time-ordered sequence of strokes as text, and as image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
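A serialization of digital ink as a time-ordered text token sequence, in the spirit of what the abstract describes, might look like the sketch below. The exact token format (quantized `x,y` pairs, a `|` stroke separator) is an assumption for illustration; the paper's actual tokenization may differ.

```python
def ink_to_text(strokes, grid=64):
    """Serialize digital ink as a plain-text token sequence.

    `strokes` is a list of strokes in writing order, each a list of
    (x, y) points normalized to [0, 1]. Coordinates are quantized to a
    `grid`-sized integer lattice so the sequence stays compact enough
    for a text tokenizer; stroke order preserves the temporal signal
    that a rendered image alone would lose.
    """
    parts = []
    for stroke in strokes:
        pts = " ".join(f"{round(x * (grid - 1))},{round(y * (grid - 1))}"
                       for x, y in stroke)
        parts.append(pts)
    # "|" marks pen lifts between consecutive strokes (illustrative choice)
    return " | ".join(parts)
```

For example, a single diagonal stroke from the top-left to the bottom-right corner serializes to `"0,0 63,63"`.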
Google's PaLM-E is a generalist robot brain that takes commands
On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining. According to Google, when given a high-level command, such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions by itself. PaLM-E does this by analyzing data from the robot's camera without needing a pre-processed scene representation. This eliminates the need for a human to pre-process or annotate the data and allows for more autonomous robotic control.
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny, Xia, Fei, Sajjadi, Mehdi S. M., Lynch, Corey, Chowdhery, Aakanksha, Ichter, Brian, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, Huang, Wenlong, Chebotar, Yevgen, Sermanet, Pierre, Duckworth, Daniel, Levine, Sergey, Vanhoucke, Vincent, Hausman, Karol, Toussaint, Marc, Greff, Klaus, Zeng, Andy, Mordatch, Igor, Florence, Pete
Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pretrained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning.

Large language models (LLMs) demonstrate strong reasoning capabilities across various domains, including dialogue (Glaese et al., 2022; Thoppilan et al., 2022), step-by-step reasoning (Wei et al., 2022; Kojima et al., 2022), math problem solving (Lewkowycz et al., 2022; Polu et al., 2022), and code writing (Chen et al., 2021a). However, a limitation of such models for inference in the real world is the issue of grounding: while training LLMs on massive textual data may lead to representations that relate to our physical world, connecting those representations to real-world visual and physical sensor modalities is essential to solving a wider range of grounded real-world problems in computer vision and robotics (Tellex et al., 2020).
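The "multi-modal sentences" idea can be illustrated with a toy interleaving routine: wherever an image placeholder appears in the token stream, the text embedding is replaced by the image encoder's output tokens, and the whole sequence is fed to the language model. The encoders below are random/averaging stubs, not PaLM-E's actual ViT or LLM embedders, and the `<img>` placeholder is an assumed convention.

```python
import numpy as np

D = 8  # toy embedding width shared by the text and sensor encoders

def embed_text(token):
    """Stand-in for the LLM's token embedding lookup (random but stable)."""
    rng = np.random.default_rng(abs(hash(token)) % 2**32)
    return rng.standard_normal(D)

def encode_image(image):
    """Stand-in for a ViT-style encoder: emits 2 'image tokens' of width D."""
    return np.tile(image.mean(), (2, D))

def multimodal_sentence(tokens, images):
    """Interleave text embeddings with image encodings where <img> appears."""
    rows, it = [], iter(images)
    for tok in tokens:
        if tok == "<img>":
            rows.extend(encode_image(next(it)))  # splice in image tokens
        else:
            rows.append(embed_text(tok))
    return np.stack(rows)

seq = multimodal_sentence(
    ["Q:", "what", "is", "in", "<img>", "?"],
    [np.zeros((4, 4))],
)
```

Here five text tokens plus two image tokens yield a (7, 8) sequence; the language model then consumes it exactly as it would an all-text sequence, which is what lets the encodings be trained end-to-end against the pretrained LLM.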