PaLM-E
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM
Ao, Shuang, Salim, Flora D., Khan, Simon
Although LLMs demonstrate proficiency in several text-based reasoning and planning tasks, their implementation in robotics control is constrained by significant deficiencies: (1) LLM agents are designed to work mainly with textual inputs rather than visual conditions; (2) Current multimodal agents treat LLMs as static planners, which separates their reasoning from environment dynamics, resulting in actions that do not take domain-specific knowledge into account; and (3) LLMs are not designed to learn from visual interactions, which makes it difficult for them to develop better policies for specific domains. In this paper, we introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM via a bidirectional training paradigm. Unlike existing methods, EMAC+ dynamically refines high-level textual plans generated by an LLM using real-time feedback from a VLM executing low-level visual control tasks. We address critical limitations of previous models by enabling the LLM to internalize visual environment dynamics directly through interactive experience, rather than relying solely on static symbolic mappings. Extensive experimental evaluations on ALFWorld and RT-1 benchmarks demonstrate that EMAC+ achieves superior task performance, robustness against noisy observations, and efficient learning. We also conduct thorough ablation studies and provide detailed analyses of success and failure cases.
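The bidirectional LLM/VLM loop the abstract describes can be sketched as a toy control loop. Everything below (function names, the drawer scenario, the feedback strings) is an illustrative stand-in for exposition, not EMAC+'s actual interfaces; the real system trains both models jointly rather than composing fixed stubs.

```python
def llm_plan(instruction, feedback=None):
    """Stand-in LLM planner: emits a textual step list, revised on feedback."""
    if feedback:  # replan with the VLM's execution feedback folded in
        return ["open drawer", "grasp rice chips", "hand over"]
    return ["grasp rice chips", "hand over"]

def vlm_execute(step, state):
    """Stand-in VLM controller: runs one step against a mock visual state."""
    if step == "grasp rice chips" and not state["drawer_open"]:
        return "failed: target occluded by closed drawer"
    if step == "open drawer":
        state["drawer_open"] = True
    return "ok"

def run(instruction, state):
    """Plan, execute, and refine: the LLM replans when the VLM reports failure."""
    plan, executed = llm_plan(instruction), []
    for step in plan:
        fb = vlm_execute(step, state)
        if fb != "ok":
            # Bidirectional loop: feedback from low-level visual control
            # flows back into the high-level planner.
            plan = llm_plan(instruction, feedback=fb)
            return executed + [s for s in plan
                               if vlm_execute(s, state) == "ok"]
        executed.append(step)
    return executed
```

With the drawer closed, the first grasp fails, the planner inserts the corrective "open drawer" step, and the refined plan succeeds.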
Representing Online Handwriting for Recognition in Large Vision-Language Models
Fadeeva, Anastasiia, Schlattner, Philippe, Maksai, Andrii, Collier, Mark, Kokiopoulou, Efi, Berent, Jesse, Musat, Claudiu
The adoption of tablets with touchscreens and styluses is increasing, and a key feature is converting handwriting to text, enabling search, indexing, and AI assistance. Meanwhile, vision-language models (VLMs) are now the go-to solution for image understanding, thanks to both their state-of-the-art performance across a variety of tasks and the simplicity of a unified approach to training, fine-tuning, and inference. While VLMs obtain high performance on image-based tasks, they perform poorly on handwriting recognition when applied naively, i.e., by rendering handwriting as an image and performing optical character recognition (OCR). In this paper, we study online handwriting recognition with VLMs, going beyond naive OCR. We propose a novel tokenized representation of digital ink (online handwriting) that includes both a time-ordered sequence of strokes as text, and as image. We show that this representation yields results comparable to or better than state-of-the-art online handwriting recognizers. Wide applicability is shown through results with two different VLM families, on multiple public datasets. Our approach can be applied to off-the-shelf VLMs, does not require any changes in their architecture, and can be used in both fine-tuning and parameter-efficient tuning. We perform a detailed ablation study to identify the key elements of the proposed representation.
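A serialization of digital ink as a time-ordered text token sequence, in the spirit of what the abstract describes, might look like the sketch below. The exact token format (quantized `x,y` pairs, a `|` stroke separator) is an assumption for illustration; the paper's actual tokenization may differ.

```python
def ink_to_text(strokes, grid=64):
    """Serialize digital ink as a plain-text token sequence.

    `strokes` is a list of strokes in writing order, each a list of
    (x, y) points normalized to [0, 1]. Coordinates are quantized to a
    `grid`-sized integer lattice so the sequence stays compact enough
    for a text tokenizer; stroke order preserves the temporal signal
    that a rendered image alone would lose.
    """
    parts = []
    for stroke in strokes:
        pts = " ".join(f"{round(x * (grid - 1))},{round(y * (grid - 1))}"
                       for x, y in stroke)
        parts.append(pts)
    # "|" marks pen lifts between consecutive strokes (illustrative choice)
    return " | ".join(parts)
```

For example, a single diagonal stroke from the top-left to the bottom-right corner serializes to `"0,0 63,63"`.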
Google's PaLM-E is a generalist robot brain that takes commands
On Monday, a group of AI researchers from Google and the Technical University of Berlin unveiled PaLM-E, a multimodal embodied visual-language model (VLM) with 562 billion parameters that integrates vision and language for robotic control. They claim it is the largest VLM ever developed and that it can perform a variety of tasks without the need for retraining. According to Google, when given a high-level command, such as "bring me the rice chips from the drawer," PaLM-E can generate a plan of action for a mobile robot platform with an arm (developed by Google Robotics) and execute the actions by itself. PaLM-E does this by analyzing data from the robot's camera without needing a pre-processed scene representation. This eliminates the need for a human to pre-process or annotate the data and allows for more autonomous robotic control.
PaLM-E: An Embodied Multimodal Language Model
Driess, Danny, Xia, Fei, Sajjadi, Mehdi S. M., Lynch, Corey, Chowdhery, Aakanksha, Ichter, Brian, Wahid, Ayzaan, Tompson, Jonathan, Vuong, Quan, Yu, Tianhe, Huang, Wenlong, Chebotar, Yevgen, Sermanet, Pierre, Duckworth, Daniel, Levine, Sergey, Vanhoucke, Vincent, Hausman, Karol, Toussaint, Marc, Greff, Klaus, Zeng, Andy, Mordatch, Igor, Florence, Pete
Large language models have been demonstrated to perform complex tasks. However, enabling general inference in the real world, e.g. for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pretrained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning.

Large language models (LLMs) demonstrate strong reasoning capabilities across various domains, including dialogue (Glaese et al., 2022; Thoppilan et al., 2022), step-by-step reasoning (Wei et al., 2022; Kojima et al., 2022), math problem solving (Lewkowycz et al., 2022; Polu et al., 2022), and code writing (Chen et al., 2021a). However, a limitation of such models for inference in the real world is the issue of grounding: while training LLMs on massive textual data may lead to representations that relate to our physical world, connecting those representations to real-world visual and physical sensor modalities is essential to solving a wider range of grounded real-world problems in computer vision and robotics (Tellex et al., 2020).
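The "multi-modal sentences" idea can be illustrated with a toy interleaving routine: wherever an image placeholder appears in the token stream, the text embedding is replaced by the image encoder's output tokens, and the whole sequence is fed to the language model. The encoders below are random/averaging stubs, not PaLM-E's actual ViT or LLM embedders, and the `<img>` placeholder is an assumed convention.

```python
import numpy as np

D = 8  # toy embedding width shared by the text and sensor encoders

def embed_text(token):
    """Stand-in for the LLM's token embedding lookup (random but stable)."""
    rng = np.random.default_rng(abs(hash(token)) % 2**32)
    return rng.standard_normal(D)

def encode_image(image):
    """Stand-in for a ViT-style encoder: emits 2 'image tokens' of width D."""
    return np.tile(image.mean(), (2, D))

def multimodal_sentence(tokens, images):
    """Interleave text embeddings with image encodings where <img> appears."""
    rows, it = [], iter(images)
    for tok in tokens:
        if tok == "<img>":
            rows.extend(encode_image(next(it)))  # splice in image tokens
        else:
            rows.append(embed_text(tok))
    return np.stack(rows)

seq = multimodal_sentence(
    ["Q:", "what", "is", "in", "<img>", "?"],
    [np.zeros((4, 4))],
)
```

Here five text tokens plus two image tokens yield a (7, 8) sequence; the language model then consumes it exactly as it would an all-text sequence, which is what lets the encodings be trained end-to-end against the pretrained LLM.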