AITopics | Duan, Jiafei

Collaborating Authors

Duan, Jiafei

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

Fang, Haoquan, Grotz, Markus, Pumacay, Wilbert, Wang, Yi Ru, Fox, Dieter, Krishna, Ranjay, Duan, Jiafei

arXiv.org Artificial IntelligenceFeb-11-2025

Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalization to complex environmental variations and addressing memory-dependent tasks. To bridge this gap, we introduce SAM2Act, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of 86.8% across 18 tasks in the RLBench benchmark, and demonstrates robust generalization on The Colosseum benchmark, with only a 4.3% performance gap under diverse environmental perturbations. Building on this foundation, we propose SAM2Act+, a memory-based architecture inspired by SAM2, which incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce MemoryBench, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves competitive performance on MemoryBench, significantly outperforming existing approaches and pushing the boundaries of memory-enabled robotic systems. Project page: https://sam2act.github.io/

machine learning, reinforcement learning, sam2act, (17 more...)

arXiv.org Artificial Intelligence

2501.18564

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
(2 more...)

Add feedback

SAT: Spatial Aptitude Training for Multimodal Language Models

Ray, Arijit, Duan, Jiafei, Tan, Reuben, Bashkirova, Dina, Hendrix, Rose, Ehsani, Kiana, Kembhavi, Aniruddha, Plummer, Bryan A., Krishna, Ranjay, Zeng, Kuo-Hao, Saenko, Kate

arXiv.org Artificial IntelligenceDec-10-2024

Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: $23\%$ on CVBench, $8\%$ on the harder BLINK benchmark, and $18\%$ on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at http://arijitray1993.github.io/SAT/ .

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.07755

Country:

North America > United States (0.28)
Asia (0.28)

Genre: Research Report (0.81)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Duan, Jiafei, Pumacay, Wilbert, Kumar, Nishanth, Wang, Yi Ru, Tian, Shulin, Yuan, Wentao, Krishna, Ranjay, Fox, Dieter, Mandlekar, Ajay, Guo, Yijie

arXiv.org Artificial IntelligenceSep-30-2024

Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they still struggle with failure recognition, limiting their real-world applicability. We introduce AHA, an open-source VLM designed to detect and reason about failures in robotic manipulation using natural language. By framing failure detection as a free-form reasoning task, AHA identifies failures and provides detailed, adaptable explanations across different robots, tasks, and environments. We fine-tuned AHA using FailGen, a scalable framework that generates the first large-scale dataset of robotic failure trajectories, the AHA dataset. FailGen achieves this by procedurally perturbing successful demonstrations from simulation. Despite being trained solely on the AHA dataset, AHA generalizes effectively to real-world failure datasets, robotic systems, and unseen tasks. It surpasses the second-best model (GPT-4o in-context learning) by 10.3% and exceeds the average performance of six compared models including five state-of-the-art VLMs by 35.3% across multiple metrics and datasets. We integrate AHA into three manipulation frameworks that utilize LLMs/VLMs for reinforcement learning, task and motion planning, and zero-shot trajectory generation. AHA's failure feedback enhances these policies' performances by refining dense reward functions, optimizing task planning, and improving sub-task verification, boosting task success rates by an average of 21.4% across all three tasks compared to GPT-4 models.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.00371

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

Yuan, Wentao, Duan, Jiafei, Blukis, Valts, Pumacay, Wilbert, Krishna, Ranjay, Murali, Adithyavairavan, Mousavian, Arsalan, Fox, Dieter

arXiv.org Artificial IntelligenceJun-15-2024

From rearranging objects on a table to putting groceries into shelves, robots must plan precise action points to perform tasks accurately and reliably. In spite of the recent adoption of vision language models (VLMs) to control robot behavior, VLMs struggle to precisely articulate robot actions using language. We introduce an automatic synthetic data generation pipeline that instruction-tunes VLMs to robotic domains and needs. Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions. Compared to alternative approaches, our method requires no real-world data collection or human demonstration, making it much more scalable to diverse environments and viewpoints. In addition, RoboPoint is a general model that enables several downstream applications such as robot navigation, manipulation, and augmented reality (AR) assistance. Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs (GPT-4o) and visual prompting techniques (PIVOT) by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks. Project website: https://robo-point.github.io.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2406.10721

Country: Europe > Netherlands (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.89)

Add feedback

Octopi: Object Property Reasoning with Large Tactile-Language Models

Yu, Samson, Lin, Kelvin, Xiao, Anxing, Duan, Jiafei, Soh, Harold

arXiv.org Artificial IntelligenceJun-4-2024

Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.

large language model, machine learning, physical property, (18 more...)

arXiv.org Artificial Intelligence

2405.02794

Country: Asia (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.68)
(2 more...)

Add feedback

EVE: Enabling Anyone to Train Robots using Augmented Reality

Wang, Jun, Chang, Chun-Cheng, Duan, Jiafei, Fox, Dieter, Krishna, Ranjay

arXiv.org Artificial IntelligenceMay-18-2024

The increasing affordability of robot hardware is accelerating the integration of robots into everyday activities. However, training robots to automate tasks typically requires physical robots and expensive demonstration data from trained human annotators. Consequently, only those with access to physical robots produce demonstrations to train robots. To mitigate this issue, we introduce EVE, an iOS app that enables everyday users to train robots using intuitive augmented reality visualizations without needing a physical robot. With EVE, users can collect demonstrations by specifying waypoints with their hands, visually inspecting the environment for obstacles, modifying existing waypoints, and verifying collected trajectories. In a user study ($N=14$, $D=30$) consisting of three common tabletop tasks, EVE outperformed three state-of-the-art interfaces in success rate and was comparable to kinesthetic teaching-physically moving a real robot-in completion time, usability, motion intent communication, enjoyment, and preference ($mean_{p}=0.30$). We conclude by enumerating limitations and design considerations for future AR-based demonstration collection systems for robotics.

artificial intelligence, human computer interaction, robot, (15 more...)

arXiv.org Artificial Intelligence

2404.06089

Country: North America > United States > Massachusetts (0.14)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.62)
Information Technology > Artificial Intelligence > Robots > Robot Planning & Action (0.46)

Add feedback

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

Pumacay, Wilbert, Singh, Ishika, Duan, Jiafei, Krishna, Ranjay, Thomason, Jesse, Fox, Dieter

arXiv.org Artificial IntelligenceFeb-12-2024

To realize effective large-scale, real-world robotic applications, we must evaluate how well our robot policies adapt to changes in environmental conditions. Unfortunately, a majority of studies evaluate robot performance in environments closely resembling or even identical to the training setup. We present THE COLOSSEUM, a novel simulation benchmark, with 20 diverse manipulation tasks, that enables systematical evaluation of models across 12 axes of environmental perturbations. These perturbations include changes in color, texture, and size of objects, table-tops, and backgrounds; we also vary lighting, distractors, and camera pose. Using THE COLOSSEUM, we compare 4 state-of-the-art manipulation models to reveal that their success rate degrades between 30-50% across these perturbation factors. When multiple perturbations are applied in unison, the success rate degrades $\geq$75%. We identify that changing the number of distractor objects, target object color, or lighting conditions are the perturbations that reduce model performance the most. To verify the ecological validity of our results, we show that our results in simulation are correlated ($\bar{R}^2 = 0.614$) to similar perturbations in real-world experiments. We open source code for others to use THE COLOSSEUM, and also release code to 3D print the objects used to replicate the real-world perturbations. Ultimately, we hope that THE COLOSSEUM will serve as a benchmark to identify modeling decisions that systematically improve generalization for manipulation. See https://robot-colosseum.github.io/ for more details.

artificial intelligence, machine learning, perturbation, (15 more...)

arXiv.org Artificial Intelligence

2402.08191

Country:

North America > United States > California (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.68)

Industry:

Leisure & Entertainment (0.47)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Selective Visual Representations Improve Convergence and Generalization for Embodied AI

Eftekhar, Ainaz, Zeng, Kuo-Hao, Duan, Jiafei, Farhadi, Ali, Kembhavi, Ani, Krishna, Ranjay

arXiv.org Artificial IntelligenceNov-7-2023

Embodied AI models often employ off the shelf vision backbones like CLIP to encode their visual observations. Although such general purpose representations encode rich syntactic and semantic information about the scene, much of this information is often irrelevant to the specific task at hand. This introduces noise within the learning process and distracts the agent's focus from task-relevant visual cues. Inspired by selective attention in humans-the process through which people filter their perception based on their experiences, knowledge, and the task at hand-we introduce a parameter-efficient approach to filter visual stimuli for embodied AI. Our approach induces a task-conditioned bottleneck using a small learnable codebook module. This codebook is trained jointly to optimize task reward and acts as a task-conditioned selective filter over the visual observation. Our experiments showcase state-of-the-art performance for object goal navigation and object displacement across 5 benchmarks, ProcTHOR, ArchitecTHOR, RoboTHOR, AI2-iTHOR, and ManipulaTHOR. The filtered representations produced by the codebook are also able generalize better and converge faster when adapted to other simulation environments such as Habitat. Our qualitative analyses show that agents explore their environments more effectively and their representations retain task-relevant information like target object recognition while ignoring superfluous information about other objects. Code and pretrained models are available at our project website: https://embodied-codebook.github.io.

machine learning, natural language, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2311.04193

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
(3 more...)

Add feedback

NEWTON: Are Large Language Models Capable of Physical Reasoning?

Wang, Yi Ru, Duan, Jiafei, Fox, Dieter, Srinivasa, Siddhartha

arXiv.org Artificial IntelligenceOct-10-2023

Large Language Models (LLMs), through their contextualized representations, have been empirically proven to encapsulate syntactic, semantic, word sense, and common-sense knowledge. However, there has been limited exploration of their physical reasoning abilities, specifically concerning the crucial attributes for comprehending everyday objects. To address this gap, we introduce NEWTON, a repository and benchmark for evaluating the physics reasoning skills of LLMs. Further, to enable domain-specific adaptation of this benchmark, we present a pipeline to enable researchers to generate a variant of this benchmark that has been customized to the objects and attributes relevant for their application. The NEWTON repository comprises a collection of 2800 object-attribute pairs, providing the foundation for generating infinite-scale assessment templates. The NEWTON benchmark consists of 160K QA questions, curated using the NEWTON repository to investigate the physical reasoning capabilities of several mainstream language models across foundational, explicit, and implicit reasoning tasks. Through extensive empirical analysis, our results highlight the capabilities of LLMs for physical reasoning. We find that LLMs like GPT-4 demonstrate strong reasoning capabilities in scenario-based tasks but exhibit less consistency in object-attribute reasoning compared to humans (50% vs. 84%). Furthermore, the NEWTON platform demonstrates its potential for evaluating and enhancing language models, paving the way for their integration into physically grounded settings, such as robotic manipulation. Project site: https://newtonreasoning.github.io

deep learning, machine learning, physical reasoning, (5 more...)

arXiv.org Artificial Intelligence

2310.07018

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

AR2-D2:Training a Robot Without a Robot

Duan, Jiafei, Wang, Yi Ru, Shridhar, Mohit, Fox, Dieter, Krishna, Ranjay

arXiv.org Artificial IntelligenceJun-23-2023

Diligently gathered human demonstrations serve as the unsung heroes empowering the progression of robot learning. Today, demonstrations are collected by training people to use specialized controllers, which (tele-)operate robots to manipulate a small number of objects. By contrast, we introduce AR2-D2: a system for collecting demonstrations which (1) does not require people with specialized training, (2) does not require any real robots during data collection, and therefore, (3) enables manipulation of diverse objects with a real robot. AR2-D2 is a framework in the form of an iOS app that people can use to record a video of themselves manipulating any object while simultaneously capturing essential data modalities for training a real robot. We show that data collected via our system enables the training of behavior cloning agents in manipulating real objects. Our experiments further show that training with our AR data is as effective as training with real-world robot demonstrations. Moreover, our user study indicates that users find AR2-D2 intuitive to use and require no training in contrast to four other frequently employed methods for collecting robot demonstrations.

artificial intelligence, demonstration, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2306.13818

Genre: Research Report > Experimental Study (0.47)

Industry: Leisure & Entertainment > Games > Computer Games (0.94)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.47)

Add feedback