Guo, Yanjiang
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Zhang, Jianke, Guo, Yanjiang, Hu, Yucheng, Chen, Xiaoyu, Zhu, Xiang, Chen, Jianyu
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce \textbf{UP-VLA}, a \textbf{U}nified VLA model trained with both multi-modal \textbf{U}nderstanding and future \textbf{P}rediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
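A minimal sketch of the kind of joint objective described above, assuming a PyTorch-style setup; the module names, dimensions, loss weights, and batch keys are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedVLASketch(nn.Module):
    """Toy stand-in for one backbone trained with understanding,
    prediction, and action objectives (all names/dims hypothetical)."""
    def __init__(self, dim=512, vocab=1000, action_dim=7):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)               # placeholder for a pre-trained VLM
        self.understanding_head = nn.Linear(dim, vocab)   # vision-language understanding (e.g., answer tokens)
        self.prediction_head = nn.Linear(dim, dim)        # future visual latent prediction
        self.action_head = nn.Linear(dim, action_dim)     # low-level control

    def forward(self, feats):
        h = torch.relu(self.backbone(feats))
        return self.understanding_head(h), self.prediction_head(h), self.action_head(h)

def joint_loss(model, batch, w_und=1.0, w_pred=1.0, w_act=1.0):
    logits, pred_latent, action = model(batch["features"])
    loss_und = F.cross_entropy(logits, batch["answer_ids"])       # multi-modal understanding
    loss_pred = F.mse_loss(pred_latent, batch["future_latent"])   # future prediction
    loss_act = F.mse_loss(action, batch["action"])                # action imitation
    return w_und * loss_und + w_pred * loss_pred + w_act * loss_act
```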
Improving Vision-Language-Action Model with Online Reinforcement Learning
Guo, Yanjiang, Zhang, Jianke, Chen, Xiaoyu, Ji, Xiang, Wang, Yen-Jen, Hu, Yucheng, Chen, Jianyu
Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computational burdens that exceed the capabilities of most local machines. To address these challenges, we propose the iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.
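A hedged sketch of the iterate-between-RL-and-SL idea in plain Python; the stage implementations are passed in as callables because the concrete RL algorithm, data filtering, and fine-tuning recipe are assumptions here rather than details taken from the abstract:

```python
from typing import Any, Callable, List

def iterate_rl_sl(policy: Any,
                  rl_stage: Callable[[Any], List[Any]],      # collects trajectories via online RL
                  sl_stage: Callable[[Any, List[Any]], Any],  # supervised fine-tuning on a dataset
                  is_success: Callable[[Any], bool],
                  expert_data: List[Any],
                  n_iters: int = 10) -> Any:
    """Alternate exploration (RL) and stabilization (SL); hypothetical hooks."""
    dataset = list(expert_data)
    for _ in range(n_iters):
        # Exploration stage: online RL produces new trajectories.
        trajectories = rl_stage(policy)
        # Keep only successful rollouts so supervised learning does not imitate failures.
        dataset.extend(t for t in trajectories if is_success(t))
        # Stability stage: supervised learning on expert data plus successful RL data.
        policy = sl_stage(policy, dataset)
    return policy
```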
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Hu, Yucheng, Guo, Yanjiang, Wang, Pengchao, Chen, Xiaoyu, Wang, Yen-Jen, Zhang, Jianke, Sreenath, Koushil, Lu, Chaochao, Chen, Jianyu
Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, typically trained with two-image contrastive learning or single-image reconstruction, cannot fully capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human and robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1\% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8\% increase in success rates for complex real-world dexterous manipulation tasks.
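A minimal sketch of conditioning an action head on representations from a frozen video prediction model; the "video model" below is a placeholder MLP standing in for a real video diffusion model, and the pooling and head shapes are assumptions:

```python
import torch
import torch.nn as nn

class VideoPredictionPolicySketch(nn.Module):
    """Action head conditioned on features from a frozen predictive video model."""
    def __init__(self, feat_dim=256, action_dim=7, horizon=8):
        super().__init__()
        self.video_model = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        for p in self.video_model.parameters():
            p.requires_grad = False            # reuse predictive representations; do not retrain them
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, action_dim * horizon))
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, obs_feats):              # obs_feats: (B, T, feat_dim)
        with torch.no_grad():
            pred_feats = self.video_model(obs_feats)   # "predictive" latents of the frame sequence
        pooled = pred_feats.mean(dim=1)                # aggregate over the sequence dimension
        return self.action_head(pooled).view(-1, self.horizon, self.action_dim)
```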
Prediction with Action: Visual Policy Learning via Joint Denoising Process
Guo, Yanjiang, Hu, Yucheng, Zhang, Jianke, Wang, Yen-Jen, Chen, Xiaoyu, Lu, Chaochao, Chen, Jianyu
Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, reflecting a good understanding of the physical world. In a parallel line of work, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although diffusion generative models and diffusion policies exhibit distinct capabilities (image prediction and robotic action generation, respectively), they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and the ability to generate actions are highly correlated, since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings, with a 28.0% increase in success rate compared to the strongest baseline.
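A hedged sketch of one joint denoising step in which image tokens and action tokens share a single transformer, in the spirit of the description above; the token layout, dimensions, and output heads are illustrative assumptions rather than the DiT configuration used in the paper:

```python
import torch
import torch.nn as nn

class JointDenoiserSketch(nn.Module):
    """Denoise future-image tokens and action tokens together in one transformer."""
    def __init__(self, dim=256, n_img_tokens=64, n_act_tokens=8, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.img_out = nn.Linear(dim, dim)   # predicts noise on image tokens
        self.act_out = nn.Linear(dim, dim)   # predicts noise on action tokens
        self.n_img = n_img_tokens

    def forward(self, noisy_img_tok, noisy_act_tok, cond_tok):
        # One denoising step: conditioning, image, and action tokens attend to each other.
        x = torch.cat([cond_tok, noisy_img_tok, noisy_act_tok], dim=1)
        h = self.transformer(x)
        img_h = h[:, cond_tok.size(1):cond_tok.size(1) + self.n_img]
        act_h = h[:, cond_tok.size(1) + self.n_img:]
        return self.img_out(img_h), self.act_out(act_h)
```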
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Zhang, Jianke, Guo, Yanjiang, Chen, Xiaoyu, Wang, Yen-Jen, Hu, Yucheng, Shi, Chengming, Chen, Jianyu
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, this success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between frequency and performance. HiRT keeps the VLM running at a low frequency to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
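A minimal sketch of the slow/fast structure described above: a slow encoder refreshes a cached latent every few control steps while a fast policy acts every step on the latest observation plus that latent. The module internals, dimensions, and update period are assumptions, with simple linear layers standing in for the VLM and the vision-based policy:

```python
import torch
import torch.nn as nn

class HierarchicalControllerSketch(nn.Module):
    """Low-frequency feature extractor guiding a high-frequency policy (toy stand-ins)."""
    def __init__(self, obs_dim=128, latent_dim=64, action_dim=7, slow_every=5):
        super().__init__()
        self.slow_encoder = nn.Linear(obs_dim, latent_dim)            # stands in for the VLM backend
        self.fast_policy = nn.Linear(obs_dim + latent_dim, action_dim)
        self.slow_every = slow_every
        self.register_buffer("cached_latent", torch.zeros(1, latent_dim))
        self.step = 0

    def act(self, obs):                                                # obs: (1, obs_dim)
        if self.step % self.slow_every == 0:
            # Low-frequency refresh of the (roughly) temporally invariant features.
            self.cached_latent = self.slow_encoder(obs).detach()
        self.step += 1
        # High-frequency action from the current observation plus the cached latent.
        return self.fast_policy(torch.cat([obs, self.cached_latent], dim=-1))
```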
DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
Guo, Yanjiang, Wang, Yen-Jen, Zha, Lihan, Jiang, Zheyuan, Chen, Jianyu
Large language models (LLMs) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. Previous work has explored how to ground LLMs in robotic tasks to generate feasible and executable textual plans. However, low-level execution in the physical world may deviate from the high-level textual plan due to environmental perturbations or imperfect controller design. In this paper, we propose \textbf{DoReMi}, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs to play a dual role, aiding not only in high-level planning but also in generating constraints that can indicate misalignment during execution. Vision-language models (VLMs) are then utilized to continuously detect constraint violations. Our pipeline monitors the low-level execution and enables timely recovery when plan-execution misalignment occurs. Experiments on various complex tasks with robot arms and humanoid robots demonstrate that our method can lead to higher task success rates and shorter task completion times. Videos of DoReMi are available at \url{https://sites.google.com/view/doremi-paper}.
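A hedged sketch of the detect-and-recover execution loop implied above; the four callables (LLM planning, VLM constraint checking, low-level execution, and recovery/re-planning) are hypothetical hooks, not the paper's API:

```python
def detect_and_recover(llm_plan, vlm_check, execute, recover):
    """Execute a plan while monitoring LLM-generated constraints with a VLM checker."""
    plan, constraints = llm_plan()                   # high-level subgoals + textual constraints
    i = 0
    while i < len(plan):
        done = execute(plan[i])                      # run the low-level controller for one chunk
        violated = [c for c in constraints if not vlm_check(c)]   # continuous VLM monitoring
        if violated:
            # Plan-execution misalignment detected: re-plan from the failure point.
            plan, constraints, i = recover(plan, i, violated)
            continue
        if done:
            i += 1                                   # subgoal finished, move to the next one
    return True
```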
Decentralized Motor Skill Learning for Complex Robotic Systems
Guo, Yanjiang, Jiang, Zheyuan, Wang, Yen-Jen, Gao, Jingyue, Chen, Jianyu
Reinforcement learning (RL) has achieved remarkable success in complex robotic systems (e.g., quadruped locomotion). In previous works, the RL-based controller was typically implemented as a single neural network with concatenated observation input. However, the corresponding learned policy is highly task-specific. Since all motors are controlled in a centralized way, out-of-distribution local observations can impact global motors through the single coupled neural network policy. In contrast, animals and humans can control their limbs separately. Inspired by this biological phenomenon, we propose a Decentralized motor skill (DEMOS) learning algorithm to automatically discover motor groups that can be decoupled from each other while preserving essential connections, and then learn a decentralized motor control policy. Our method improves the robustness and generalization of the policy without sacrificing performance. Experiments on quadruped and humanoid robots demonstrate that the learned policy is robust against local motor malfunctions and can be transferred to new tasks.
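A minimal sketch of a decentralized policy in which each motor group has its own network over local observations, contrasting with the single concatenated-input network described above; the grouping (one head per leg of a quadruped) and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecentralizedPolicySketch(nn.Module):
    """One small policy head per motor group, each seeing only local observations."""
    def __init__(self, group_obs_dims, group_action_dims, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(o, hidden), nn.Tanh(), nn.Linear(hidden, a))
            for o, a in zip(group_obs_dims, group_action_dims)
        )

    def forward(self, group_obs):
        # Each group acts on its own observations, so an out-of-distribution local
        # signal cannot directly corrupt the other groups' actions.
        return torch.cat([head(obs) for head, obs in zip(self.heads, group_obs)], dim=-1)

# Example: a quadruped with four legs, 12 local observation dims and 3 motors per leg.
policy = DecentralizedPolicySketch([12] * 4, [3] * 4)
actions = policy([torch.randn(1, 12) for _ in range(4)])   # -> shape (1, 12)
```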
Reinforcement learning with Demonstrations from Mismatched Task under Sparse Reward
Guo, Yanjiang, Gao, Jingyue, Wu, Zheng, Shi, Chengming, Chen, Jianyu
Reinforcement learning has been applied to various real-world tasks, including robotic manipulation with large state-action spaces and sparse reward signals [1]. In these tasks, standard reinforcement learning tends to perform a large amount of useless exploration and easily falls into local optima. To alleviate this problem, previous works often use expert demonstrations to aid online learning, adopting successful trajectories to guide the exploration process [2, 3]. However, standard learning-from-demonstration algorithms often assume that the target learning task is exactly the same as the task in which the demonstrations were collected [4, 5, 6]. Under this assumption, experts need to collect corresponding demonstrations for each new task, which can be expensive and inefficient. In this paper, we consider a new learning setting where expert data is collected under a single task, while the agent is required to solve different new tasks. For instance, as shown in Figure 1, a robot arm aims to solve peg-in-hole tasks. The demonstration is collected on a certain type of hole, while the target tasks have different hole shapes (changes in environmental dynamics) or position shifts (changes in reward function). This can be challenging, as agents cannot directly imitate demonstrations from mismatched tasks due to the changes in dynamics and reward function. However, compared to learning from scratch, those demonstrations should still provide useful information to help exploration.