Goto

Collaborating Authors: Peng, Yan


PointVLA: Injecting the 3D World into Vision-Language-Action Models

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits the spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less useful blocks in the vanilla action expert, ensuring that 3D features are injected only into these blocks and minimizing disruption to pre-trained representations. Extensive experiments demonstrate that PointVLA outperforms state-of-the-art 2D imitation learning methods, such as OpenVLA, Diffusion Policy, and DexVLA, across both simulated and real-world robotic tasks. Specifically, we highlight several key advantages of PointVLA enabled by point cloud integration: (1) few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each; (2) real-vs-photo discrimination, where PointVLA distinguishes real objects from their images, leveraging 3D world knowledge to improve safety and reliability; (3) height adaptability, where, unlike conventional 2D imitation learning methods, PointVLA enables robots to adapt to objects at table heights unseen in the training data. Furthermore, PointVLA achieves strong performance in long-horizon tasks, such as picking and packing objects from a moving conveyor belt, showcasing its ability to generalize across complex, dynamic environments.
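The following is a minimal, hypothetical PyTorch sketch of the injection idea described in the abstract: a frozen stack of action-expert blocks receives point cloud features only at a chosen subset of blocks, through small zero-initialized adapters so the pre-trained behavior is preserved at initialization. All class names, layer sizes, and the choice of injected blocks are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Toy point-cloud encoder: shared MLP over points followed by max pooling."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, pts):                            # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values         # (B, dim)

class InjectionAdapter(nn.Module):
    """Lightweight modular block; zero-initialized so injection starts as a no-op."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)
    def forward(self, h, pc_feat):
        return h + self.proj(pc_feat).unsqueeze(1)     # broadcast over tokens

class PointVLASketch(nn.Module):
    def __init__(self, expert_blocks, inject_layers, dim=256):
        super().__init__()
        self.blocks = expert_blocks                    # pre-trained action expert, frozen
        for p in self.blocks.parameters():
            p.requires_grad = False
        self.inject_layers = set(inject_layers)        # chosen via skip-block analysis
        self.pc_encoder = PointCloudEncoder(dim)
        self.adapters = nn.ModuleDict({str(i): InjectionAdapter(dim) for i in inject_layers})
    def forward(self, h, pts):
        pc_feat = self.pc_encoder(pts)
        for i, block in enumerate(self.blocks):
            h = block(h)
            if i in self.inject_layers:                # inject only into the "less useful" blocks
                h = self.adapters[str(i)](h, pc_feat)
        return h

# Usage with stand-in expert blocks:
expert = nn.ModuleList([nn.TransformerEncoderLayer(256, 4, batch_first=True) for _ in range(6)])
model = PointVLASketch(expert, inject_layers=[3, 4, 5])
out = model(torch.randn(2, 10, 256), torch.randn(2, 1024, 3))
print(out.shape)    # torch.Size([2, 10, 256])
```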


Unity RL Playground: A Versatile Reinforcement Learning Framework for Mobile Robots

arXiv.org Artificial Intelligence

This paper introduces Unity RL Playground, an open-source reinforcement learning framework built on top of Unity ML-Agents. Unity RL Playground automates the process of training mobile robots to perform various locomotion tasks such as walking, running, and jumping in simulation, with the potential for seamless transfer to real hardware. Key features include one-click training for imported robot models, universal compatibility with diverse robot configurations, multi-mode motion learning capabilities, and extreme performance testing to aid in robot design optimization and morphological evolution. The attached video can be found at https://linqi-ye.github.io/video/iros25.mp4 and the code is coming soon.
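As an illustration of the kind of locomotion objective such a framework trains against (walking, running, jumping under different commands), here is a generic per-step reward sketch in Python. It is not Unity RL Playground's actual API, which is built on Unity ML-Agents; the terms and weights are assumptions chosen only to show the usual structure of command tracking plus regularization.

```python
import numpy as np

def locomotion_reward(base_vel, cmd_vel, base_height, target_height,
                      joint_torques, fallen,
                      w_vel=1.0, w_height=0.5, w_energy=1e-3, fall_penalty=10.0):
    """Reward = velocity tracking + height tracking - energy cost - fall penalty."""
    r_vel = np.exp(-np.sum((base_vel - cmd_vel) ** 2))      # track commanded velocity
    r_height = np.exp(-(base_height - target_height) ** 2)  # keep the body near target height
    r_energy = w_energy * np.sum(np.square(joint_torques))  # discourage large torques
    r_fall = fall_penalty if fallen else 0.0
    return w_vel * r_vel + w_height * r_height - r_energy - r_fall

# Example step for a "running" command on a 12-joint robot:
r = locomotion_reward(base_vel=np.array([1.8, 0.0]), cmd_vel=np.array([2.0, 0.0]),
                      base_height=0.55, target_height=0.6,
                      joint_torques=np.random.uniform(-20, 20, 12), fallen=False)
print(round(r, 3))
```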


VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

arXiv.org Artificial Intelligence

Visual reasoning refers to the task of solving questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and are hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER uses this fine-grained visual knowledge to paraphrase questions that suffer from underspecification. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of ``evidence for reasoning'' to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection enables VIKSER to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results in relevant tasks.
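A minimal sketch of how an evidence-grounded prompt in the spirit of Chain-of-Evidence could be assembled from fine-grained visual knowledge (subject-relation-object triples) plus a paraphrased question. The triples, the paraphrase, and the prompt wording are illustrative assumptions; VIKSER's actual prompts may differ.

```python
def build_coe_prompt(question, visual_triples, paraphrased_question):
    """Combine a paraphrased question with visual evidence into one prompt."""
    evidence_lines = [f"- {s} {r} {o}" for (s, r, o) in visual_triples]
    return (
        "Question (original): " + question + "\n"
        "Question (paraphrased to resolve underspecification): "
        + paraphrased_question + "\n"
        "Visual evidence extracted from the image:\n"
        + "\n".join(evidence_lines) + "\n"
        "Reason step by step, citing the evidence lines that support each step, "
        "then give the final answer."
    )

triples = [("man", "holding", "umbrella"), ("umbrella", "above", "dog")]
print(build_coe_prompt(
    question="What is he holding?",
    visual_triples=triples,
    paraphrased_question="What is the man standing next to the dog holding?",
))
```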


PhiP-G: Physics-Guided Text-to-3D Compositional Scene Generation

arXiv.org Artificial Intelligence

Text-to-3D asset generation has achieved significant optimization under the supervision of 2D diffusion priors. However, when dealing with compositional scenes, existing methods encounter several challenges: (1) failure to ensure that composite scene layouts comply with physical laws; (2) difficulty in accurately capturing the assets and relationships described in complex scene descriptions; (3) limited autonomous asset generation capabilities among layout approaches leveraging large language models (LLMs). To avoid these compromises, we propose a novel framework for compositional scene generation, PhiP-G, which seamlessly integrates generation techniques with layout guidance based on a world model. Leveraging LLM-based agents, PhiP-G analyzes the complex scene description to generate a scene graph, and integrates a multimodal 2D generation agent with a 3D Gaussian generation method for targeted asset creation. In the layout stage, PhiP-G employs a physical pool with adhesion capabilities and a visual supervision agent, forming a world model for layout prediction and planning. Extensive experiments demonstrate that PhiP-G significantly enhances the generation quality and physical rationality of compositional scenes. Notably, PhiP-G attains state-of-the-art (SOTA) performance in CLIP scores, achieves parity with the leading methods in generation quality as measured by T$^3$Bench, and improves efficiency by 24x.
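A hedged sketch of the scene-graph-to-layout idea only: assets and pairwise relations form a graph, and a simple physics-style pass snaps each asset onto its support surface so "on" relations respect gravity. Class names, relation vocabulary, and sizes are illustrative assumptions, not PhiP-G's API.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    size: tuple                         # (x, y, z) extents
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])

def apply_on_relations(assets, relations):
    """For every (a, 'on', b) relation, rest a's bottom face on b's top face."""
    by_name = {a.name: a for a in assets}
    for subj, rel, obj in relations:
        if rel == "on":
            top, base = by_name[subj], by_name[obj]
            top.position[0], top.position[1] = base.position[0], base.position[1]
            top.position[2] = base.position[2] + base.size[2] / 2 + top.size[2] / 2
    return assets

scene = [Asset("table", (1.2, 0.8, 0.75), [0, 0, 0.375]), Asset("vase", (0.1, 0.1, 0.3))]
apply_on_relations(scene, [("vase", "on", "table")])
print(scene[1].position)   # vase centre now sits on the table top: [0.0, 0.0, 0.9]
```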


Hybrid Physics-ML Modeling for Marine Vehicle Maneuvering Motions in the Presence of Environmental Disturbances

arXiv.org Artificial Intelligence

A hybrid physics-machine learning modeling framework is proposed for surface vehicle maneuvering motions to address modeling capability and stability in the presence of environmental disturbances. From a deep learning perspective, the framework is based on a variant of residual networks with additional feature extraction. Initially, an imperfect physical model is derived and identified to capture the fundamental hydrodynamic characteristics of marine vehicles. This model is then integrated with a feedforward network through a residual block. Additionally, feature extraction based on trigonometric transformations is employed in the machine learning component to account for the periodic influence of currents and waves. The proposed method is evaluated using real navigational data from the 'JH7500' unmanned surface vehicle. The results demonstrate the robust generalizability and accurate long-term prediction capabilities of the nonlinear dynamic model under the considered environmental conditions. This approach has the potential to be extended and applied to develop a comprehensive high-fidelity simulator.
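A minimal PyTorch sketch of the residual "physics plus learned correction" structure described above: an imperfect physical model predicts the next state, a feedforward network corrects it, and sine/cosine features are appended to capture periodic wave and current effects. The placeholder physics function and the dimensions are assumptions, not the paper's identified model.

```python
import torch
import torch.nn as nn

def physics_model(state, control):
    """Placeholder first-order surrogate; a real implementation would use
    identified hydrodynamic coefficients."""
    return state + 0.1 * control.mean(dim=-1, keepdim=True).expand_as(state)

class HybridResidualModel(nn.Module):
    def __init__(self, state_dim=3, control_dim=2, hidden=64):
        super().__init__()
        # inputs: state, control, plus sin/cos of heading (first state component)
        self.net = nn.Sequential(
            nn.Linear(state_dim + control_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )
    def forward(self, state, control):
        heading = state[:, :1]
        trig = torch.cat([torch.sin(heading), torch.cos(heading)], dim=-1)
        features = torch.cat([state, control, trig], dim=-1)
        # residual connection: physics prediction + learned correction
        return physics_model(state, control) + self.net(features)

model = HybridResidualModel()
next_state = model(torch.randn(8, 3), torch.randn(8, 2))
print(next_state.shape)   # torch.Size([8, 3])
```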


Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario

arXiv.org Artificial Intelligence

Visual question answering (VQA) is a fundamental and essential AI task, and VQA-based disaster scenario understanding is an active research topic. For instance, we can ask a VQA model questions about a disaster image, and the answers can help identify whether anyone or anything is affected by the disaster. However, previous VQA models for disaster damage assessment have shortcomings such as a limited candidate answer space, monotonous question types, and the limited answering capability of existing models. In this paper, we propose a zero-shot VQA model named Zero-shot VQA for Flood Disaster Damage Assessment (ZFDDA). It is a VQA model for damage assessment that requires no pre-training. With flood disaster as the main research object, we also build a Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA) to evaluate our VQA model. This new dataset expands the question types to include free-form, multiple-choice, and yes-no questions. At the same time, we expand the size of the previous dataset to a total of 2,058 images and 22,422 question-meta ground truth pairs. Most importantly, our model uses well-designed chain-of-thought (CoT) demonstrations to unlock the potential of the large language model, allowing zero-shot VQA to perform better in disaster scenarios. The experimental results show that accuracy in answering complex questions is greatly improved with CoT prompts. Our study provides a basis for subsequent research on VQA for other disaster scenarios.
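A hedged sketch of the chain-of-thought prompting idea: a hand-written demonstration that walks through flood-damage reasoning is prepended to a new question before it is sent to a large language model. The demonstration text and the image-description step are illustrative assumptions, not ZFDDA's exact prompts.

```python
COT_DEMONSTRATION = (
    "Image description: A street is covered by brown water up to car windows; "
    "two people stand on a rooftop.\n"
    "Question: Is anyone affected by the flood?\n"
    "Reasoning: The water reaches car windows, so the street is impassable. "
    "People standing on a rooftop suggests they cannot leave the house safely. "
    "Therefore people are affected.\n"
    "Answer: yes\n"
)

def build_cot_prompt(image_description, question):
    """Prepend the CoT demonstration to a new image description and question."""
    return (
        COT_DEMONSTRATION
        + "\nImage description: " + image_description
        + "\nQuestion: " + question
        + "\nReasoning:"
    )

print(build_cot_prompt(
    "A flooded field with a partially submerged tractor.",
    "Is any building damaged?",
))
```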


From Knowing to Doing: Learning Diverse Motor Skills through Instruction Learning

arXiv.org Artificial Intelligence

Recent years have witnessed many successful trials in the robot learning field. For contact-rich robotic tasks, it is challenging to learn coordinated motor skills by reinforcement learning. Imitation learning solves this problem by using a mimic reward to encourage the robot to track a given reference trajectory. However, imitation learning is not very efficient and may constrain the learned motion. In this paper, we propose instruction learning, which is inspired by the human learning process and is highly efficient, flexible, and versatile for robot motion learning. Instead of using a reference signal in the reward, instruction learning applies the reference signal directly as a feedforward action, which is combined with a feedback action learned by reinforcement learning to control the robot. In addition, we propose an action bounding technique and remove the mimic reward, which is shown to be crucial for efficient and flexible learning. We compare instruction learning with imitation learning and show that instruction learning can greatly speed up the training process and guarantee that the desired motion is learned correctly. The effectiveness of instruction learning is validated through a range of motion learning examples for a biped robot and a quadruped robot, where skills can typically be learned within several million steps. We also conduct sim-to-real transfer and online learning experiments on a real quadruped robot. Instruction learning has shown great merits and potential, making it a promising alternative to imitation learning.
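A minimal sketch of the action composition described above: the reference signal is applied directly as a feedforward action, the RL policy contributes a feedback correction, and the correction is bounded before being added. Bound values, joint limits, and dimensions are assumptions for illustration.

```python
import numpy as np

def compose_action(reference_action, policy_output, feedback_bound=0.3,
                   joint_low=-1.0, joint_high=1.0):
    """action = reference (feedforward) + bounded policy correction (feedback)."""
    feedback = np.clip(policy_output, -feedback_bound, feedback_bound)  # action bounding
    action = reference_action + feedback
    return np.clip(action, joint_low, joint_high)                       # respect joint limits

ref = np.array([0.2, -0.5, 0.8])      # e.g. one frame of a reference trajectory
raw = np.array([0.6, 0.1, -0.05])     # unbounded policy output
print(compose_action(ref, raw))       # -> [0.5, -0.4, 0.75]
```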


A Data Driven Method for Multi-step Prediction of Ship Roll Motion in High Sea States

arXiv.org Artificial Intelligence

Ship roll motion in high sea states has large amplitudes and nonlinear dynamics, and its prediction is significant for operability, safety, and survivability. This paper presents a novel data-driven methodology to provide multi-step prediction of ship roll motions in high sea states. A hybrid neural network is proposed that combines a long short-term memory (LSTM) network and a convolutional neural network (CNN) in parallel. The motivation is to extract the nonlinear dynamic characteristics and the hydrodynamic memory information through the respective advantages of the CNN and the LSTM. For feature selection, the time histories of motion states and wave heights are selected to provide sufficient information. Taking a scaled KCS as the study object, the ship motions in sea state 7 irregular long-crested waves are simulated and used for validation. The results show that at least one period of roll motion can be accurately predicted. Compared with single LSTM and CNN methods, the proposed method performs better in predicting the amplitude of roll angles. The comparison also demonstrates that selecting motion states and wave heights as the feature space improves prediction accuracy, verifying the effectiveness of the proposed method.
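A hedged PyTorch sketch of the parallel LSTM plus CNN structure described above: both branches read the same window of motion states and wave heights, their features are concatenated, and a linear head outputs a multi-step roll prediction. Layer sizes, window length, and horizon are assumptions.

```python
import torch
import torch.nn as nn

class ParallelLSTMCNN(nn.Module):
    def __init__(self, n_features=5, hidden=64, horizon=10):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)      # hydrodynamic memory effects
        self.cnn = nn.Sequential(                                       # local nonlinear patterns
            nn.Conv1d(n_features, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(2 * hidden, horizon)                      # multi-step roll output
    def forward(self, x):                      # x: (batch, time, features)
        lstm_out, _ = self.lstm(x)
        lstm_feat = lstm_out[:, -1]                        # last LSTM hidden state
        cnn_feat = self.cnn(x.transpose(1, 2)).squeeze(-1) # pooled CNN features
        return self.head(torch.cat([lstm_feat, cnn_feat], dim=-1))

model = ParallelLSTMCNN()
pred = model(torch.randn(4, 50, 5))    # 50 past steps of motion states + wave height
print(pred.shape)                      # torch.Size([4, 10]) future roll angles
```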


Learning Bifunctional Push-grasping Synergistic Strategy for Goal-agnostic and Goal-oriented Tasks

arXiv.org Artificial Intelligence

Both goal-agnostic and goal-oriented tasks have practical value for robotic grasping: goal-agnostic tasks target all objects in the workspace, while goal-oriented tasks aim at grasping pre-assigned goal objects. However, most current grasping methods handle only one of the two tasks well. In this work, we propose a bifunctional push-grasping synergistic strategy for goal-agnostic and goal-oriented grasping tasks. Our method integrates pushing with grasping to pick up all objects or pre-assigned goal objects with high action efficiency, depending on the task requirement. We introduce a bifunctional network, which takes in visual observations and outputs dense pixel-wise maps of Q values for pushing and grasping primitive actions, to increase the available samples in the action space. We then propose a hierarchical reinforcement learning framework to coordinate the two tasks by treating the goal-agnostic task as a combination of multiple goal-oriented tasks. To reduce the training difficulty of the hierarchical framework, we design a two-stage training method that trains the two types of tasks separately. We pre-train the model in simulation and then transfer the learned model to the real world without any additional real-world fine-tuning. Experimental results show that the proposed approach outperforms existing methods in task completion rate and grasp success rate with fewer motions. Supplementary material is available at https://github.com/DafaRen/Learning_Bifunctional_Push-grasping_Synergistic_Strategy_for_Goal-agnostic_and_Goal-oriented_Tasks
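A hedged sketch of the bifunctional-network idea: a shared fully convolutional encoder maps the visual observation to two dense pixel-wise Q-value maps, one for pushing and one for grasping, and the primitive with the highest Q value is executed at the corresponding pixel. The architecture details are illustrative, not the paper's exact network.

```python
import torch
import torch.nn as nn

class BifunctionalQNet(nn.Module):
    def __init__(self, in_channels=4):            # e.g. color + depth heightmap channels
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.push_head = nn.Conv2d(64, 1, 1)       # dense Q map for pushing
        self.grasp_head = nn.Conv2d(64, 1, 1)      # dense Q map for grasping
    def forward(self, obs):                        # obs: (B, C, H, W)
        h = self.encoder(obs)
        return self.push_head(h), self.grasp_head(h)

def select_primitive(q_push, q_grasp):
    """Pick the primitive and pixel with the highest Q value."""
    best_push, best_grasp = q_push.flatten().max(), q_grasp.flatten().max()
    q_map = q_grasp if best_grasp >= best_push else q_push
    idx = torch.argmax(q_map.flatten())
    _, _, H, W = q_map.shape
    return ("grasp" if best_grasp >= best_push else "push"), (idx // W, idx % W)

net = BifunctionalQNet()
q_push, q_grasp = net(torch.randn(1, 4, 64, 64))
print(select_primitive(q_push, q_grasp))   # e.g. ('grasp', (row, col))
```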