Goto

Collaborating Authors

 Yan, Feng


DataPlatter: Boosting Robotic Manipulation Generalization with Minimal Costly Data

arXiv.org Artificial Intelligence

The growing adoption of Vision-Language-Action (VLA) models in embodied AI intensifies the demand for diverse manipulation demonstrations. However, the high cost of data collection often results in insufficient coverage across scenarios, which limits model performance. It is observed that the spatial reasoning phase (SRP) in large workspaces dominates the failure cases. Fortunately, this data can be collected at low cost, underscoring the potential of leveraging inexpensive data to improve model performance. In this paper, we introduce the DataPlatter method, a framework that decouples training trajectories into distinct task stages and leverages abundant, easily collectible SRP data to enhance VLA models' generalization. Through analysis, we demonstrate that sub-task-specific training with an appropriate proportion of additional SRP data can act as a performance catalyst for robot manipulation, maximizing the utilization of costly physical interaction phase (PIP) data. Experiments show that introducing a large proportion of cost-effective SRP trajectories into a limited set of PIP data yields a maximum improvement of 41% in success rate in zero-shot scenes, while retaining the ability to transfer manipulation skills to novel targets.
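
To make the stage decoupling concrete, the following minimal Python sketch splits each demonstration into SRP and PIP segments and mixes them at a chosen SRP:PIP ratio; the segmentation rule, names, and ratio value are illustrative assumptions rather than the paper's implementation.

# Minimal sketch of the stage-decoupled data mixing idea described above.
# Segment boundaries, names, and the ratio value are illustrative
# assumptions, not the authors' implementation.
import random

def split_into_stages(trajectory, contact_step):
    """Split one demonstration at the first gripper-object contact:
    everything before it is the spatial reasoning phase (SRP),
    the rest is the physical interaction phase (PIP)."""
    return trajectory[:contact_step], trajectory[contact_step:]

def build_training_set(pip_segments, srp_segments, srp_ratio=4.0, seed=0):
    """Combine a limited PIP pool with cheap SRP segments at a chosen
    SRP:PIP ratio (e.g. 4 SRP segments per PIP segment)."""
    rng = random.Random(seed)
    n_srp = int(len(pip_segments) * srp_ratio)
    sampled_srp = [rng.choice(srp_segments) for _ in range(n_srp)]
    mixed = list(pip_segments) + sampled_srp
    rng.shuffle(mixed)
    return mixed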


P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

arXiv.org Artificial Intelligence

In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents must possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we introduce \textbf{P3Nav}, a unified framework that integrates \textbf{P}erception, \textbf{P}lanning, and \textbf{P}rediction capabilities through \textbf{Multitask Collaboration} on navigation and embodied question answering (EQA) tasks, thereby enhancing navigation performance. Furthermore, P3Nav employs an \textbf{Adaptive 3D-aware History Sampling} strategy to use historical observations effectively and efficiently. By leveraging large language models (LLMs), P3Nav comprehends diverse commands and complex visual scenes, producing appropriate navigation actions. P3Nav achieves a 75\% success rate in object goal navigation on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state of the art.
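
As an illustration of how redundant history from revisited areas can be pruned, here is a hedged Python sketch of a 3D-aware sampling step: agent positions are voxelized and at most one (the most recent) observation per voxel is kept. The voxel size, frame budget, and data layout are assumptions, not P3Nav's actual strategy.

# Hedged sketch of 3D-aware history sampling in the spirit of the strategy
# described above: keep at most one (the most recent) observation per voxel
# of visited space so revisited areas do not flood the history.
import numpy as np

def sample_history(positions, voxel_size=0.5, max_frames=32):
    """positions: (N, 3) agent positions, one per past observation.
    Returns indices of the observations kept for the model's context."""
    keep = {}
    for i, pos in enumerate(np.asarray(positions, dtype=float)):
        voxel = tuple(np.floor(pos / voxel_size).astype(int))
        keep[voxel] = i                       # later visits overwrite earlier ones
    kept = sorted(keep.values())
    if len(kept) > max_frames:                # uniformly subsample if still too long
        idx = np.linspace(0, len(kept) - 1, max_frames).astype(int)
        kept = [kept[i] for i in idx]
    return kept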


Agentic Verification for Ambiguous Query Disambiguation

arXiv.org Artificial Intelligence

In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) into diverse yet answerable interpretations. State-of-the-art methods follow a Diversify-then-Verify (DtV) pipeline, where diverse interpretations are generated by an LLM and later used as search queries to retrieve supporting passages. Such a process may introduce noise in either the interpretations or the retrieval, particularly in enterprise settings, where LLMs -- trained on static data -- may struggle with domain-specific disambiguation. Thus, a post-hoc verification phase is introduced to prune the noise. Our distinction is to unify diversification with verification by incorporating feedback from the retriever and generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark to achieve diverse yet verifiable interpretations. Empirical results show that VERDICT improves the grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.
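
A minimal sketch of folding verification into diversification, assuming placeholder callables for the LLM proposer, the retriever, and the grounded generator (these are not the paper's API): each candidate interpretation is kept only if retrieval plus generation can ground it, in a single pass rather than a post-hoc pruning stage.

# Minimal sketch of unifying diversification with verification, as opposed
# to a separate post-hoc verification pass. `propose_interpretations`,
# `retrieve`, and `answer_with_grounding` are placeholder callables.
def verified_diversification(query, propose_interpretations, retrieve,
                             answer_with_grounding, k=5):
    interpretations = propose_interpretations(query, k)   # LLM proposes k readings
    consolidated = []
    for interp in interpretations:
        passages = retrieve(interp)                        # retriever feedback
        answer, grounded = answer_with_grounding(interp, passages)  # generator feedback
        if grounded:                                       # keep only verifiable readings
            consolidated.append({"interpretation": interp,
                                 "passages": passages,
                                 "answer": answer})
    return consolidated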


A survey on FPGA-based accelerator for ML models

arXiv.org Artificial Intelligence

This paper thoroughly surveys the acceleration of machine learning (ML) algorithms on hardware accelerators, focusing on Field-Programmable Gate Arrays (FPGAs). It reviews 287 out of 1138 papers from the past six years, sourced from four top FPGA conferences. This selection underscores the increasing integration of ML and FPGA technologies and their mutual importance in technological advancement. The surveyed research clearly emphasises inference acceleration (81\%) over training acceleration (13\%). Additionally, the findings reveal that CNNs dominate current FPGA acceleration research, while emerging models such as GNNs show clear growth trends. The categorization of the FPGA research papers reveals a wide range of topics, demonstrating the growing relevance of ML in FPGA research. This comprehensive analysis provides valuable insights into the current trends and future directions of FPGA research in the context of ML applications.


DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

arXiv.org Artificial Intelligence

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting overall capability and the ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world. Project page with code: https://github.com/zhijian11/DriveMM.


RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation

arXiv.org Artificial Intelligence

In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboMM, along with the comprehensive dataset RoboData. RoboMM enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData offers a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions; its space alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with RoboData and the unified physical space, RoboMM is a generalist policy that enables simultaneous evaluation across all tasks within multiple datasets, rather than focusing on a limited selection of data or tasks. Its design significantly enhances robotic manipulation performance, increasing the average sequence length on the CALVIN benchmark from 1.7 to 3.3, ensuring cross-embodiment capabilities, and achieving state-of-the-art results across multiple datasets.
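
To illustrate the modality-isolation idea, the sketch below builds a boolean attention mask in which tokens attend only within their own modality, with cross-modality exchange routed through shared fusion tokens; the exact masking rule in RoboMM may differ, so treat this as an assumption-laden example.

# Hedged sketch of a modality-isolation attention mask: tokens attend freely
# within their own modality, and cross-modality attention is only allowed
# through a small set of shared fusion tokens.
import torch

def modality_isolation_mask(modality_ids, fusion_id=0):
    """modality_ids: (seq_len,) int tensor, one id per token; tokens with
    `fusion_id` act as shared tokens visible to every modality.
    Returns a (seq_len, seq_len) boolean mask, True = attention allowed."""
    ids = modality_ids.unsqueeze(0)                    # (1, seq_len)
    same_modality = ids == ids.t()                     # (seq_len, seq_len)
    is_fusion = modality_ids == fusion_id
    through_fusion = is_fusion.unsqueeze(0) | is_fusion.unsqueeze(1)
    return same_modality | through_fusion

# Example: fusion token, two image tokens, three action tokens.
# mask = modality_isolation_mask(torch.tensor([0, 1, 1, 2, 2, 2]))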


RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

arXiv.org Artificial Intelligence

Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. https://github.com/liufanfanlff/RoboUniview
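
The following hedged PyTorch sketch shows one way a unified, camera-independent representation can be formed: a fixed 3D voxel grid is projected into every view with that view's intrinsics and extrinsics, per-view features are sampled, and the results are averaged. The grid layout, sampling, and fusion here are simplifying assumptions, not RoboUniView's exact architecture.

# Hedged sketch of building a camera-agnostic "unified view" feature volume.
import torch
import torch.nn.functional as F

def unified_view_features(feat_maps, intrinsics, extrinsics, grid_xyz):
    """feat_maps: (V, C, H, W) per-view features; intrinsics: (V, 3, 3);
    extrinsics: (V, 4, 4) world-to-camera; grid_xyz: (N, 3) voxel centers.
    Returns (N, C) fused, camera-independent features."""
    V, C, H, W = feat_maps.shape
    ones = torch.ones(grid_xyz.shape[0], 1)
    pts_h = torch.cat([grid_xyz, ones], dim=1)                 # (N, 4) homogeneous
    fused = torch.zeros(grid_xyz.shape[0], C)
    weight = torch.zeros(grid_xyz.shape[0], 1)
    for v in range(V):
        cam = (extrinsics[v] @ pts_h.t()).t()[:, :3]           # world -> camera
        valid = cam[:, 2] > 0.1                                # in front of camera
        pix = (intrinsics[v] @ cam.t()).t()
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)         # perspective divide
        u = 2 * pix[:, 0] / (W - 1) - 1                        # normalize to [-1, 1]
        w = 2 * pix[:, 1] / (H - 1) - 1
        grid = torch.stack([u, w], dim=-1).view(1, -1, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)
        sampled = sampled.view(C, -1).t()                      # (N, C)
        mask = valid.float().unsqueeze(1)
        fused += sampled * mask
        weight += mask
    return fused / weight.clamp(min=1.0)                       # average over visible views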


RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios

arXiv.org Artificial Intelligence

Foundation models hold significant potential for enabling robots to perform long-horizon general manipulation tasks. However, the simplicity of tasks and the uniformity of environments in existing benchmarks restrict their effective deployment in complex scenarios. To address this limitation, this paper introduces the \textit{RoboCAS} benchmark, the first benchmark specifically designed for complex object arrangement scenarios in robotic manipulation. The benchmark employs flexible and concise scripted policies to efficiently collect a diverse array of demonstrations, showcasing scattered, orderly, and stacked object arrangements within a highly realistic physical simulation environment. It includes complex processes such as target retrieval, obstacle clearance, and robot manipulation, testing agents' abilities to perform long-horizon planning for spatial reasoning and to predict chain reactions under ambiguous instructions. Extensive experiments on multiple baseline models reveal their limitations in managing complex object arrangement scenarios, underscoring the urgent need for intelligent agents capable of performing long-horizon operations in practical deployments and providing valuable insights for future research directions. Project website: \url{https://github.com/notFoundThisPerson/RoboCAS-v0}.


Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design

arXiv.org Artificial Intelligence

Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely around the interests of algorithm developers, with robot actions executed on a discrete, frame-by-frame basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control, and data communication in the embodied AI robot compute pipeline. Instead of predicting the action for a single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with hardware that accelerates the transformation of trajectories into the torque signals used to control robots, and with an execution pipeline that overlaps data communication with computation. Corki reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speedup. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. https://github.com/hyy0613/Corki
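
The core scheduling idea (trajectory-level prediction instead of per-frame inference) can be sketched as a simple control loop that queries the large model only every few steps and lets a fast tracker follow the intermediate waypoints; the interfaces below are placeholders, not Corki's implementation.

# Hedged sketch of trajectory-level control: the large model is queried only
# when the current short trajectory is exhausted, while a fast low-level loop
# tracks the waypoints in between. `observe`, `predict_trajectory`, and
# `track_waypoint` are placeholder callables.
def control_loop(observe, predict_trajectory, track_waypoint,
                 total_steps=1000, horizon=8):
    llm_calls = 0
    trajectory, cursor = [], 0
    for step in range(total_steps):
        if cursor >= len(trajectory):              # trajectory exhausted: re-plan
            obs = observe()
            trajectory = predict_trajectory(obs, horizon)   # LLM inference (slow)
            cursor = 0
            llm_calls += 1
        track_waypoint(trajectory[cursor])         # hardware-accelerated tracking (fast)
        cursor += 1
    return llm_calls                               # roughly total_steps / horizon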


A New Method in Facial Registration in Clinics Based on Structure Light Images

arXiv.org Artificial Intelligence

Background and Objective: In neurosurgery, fusing clinical images with depth images can add information and detail that benefit surgery. We found that registration of facial depth images frequently failed with existing methods. To augment traditional image-based methods with depth information, we investigated a method for registering depth images with traditional clinical images. Methods: We used the dlib library, a C++ library that can be used for face recognition, to detect facial key points in both the structured-light camera data and the CT image. The two key-point clouds were coarsely registered using the ICP method, and fine registration was then performed after the coarse step, also using ICP. Results: The RMSE after coarse and fine registration is as low as 0.995913 mm. Compared with traditional methods, the approach also takes less time. Conclusions: The new method successfully registered the facial depth image from structured-light images and CT with low error, and it is promising and efficient for clinical application in neurosurgery.
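
A minimal coarse-to-fine sketch of the registration step using Open3D's ICP, assuming the dlib key points have already been lifted to 3D arrays; the correspondence thresholds and the choice to refine on the same key-point clouds are illustrative assumptions, not the paper's exact pipeline.

# Hedged sketch of the coarse-to-fine ICP registration described above.
import numpy as np
import open3d as o3d

def to_cloud(points_xyz):
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.asarray(points_xyz, dtype=np.float64))
    return pcd

def coarse_to_fine_icp(camera_keypoints, ct_keypoints):
    """camera_keypoints, ct_keypoints: (N, 3) facial key points in mm.
    Returns the refined transform and its inlier RMSE."""
    src, dst = to_cloud(camera_keypoints), to_cloud(ct_keypoints)
    p2p = o3d.pipelines.registration.TransformationEstimationPointToPoint()
    coarse = o3d.pipelines.registration.registration_icp(
        src, dst, max_correspondence_distance=20.0, init=np.eye(4),
        estimation_method=p2p)
    fine = o3d.pipelines.registration.registration_icp(
        src, dst, max_correspondence_distance=2.0,
        init=coarse.transformation, estimation_method=p2p)
    return fine.transformation, fine.inlier_rmse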