Zhao, Mengdi
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Liu, Jiaming, Chen, Hao, An, Pengju, Liu, Zhuoyang, Zhang, Renrui, Gu, Chenyang, Li, Xiaoqi, Guo, Ziyu, Chen, Sixiang, Liu, Mengzhen, Hou, Chengkai, Zhao, Mengdi, Zhou, KC alex, Heng, Pheng-Ann, Zhang, Shanghang
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
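Below is a minimal Python sketch of one way the collaborative action ensemble described above could be realized: a confidence-gated fusion of the diffusion-head action with the action decoded from autoregressive tokens. The function name, the 7-DoF action layout, the confidence threshold, and the averaging rule are illustrative assumptions, not HybridVLA's exact mechanism.

# Minimal sketch of a confidence-gated action ensemble, assuming the model exposes
# (a) a continuous action from the diffusion head and (b) an action decoded from
# autoregressive tokens together with their mean token probability. The gating
# threshold and averaging rule are illustrative assumptions.
import numpy as np

def ensemble_action(diffusion_action: np.ndarray,
                    ar_action: np.ndarray,
                    ar_token_confidence: float,
                    conf_threshold: float = 0.9) -> np.ndarray:
    """Fuse two 7-DoF action predictions (xyz, rpy, gripper)."""
    if ar_token_confidence < conf_threshold:
        # Autoregressive decoding looks uncertain: fall back to the diffusion head.
        return diffusion_action
    # Both predictions look reliable: average the continuous components and
    # keep the discrete gripper command from the autoregressive branch.
    fused = 0.5 * (diffusion_action + ar_action)
    fused[-1] = ar_action[-1]
    return fused

# Hypothetical usage with dummy predictions.
a_diff = np.array([0.12, -0.03, 0.25, 0.0, 0.1, 0.0, 1.0])
a_ar = np.array([0.10, -0.02, 0.24, 0.0, 0.1, 0.0, 1.0])
print(ensemble_action(a_diff, a_ar, ar_token_confidence=0.95))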
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Ji, Yuheng, Tan, Huajie, Shi, Jiayu, Hao, Xiaoshuai, Zhang, Yuan, Zhang, Hengyuan, Wang, Pengwei, Zhao, Mengdi, Mu, Yao, An, Pengju, Xue, Xinda, Su, Qinghang, Lyu, Huaihai, Zheng, Xiaolong, Liu, Jiaming, Wang, Zhongyuan, Zhang, Shanghang
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise because current MLLMs lack three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we develop RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
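As a concrete illustration of the multi-dimensional labels described above, the Python sketch below defines a hypothetical ShareRobot-style sample with planning, affordance, and trajectory fields. The class name, field names, and types are assumptions made for illustration, not the dataset's actual schema.

# Minimal sketch of what a ShareRobot-style training sample might look like,
# covering the three labeled dimensions the abstract describes (task planning,
# object affordance, end-effector trajectory). Field names and types are
# illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ShareRobotSample:
    instruction: str                      # high-level manipulation instruction
    image_path: str                       # observation frame for this step
    plan: List[str] = field(default_factory=list)                     # decomposed sub-tasks
    affordance_box: Tuple[float, float, float, float] = (0, 0, 0, 0)  # x1, y1, x2, y2
    trajectory: List[Tuple[float, float]] = field(default_factory=list)  # 2D waypoints

sample = ShareRobotSample(
    instruction="put the mug on the shelf",
    image_path="frames/episode_0001/000.png",
    plan=["locate the mug", "grasp the mug handle", "place the mug on the shelf"],
    affordance_box=(120.0, 88.0, 180.0, 150.0),
    trajectory=[(130.0, 95.0), (160.0, 60.0), (210.0, 40.0)],
)
print(sample.plan)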
ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model
Gu, Shuhao, Zhao, Mengdi, Zhang, Bowen, Wang, Liangdong, Li, Jijie, Liu, Guang
The tokenizer is an essential component of large language models (LLMs), and a tokenizer with a high compression rate can improve a model's representation and processing efficiency. However, a tokenizer cannot guarantee a high compression rate in all scenarios, and an increase in the average input and output lengths increases the training and inference costs of the model. It is therefore crucial to improve the model's efficiency at minimal cost while maintaining its performance. In this work, we propose a method that improves model representation and processing efficiency by replacing the tokenizers of LLMs. We replace the tokenizer, reinitialize the parameters of the model's input and output layers from the parameters of the original model, and train only these layers while keeping all other parameters fixed. We conducted experiments on different LLMs, and the results show that our method maintains model performance after the tokenizer is replaced while significantly improving decoding speed for long texts.
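The Python sketch below shows a plausible freeze-and-retrain setup in the spirit of the method described above, using the Hugging Face transformers API: swap in a new tokenizer, resize the input and output embedding layers, and mark only those layers as trainable. The checkpoint names are placeholders, and the exact way ReTok derives the new embedding parameters from the original model is not reproduced here.

# Minimal sketch: replace the tokenizer, resize the input/output embeddings,
# and train only those layers while the rest of the model stays frozen.
# Checkpoint names are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

old_model_name = "your-org/base-llm"            # hypothetical base checkpoint
new_tokenizer_name = "your-org/new-tokenizer"   # hypothetical high-compression tokenizer

model = AutoModelForCausalLM.from_pretrained(old_model_name)
new_tokenizer = AutoTokenizer.from_pretrained(new_tokenizer_name)

# Resize the embedding matrix and LM head to the new vocabulary size.
model.resize_token_embeddings(len(new_tokenizer))

# Freeze everything, then unfreeze only the input and output embedding layers.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")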