AITopics | Zhang, Zhizheng

Collaborating Authors

Zhang, Zhizheng

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FetchBot: Object Fetching in Cluttered Shelves via Zero-Shot Sim2Real

Liu, Weiheng, Wan, Yuxuan, Wang, Jilong, Kuang, Yuxuan, Shi, Xuesong, Li, Haoran, Zhao, Dongbin, Zhang, Zhizheng, Wang, He

arXiv.org Artificial IntelligenceFeb-25-2025

Object fetching from cluttered shelves is an important capability for robots to assist humans in real-world scenarios. Achieving this task demands robotic behaviors that prioritize safety by minimizing disturbances to surrounding objects, an essential but highly challenging requirement due to restricted motion space, limited fields of view, and complex object dynamics. In this paper, we introduce FetchBot, a sim-to-real framework designed to enable zero-shot generalizable and safety-aware object fetching from cluttered shelves in real-world settings. To address data scarcity, we propose an efficient voxel-based method for generating diverse simulated cluttered shelf scenes at scale and train a dynamics-aware reinforcement learning (RL) policy to generate object fetching trajectories within these scenes. This RL policy, which leverages oracle information, is subsequently distilled into a vision-based policy for real-world deployment. Considering that sim-to-real discrepancies stem from texture variations mostly while from geometric dimensions rarely, we propose to adopt depth information estimated by full-fledged depth foundation models as the input for the vision-based policy to mitigate sim-to-real gap. To tackle the challenge of limited views, we design a novel architecture for learning multi-view representations, allowing for comprehensive encoding of cluttered shelf scenes. This enables FetchBot to effectively minimize collisions while fetching objects from varying positions and depths, ensuring robust and safety-aware operation. Both simulation and real-robot experiments demonstrate FetchBot's superior generalization ability, particularly in handling a broad range of real-world scenarios, includ

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.17894

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Robots > Manipulation (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Qi, Zekun, Zhang, Wenyao, Ding, Yufei, Dong, Runpei, Yu, Xinqiang, Li, Jingwen, Xu, Lingyun, Li, Baoyu, He, Xialin, Fan, Guofan, Zhang, Jiazhao, He, Jiawei, Gu, Jiayuan, Jin, Xin, Ma, Kaisheng, Zhang, Zhizheng, Wang, He, Yi, Li

arXiv.org Artificial IntelligenceFeb-18-2025

Spatial intelligence is a critical component of embodied AI, promoting robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations-a key requirement for tasks involving fine-grained manipulations. Addressing this limitation not only requires geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the ''plug-in'' direction of a USB or the ''handle'' direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.

large language model, machine learning, orientation, (23 more...)

arXiv.org Artificial Intelligence

2502.13143

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.67)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (0.67)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.45)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(5 more...)

Add feedback

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Zhang, Jiazhao, Wang, Kunyu, Wang, Shaoan, Li, Minghan, Liu, Haoran, Wei, Songlin, Wang, Zhongyuan, Zhang, Zhizheng, Wang, He

arXiv.org Artificial IntelligenceDec-9-2024

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.06224

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Zhou, Enshen, Su, Qi, Chi, Cheng, Zhang, Zhizheng, Wang, Zhongyuan, Huang, Tiejun, Sheng, Lu, Wang, He

arXiv.org Artificial IntelligenceDec-9-2024

Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the vision-language model (VLM) for both open-set reactive and proactive failure detection. The core of our method is to formulate both tasks as a unified set of spatio-temporal constraint satisfaction problems and use VLM-generated code to evaluate them for real-time monitoring. To enhance the accuracy and efficiency of monitoring, we further introduce constraint elements that abstract constraint-related entities or their parts into compact geometric elements. This approach offers greater generality, simplifies tracking, and facilitates constraint-aware visual programming by leveraging these elements as visual prompts. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances compared to baselines across three simulators and a real-world setting. Moreover, CaM can be integrated with open-loop control policies to form closed-loop systems, enabling long-horizon tasks in cluttered scenes with dynamic environments.

constraint, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.04455

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports > Golf (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (1.00)
(2 more...)

Add feedback

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Zhang, Jiazhao, Wang, Kunyu, Xu, Rongtao, Zhou, Gengze, Hong, Yicong, Fang, Xiaomeng, Wu, Qi, Zhang, Zhizheng, Wang, He

arXiv.org Artificial IntelligenceJun-30-2024

Vision-and-language navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavor to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometers, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera equipped on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noises, and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision making and instruction following. We train NaVid with 510k navigation samples collected from continuous environments, including action-planning and instruction-reasoning samples, along with 763k large-scale web data. Extensive experiments show that NaVid achieves state-of-the-art performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2402.15852

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games > Computer Games (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
(2 more...)

Add feedback

A General Theory for Compositional Generalization

Fu, Jingwen, Zhang, Zhizheng, Lu, Yan, Zheng, Nanning

arXiv.org Artificial IntelligenceMay-19-2024

Compositional Generalization (CG) embodies the ability to comprehend novel combinations of familiar concepts, representing a significant cognitive leap in human intellectual advancement. Despite its critical importance, the deep neural network (DNN) faces challenges in addressing the compositional generalization problem, prompting considerable research interest. However, existing theories often rely on task-specific assumptions, constraining the comprehensive understanding of CG. This study aims to explore compositional generalization from a task-agnostic perspective, offering a complementary viewpoint to task-specific analyses. The primary challenge is to define CG without overly restricting its scope, a feat achieved by identifying its fundamental characteristics and basing the definition on them. Using this definition, we seek to answer the question "what does the ultimate solution to CG look like?" through the following theoretical findings: 1) the first No Free Lunch theorem in CG, indicating the absence of general solutions; 2) a novel generalization bound applicable to any CG problem, specifying the conditions for an effective CG solution; and 3) the introduction of the generative effect to enhance understanding of CG problems and their solutions. This paper's significance lies in providing a general theory for CG problems, which, when combined with prior theorems under task-specific scenarios, can lead to a comprehensive understanding of CG.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2405.11743

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators

Zhang, Zhizheng, Zhang, Xiaoyi, Xie, Wenxuan, Lu, Yan

arXiv.org Artificial IntelligenceDec-4-2023

They have shown a promising prospect in automatically completing tasks upon user instructions, functioning as brain-like coordinators. The associated risks will be revealed as we delegate an increasing number of tasks to machines for automated completion. A big question emerges: how can we make machines behave responsibly when helping humans automate tasks as personal copilots? In this paper, we explore this question in depth from the perspectives of feasibility, completeness and security. In specific, we present Responsible Task Automation (ResponsibleTA) as a fundamental framework to facilitate responsible collaboration between LLM-based coordinators and executors for task automation with three empowered capabilities: 1) predicting the feasibility of the commands for executors; 2) verifying the completeness of executors; 3) enhancing the security (e.g., the protection of users' privacy). We further propose and compare two paradigms for implementing the first two capabilities. One is to leverage the generic knowledge of LLMs themselves via prompt engineering while the other is to adopt domain-specific learnable models. Moreover, we introduce a local memory mechanism for achieving the third capability. We evaluate our proposed ResponsibleTA on UI task automation and hope it could bring more attentions to ensuring LLMs more responsible in diverse scenarios.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2306.01242

Country:

Asia > Middle East > Israel (0.14)
North America > United States (0.14)

Genre:

Research Report (0.50)
Workflow (0.49)

Industry: Information Technology > Security & Privacy (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Learning Trajectories are Generalization Indicators

Fu, Jingwen, Zhang, Zhizheng, Yin, Dacheng, Lu, Yan, Zheng, Nanning

arXiv.org Artificial IntelligenceOct-31-2023

This paper explores the connection between learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized using (stochastic) gradient descent algorithms. Instead of concentrating solely on the generalization error of the DNN post-training, we present a novel perspective for analyzing generalization error by investigating the contribution of each update step to the change in generalization error. This perspective allows for a more direct comprehension of how the learning trajectory influences generalization error. Building upon this analysis, we propose a new generalization bound that incorporates more extensive trajectory information. Our proposed generalization bound depends on the complexity of learning trajectory and the ratio between the bias and diversity of training set. Experimental findings reveal that our method effectively captures the generalization error throughout the training process. Furthermore, our approach can also track changes in generalization error when adjustments are made to learning rates and label noise levels. These results demonstrate that learning trajectory information is a valuable indicator of a model's generalization capabilities.

artificial intelligence, generalization error, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2304.12579

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.57)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

When and Why Momentum Accelerates SGD:An Empirical Study

Fu, Jingwen, Wang, Bohan, Zhang, Huishuai, Zhang, Zhizheng, Chen, Wei, Zheng, Nanning

arXiv.org Artificial IntelligenceJun-15-2023

Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of ''when'', we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the \emph{effective learning rates} $\eta_{ef}$, a notion unifying the influence of momentum coefficient $\mu$ and batch size $b$ over learning rate $\eta$. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD experience almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of ``why'', we find that the momentum acceleration is closely related to \emph{abrupt sharpening} which is to describe a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges slower. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.

artificial intelligence, machine learning, momentum, (15 more...)

arXiv.org Artificial Intelligence

2306.09

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Versatile Neural Processes for Learning Implicit Neural Representations

Guo, Zongyu, Lan, Cuiling, Zhang, Zhizheng, Lu, Yan, Chen, Zhibo

arXiv.org Artificial IntelligenceFeb-21-2023

Representing a signal as a continuous function parameterized by neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer and informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. Particularly, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning. Code is available at https://github.com/ZongyuGuo/Versatile-NP .

artificial intelligence, machine learning, representation, (16 more...)

arXiv.org Artificial Intelligence

2301.08883

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback