Collaborating Authors: Zhu, Xiang


Video Super-Resolution: All You Need is a Video Diffusion Model

arXiv.org Artificial Intelligence

The concept of super-resolution was first proposed in the 1980s [1, 2], primarily focusing on multi-frame image super-resolution, also known as video super-resolution (VSR). The fundamental principle involves aligning and fusing image information of the same object across multiple frames to surpass the Nyquist limit. This process represents a typical inverse problem, requiring sub-pixel spatial alignment across frames, along with resampling and deconvolution, to achieve enhanced resolution. Over the past decade, the primary focus of super-resolution has shifted towards single image super-resolution (SISR), which eliminates the need for spatial alignment or motion estimation. The recovery of high-frequency components in SISR predominantly relies on deep neural networks such as convolutional neural networks (CNNs) [3, 4, 5]. These networks map a low-resolution (LR) input image to the corresponding high-resolution (HR) output, mimicking the behavior of deconvolution. Such methods are effective when the upscaling factor is below 4x; beyond this value, however, the output images tend to appear overly smoothed. Since 2022, diffusion models (DMs) [6, 7] have become increasingly important in SISR.
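The LR-to-HR mapping described above is easy to make concrete. Below is a minimal sketch, assuming PyTorch, of an SRCNN-style network that bicubically upsamples the input and learns only the residual high-frequency detail; the layer widths and kernel sizes are illustrative assumptions, not the architectures of the cited works [3, 4, 5].

```python
# Minimal SRCNN-style single-image super-resolution sketch (PyTorch).
# Layer widths and kernel sizes are illustrative assumptions, not the
# exact architectures of the cited works [3, 4, 5].
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRCNN(nn.Module):
    def __init__(self, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4),   # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),  # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=5, padding=2),   # reconstruction
        )

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        # Upsample first, then let the CNN restore high-frequency detail,
        # mimicking the deconvolution behavior described above.
        up = F.interpolate(lr, scale_factor=self.scale, mode="bicubic",
                           align_corners=False)
        return up + self.features(up)  # residual learning of lost detail

lr = torch.rand(1, 3, 32, 32)    # a toy 32x32 low-resolution input
hr = TinySRCNN(scale=4)(lr)      # 4x upscaled output: 1x3x128x128
```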


UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

arXiv.org Artificial Intelligence

Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
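The abstract only names the two objectives, so the sketch below is a hypothetical illustration, assuming PyTorch, of how an understanding loss and a future-prediction loss might share one backbone and be summed into a joint training signal; every module and weight here is an invented stand-in, not the UP-VLA implementation.

```python
# Hypothetical joint objective in the spirit of the abstract: one network
# is trained both to understand (predict answer tokens) and to predict
# (reconstruct the next visual observation). All modules, shapes, and
# weights are illustrative stand-ins, not the paper's implementation.
import torch
import torch.nn as nn

class ToyUnifiedVLA(nn.Module):
    def __init__(self, dim: int = 64, vocab: int = 100):
        super().__init__()
        self.encoder = nn.Linear(3 * 32 * 32, dim)     # shared backbone
        self.text_head = nn.Linear(dim, vocab)         # understanding head
        self.pixel_head = nn.Linear(dim, 3 * 32 * 32)  # prediction head

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.encoder(image.flatten(1))

model = ToyUnifiedVLA()
image = torch.rand(8, 3, 32, 32)
next_image = torch.rand(8, 3, 32, 32)
answer_tokens = torch.randint(0, 100, (8,))

feat = model(image)
# Understanding objective: classify/describe the current observation.
l_understand = nn.functional.cross_entropy(model.text_head(feat), answer_tokens)
# Prediction objective: regress the next observation from the same features.
l_predict = nn.functional.mse_loss(model.pixel_head(feat), next_image.flatten(1))
loss = 1.0 * l_understand + 1.0 * l_predict  # joint training signal
loss.backward()
```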


Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

arXiv.org Artificial Intelligence

Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. This sidesteps explicit kernel estimation and effectively accommodates a wide range of motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.
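The temporal-averaging view is simple to state as a forward model: the observed blurry image is the mean of the sharp frames the sensor integrated during the exposure. Here is a minimal NumPy sketch of that forward model on synthetic data, with the inverse-problem reading in the comments:

```python
# Forward model behind the "motion blur as temporal averaging" view:
# the observed blurry image is the mean of the T sharp frames the
# sensor integrated during the exposure. Synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 64, 64
sharp_video = rng.random((T, H, W)).astype(np.float32)  # latent sharp frames

blurred = sharp_video.mean(axis=0)  # what the camera actually records

# Deblurring is then the inverse problem: find a plausible sharp video
# whose temporal average matches the observation. The paper regularizes
# this inversion with a pre-trained video diffusion prior; here we only
# check data consistency for a trivial (still blurry) candidate.
candidate = np.repeat(blurred[None], T, axis=0)
data_fidelity = np.linalg.norm(candidate.mean(axis=0) - blurred)
print(f"data fidelity of trivial candidate: {data_fidelity:.2e}")
```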


Stylized Table Tennis Robots Skill Learning with Incomplete Human Demonstrations

arXiv.org Artificial Intelligence

In recent years, Reinforcement Learning (RL) has become a popular technique for training robot controllers. However, for complex dynamic robot control tasks, RL-based methods often produce controllers with unrealistic styles. In contrast, humans can learn well-stylized skills under supervision. For example, people learn table tennis skills by imitating the motions of coaches. Such reference motions are often incomplete, e.g., recorded without the presence of an actual ball. Inspired by this, we propose an RL-based algorithm to train a robot that can learn the playing style from such incomplete human demonstrations. We collect data through the teaching-and-dragging method. We also propose data augmentation techniques to enable our robot to adapt to balls of different velocities. We finally evaluate our policy in different simulators with varying dynamics.
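The abstract does not spell out how the style supervision enters the objective. One common recipe, sketched below purely under that assumption, is to shape the RL reward with an imitation term that tracks the ball-free reference motion; all names and weights are hypothetical, not the paper's formulation.

```python
# Hypothetical reward shaping for style imitation from incomplete
# demonstrations: a task reward (e.g., returning the ball) plus a style
# term that penalizes deviation from the reference coach motion, which
# was recorded without a ball. Names and weights are assumptions.
import numpy as np

def reward(task_success: float, joint_pos: np.ndarray,
           reference_pos: np.ndarray, w_style: float = 0.5) -> float:
    # Style term: how closely the robot tracks the demonstrated motion.
    style = np.exp(-np.sum((joint_pos - reference_pos) ** 2))
    return task_success + w_style * style

# Example: perfect tracking of the reference earns the full style bonus.
q = np.zeros(7)
print(reward(task_success=1.0, joint_pos=q, reference_pos=q))  # 1.5
```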


A Contact-Safe Reinforcement Learning Framework for Contact-Rich Robot Manipulation

arXiv.org Artificial Intelligence

Reinforcement learning shows great potential for solving complex contact-rich robot manipulation tasks. However, the safety of using RL in the real world is a crucial problem, since unexpected dangerous collisions may happen when the RL policy is imperfect during training or in unseen scenarios. In this paper, we propose a contact-safe reinforcement learning framework for contact-rich robot manipulation, which maintains safety in both the task space and the joint space. When the RL policy causes unexpected collisions between the robot arm and the environment, our framework is able to immediately detect the collision and keep the contact force small. Furthermore, the end-effector is enforced to perform contact-rich tasks compliantly while remaining robust to external disturbances. We train the RL policy in simulation and transfer it to the real robot. Real-world experiments on robot wiping tasks show that our method keeps the contact force small in both task space and joint space, even when the policy encounters unseen scenarios with unexpected collisions, while rejecting disturbances on the main task.
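As a rough illustration of the collision-handling behavior described above, the sketch below monitors the measured contact force and overrides the RL command with a compliant yield when a threshold is exceeded; the threshold, gain, and interfaces are assumptions, not the paper's controller.

```python
# Hedged sketch of a contact-safety layer: when the measured contact
# force exceeds a limit, yield along the direction the environment is
# pushing instead of executing the RL action. Threshold and gain are
# illustrative assumptions, not the paper's controller.
import numpy as np

FORCE_LIMIT = 15.0  # N; assumed threshold for an "unexpected collision"

def safe_action(rl_action: np.ndarray, contact_force: np.ndarray,
                compliance_gain: float = 0.02) -> np.ndarray:
    """Pass the RL action through unless a collision is detected."""
    if np.linalg.norm(contact_force) > FORCE_LIMIT:
        # Collision: move compliantly along the external force so the
        # contact force stays small, as the framework above requires.
        return compliance_gain * contact_force
    return rl_action

# Example: a 30 N lateral contact triggers the compliant override.
print(safe_action(np.array([0.1, 0.0, 0.0]),
                  np.array([30.0, 0.0, 0.0])))
```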