Collaborating Authors

 Ai, Bo


Learning Adaptive Dexterous Grasping from Single Demonstrations

arXiv.org Artificial Intelligence

How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model (VLM). To improve sample efficiency, we propose a trajectory following reward that guides reinforcement learning (RL) toward states close to a human demonstration while allowing flexibility in exploration. To learn beyond the single demonstration, we employ curriculum learning, progressively increasing object pose variations to enhance robustness. At deployment, a VLM retrieves the appropriate skill based on user instructions, bridging low-level learned skills with high-level intent. We evaluate AdaDexGrasp in both simulation and real-world settings, showing that our approach significantly improves RL efficiency and enables learning human-like grasp strategies across varied object configurations. Finally, we demonstrate zero-shot transfer of our learned policies to a real-world PSYONIC Ability Hand, with a 90% success rate across objects, significantly outperforming the baseline.
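The trajectory-following reward can be pictured as a soft penalty on the distance to the nearest demonstration state: the reward is high near the demonstrated trajectory but decays smoothly, leaving room for RL exploration. The sketch below is a minimal illustration under assumptions of my own (a Euclidean state metric and a Gaussian kernel with width `sigma`), not the paper's exact formulation.

```python
import numpy as np

def trajectory_following_reward(state, demo_traj, sigma=0.1):
    """Reward that peaks when the current state is close to the nearest
    demonstration state and decays smoothly with distance, so the RL
    agent is guided toward the demonstration without being pinned to it.

    state: (d,) current state; demo_traj: (T, d) single demonstration.
    """
    dists = np.linalg.norm(demo_traj - state, axis=1)
    return float(np.exp(-np.min(dists) ** 2 / (2 * sigma ** 2)))

# Toy usage: a straight-line demonstration in R^3.
demo = np.linspace([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], 50)
r_on = trajectory_following_reward(demo[10], demo)        # on the demo: ~1
r_off = trajectory_following_reward(np.array([5.0, 5.0, 5.0]), demo)  # far away: ~0
```

Because the kernel never saturates to a hard constraint, states slightly off the demonstration still receive informative gradient signal, which is what allows curriculum-driven pose variations to be absorbed.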


Diffusion Dynamics Models with Generative State Estimation for Cloth Manipulation

arXiv.org Artificial Intelligence

Our approach integrates state estimation and dynamics modeling under a consistent architecture and training paradigm. Our diffusion-based perception model generates cloth states from partial observations, and the diffusion-based dynamics model generates physically plausible future states conditioned on action sequences, enabling robust model-based control. Our work demonstrates the potential of diffusion models in state estimation and dynamics modeling for manipulation tasks involving partial observability and complex dynamics. Manipulating deformable objects like cloth is challenging due to their complex dynamics, near-infinite degrees of freedom, and frequent self-occlusions, which complicate state estimation and dynamics modeling. Prior work has struggled with robust cloth state estimation, while dynamics models, primarily based on Graph Neural Networks (GNNs), are limited by their locality. Inspired by recent advances in generative models, we hypothesize that these expressive models can effectively capture intricate cloth configurations and deformation patterns, and predict future states given the current state and robot actions. Leveraging a transformer-based diffusion model, our method achieves high-fidelity state reconstruction while reducing long-horizon dynamics prediction errors by an order of magnitude compared to GNN-based approaches. Integrated with model-predictive control (MPC), our framework successfully executes cloth folding on a real robotic system, demonstrating the potential of generative models for manipulation tasks with partial observability and complex dynamics.
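The model-based control loop described here, a learned dynamics model inside MPC, can be sketched with random-shooting planning: sample candidate action sequences, roll each through the dynamics model, and execute the first action of the lowest-cost rollout. The toy linear `dynamics` function below is a stand-in assumption for the paper's learned diffusion dynamics model, which would predict cloth particle states instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    # Stand-in for the learned diffusion dynamics model: a toy linear
    # system. The real model predicts physically plausible cloth states.
    return state + 0.1 * action

def mpc_random_shooting(state, goal, horizon=10, n_samples=256):
    """Sample action sequences, roll each out through the dynamics model,
    and return the first action of the lowest-cost rollout."""
    best_cost, best_action = np.inf, None
    for _ in range(n_samples):
        s = state.copy()
        actions = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
        for a in actions:
            s = dynamics(s, a)
        cost = np.linalg.norm(s - goal)  # terminal distance to the goal state
        if cost < best_cost:
            best_cost, best_action = cost, actions[0]
    return best_action, best_cost

# Toy usage: plan toward a 2-D goal from the origin.
a0, cost = mpc_random_shooting(np.zeros(2), np.array([0.5, -0.5]))
```

In practice the first action is executed, the state is re-estimated (here, by the diffusion-based perception model), and the plan is recomputed at every step.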


A CGAN-LSTM-Based Framework for Time-Varying Non-Stationary Channel Modeling

arXiv.org Artificial Intelligence

Time-varying non-stationary channels, with their complex dynamic variations and temporal evolution characteristics, pose significant challenges for channel modeling and communication system performance evaluation. Most existing time-varying channel modeling methods focus on predicting the channel state at a given moment or simulating short-term channel fluctuations, and are unable to capture the long-term evolution of the channel. This paper emphasizes the generation of long-term dynamic channels to fully capture the evolution of non-stationary channel properties. The generated channel not only reflects temporal dynamics but also ensures consistent stationarity. We propose a hybrid deep learning framework that combines a conditional generative adversarial network (CGAN) with long short-term memory (LSTM) networks. A stationarity-constrained approach is designed to ensure the temporal correlation of the generated time-series channel. This method can generate channels with the required temporal non-stationarity. The model is validated by comparing channel statistical features, and the results show that the generated channel is in good agreement with the raw channel and performs well in terms of non-stationarity.
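One way to quantify the (non-)stationarity the abstract refers to is the local stationarity interval: the span over which the channel's power-delay profile stays strongly correlated with a reference window. The sketch below is an illustrative metric under my own assumptions (window size, correlation threshold, PDP correlation as the similarity measure); the paper's stationarity constraint may be formulated differently.

```python
import numpy as np

def stationarity_interval(h, win=32, thresh=0.9):
    """Estimate the local stationarity interval of a time-varying channel.

    h: complex array of shape (T, L) -- T snapshots of an L-tap channel
    impulse response. Returns the number of consecutive snapshots whose
    windowed power-delay profiles remain correlated (>= thresh) with the
    first window's profile.
    """
    pdps = [np.abs(h[i:i + win]) ** 2
            for i in range(0, len(h) - win + 1, win)]
    pdps = [p.mean(axis=0) for p in pdps]   # average PDP per window
    ref, span = pdps[0], 1
    for p in pdps[1:]:
        if np.corrcoef(ref, p)[0, 1] < thresh:
            break
        span += 1
    return span * win

# Toy usage: a stationary channel vs. one whose PDP flips halfway through.
rng = np.random.default_rng(1)
prof = np.exp(-np.arange(8) / 2.0)          # exponential power-delay profile
h_stat = (rng.standard_normal((256, 8))
          + 1j * rng.standard_normal((256, 8))) * np.sqrt(prof / 2)
h_ns = h_stat.copy()
h_ns[128:] = h_stat[128:, ::-1]             # abrupt PDP reversal at t = 128
```

A generator trained with a stationarity constraint should reproduce the interval statistics of the measured channel, not just its instantaneous distributions.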


COST CA20120 INTERACT Framework of Artificial Intelligence Based Channel Modeling

arXiv.org Artificial Intelligence

Accurate channel models are the prerequisite for communication-theoretic investigations as well as system design. Channel modeling generally relies on statistical and deterministic approaches. However, traditional modeling methods still face significant limitations in terms of accuracy, generalization ability, and computational complexity. The fundamental reason is that establishing a quantified and accurate mapping between the physical environment and channel characteristics becomes increasingly challenging for modern communication systems. Here, in the context of the COST CA20120 Action, we evaluate and discuss the feasibility and implementation of using artificial intelligence (AI) for channel modeling, and explore where the future of this field lies. Firstly, we present a framework of AI-based channel modeling to characterize complex wireless channels. Then, we highlight in detail some major challenges and present possible solutions: i) estimating the uncertainty of AI-based channel predictions, ii) integrating prior knowledge of propagation to improve generalization capabilities, and iii) interpretable AI for channel modeling. We present and discuss illustrative numerical results to showcase the capabilities of AI-based channel modeling.


AI-Based Beam-Level and Cell-Level Mobility Management for High Speed Railway Communications

arXiv.org Artificial Intelligence

High-speed railway (HSR) communications are pivotal for ensuring rail safety, operations, maintenance, and delivering passenger information services. The high speed of trains creates rapidly time-varying wireless channels, increases the signaling overhead, and reduces the system throughput, making it difficult to meet the growing and stringent needs of HSR applications. In this article, we explore artificial intelligence (AI)-based beam-level and cell-level mobility management suitable for HSR communications, including the use cases, inputs, outputs, and key performance indicators (KPIs) of AI models. Particularly, in comparison to traditional down-sampled spatial beam measurements, we show that compressed spatial multi-beam measurements via compressive sensing lead to improved spatial-temporal beam prediction. Moreover, we demonstrate the performance gains of AI-assisted cell handover over traditional handover mechanisms. In addition, we observe that the proposed approaches to reduce the measurement overhead achieve radio link failure performance comparable to the traditional approach that requires all the beam measurements of all cells, while saving 50% of the beam measurement overhead.
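Compressive sensing of beam measurements exploits the fact that only a few beams carry significant power, so the full beam-power vector can be recovered from far fewer measurements. The abstract does not name a recovery algorithm, so the sketch below uses orthogonal matching pursuit (OMP), a standard choice, as an assumed illustration; the beam counts and measurement matrix are also my own toy values.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: recover a k-sparse beam-power vector
    x from m << n compressed measurements y = A @ x."""
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))  # most correlated beam
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef          # re-fit, update residual
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

# Toy usage: 64 beams, 3 dominant, recovered from 32 random projections.
rng = np.random.default_rng(2)
n_beams, m, k = 64, 32, 3
A = rng.standard_normal((m, n_beams)) / np.sqrt(m)   # measurement matrix
x = np.zeros(n_beams)
x[[5, 20, 41]] = [3.0, 2.0, 1.5]                     # sparse beam powers
x_hat = omp(A, A @ x, k)
```

Here m = 32 measurements stand in for n = 64 per-beam sweeps, which is the kind of overhead saving the article reports.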


IntentionNet: Map-Lite Visual Navigation at the Kilometre Scale

arXiv.org Artificial Intelligence

How can a robot navigate through diverse environments to distant goals? This remains an open challenge due to the complexity and difficulty of designing a robot that can generalise over environments, tolerate significant mapping and positioning inaccuracies, and recover from inevitable navigation errors. While many works tackle robot navigation, few systems capable of long-range, kilometre-scale navigation exist. Classical robot systems capable of long-range navigation like Montemerlo et al. (2008); Kümmerle et al. (2013) use explicit maps and find paths over them using classical planning algorithms (Siegwart et al. 2011), allowing them to reach arbitrarily distant goals in principle. Inspired by modern data-driven approaches, the lower level of our system design is a neural network-based controller that maps observations directly to velocity commands, and which is learned end-to-end from real-world experience. Neural networks have the flexibility to accept a wide variety of input types, and we find that the design space for the signals used by the system's upper level to guide the lower level is large. We exploit this property to design several different types of guidance signals, which we call intentions. We find that designing the appropriate intention imbues the navigation system with specific abilities, such as the ability to tolerate significant mapping and positioning inaccuracies.


RoboPack: Learning Tactile-Informed Dynamics Models for Dense Packing

arXiv.org Artificial Intelligence

Tactile feedback is critical for understanding the dynamics of both rigid and deformable objects in many manipulation tasks, such as non-prehensile manipulation and dense packing. We introduce an approach that combines visual and tactile sensing for robotic manipulation by learning a neural, tactile-informed dynamics model. Our proposed framework, RoboPack, employs a recurrent graph neural network to estimate object states, including particles and object-level latent physics information, from historical visuo-tactile observations and to perform future state predictions. Our tactile-informed dynamics model, learned from real-world data, can solve downstream robotics tasks with model-predictive control. We demonstrate our approach on a real robot equipped with a compliant Soft-Bubble tactile sensor on non-prehensile manipulation and dense packing tasks, where the robot must infer the physics properties of objects from direct and indirect interactions. Trained on only an average of 30 minutes of real-world interaction data per task, our model can perform online adaptation and make touch-informed predictions. Through extensive evaluations in both long-horizon dynamics prediction and real-world manipulation, our method demonstrates superior effectiveness compared to previous learning-based and physics-based simulation systems.


VideoQA-SC: Adaptive Semantic Communication for Video Question Answering

arXiv.org Artificial Intelligence

Although semantic communication (SC) has shown its potential in efficiently transmitting multi-modal data such as text, speech, and images, SC for videos has focused primarily on pixel-level reconstruction. However, these SC systems may be suboptimal for downstream intelligent tasks. Moreover, SC systems without pixel-level video reconstruction offer advantages in bandwidth efficiency and real-time performance for various intelligent tasks. The difficulty in such system design lies in the extraction of task-related compact semantic representations and their accurate delivery over noisy channels. In this paper, we propose an end-to-end SC system for video question answering (VideoQA) tasks called VideoQA-SC. Our goal is to accomplish VideoQA tasks directly based on video semantics over noisy or fading wireless channels, bypassing the need for video reconstruction at the receiver. To this end, we develop a spatiotemporal semantic encoder for effective video semantic extraction, and a learning-based bandwidth-adaptive deep joint source-channel coding (DJSCC) scheme for efficient and robust video semantic transmission. Experiments demonstrate that VideoQA-SC outperforms traditional and advanced DJSCC-based SC systems that rely on video reconstruction at the receiver under a wide range of channel conditions and bandwidth constraints. In particular, when the signal-to-noise ratio is low, VideoQA-SC improves answer accuracy by 5.17% while saving almost 99.5% of the bandwidth, compared with the advanced DJSCC-based SC system. Our results show the great potential of task-oriented SC system design for video applications.
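In DJSCC-style systems, the encoder's semantic symbols are power-normalized and passed through a differentiable channel model during training, most simply additive white Gaussian noise (AWGN) at a target SNR. The sketch below shows that standard channel layer; it is a generic illustration, not VideoQA-SC's specific bandwidth-adaptive scheme.

```python
import numpy as np

def awgn_channel(z, snr_db, rng):
    """Power-normalize the semantic symbol vector to unit average power,
    then add white Gaussian noise at the given SNR (in dB)."""
    z = z / np.sqrt(np.mean(z ** 2))       # unit average transmit power
    noise_std = 10.0 ** (-snr_db / 20.0)   # noise power = 10^(-SNR/10)
    return z + rng.standard_normal(z.shape) * noise_std

# Toy usage: transmit 1024 real-valued semantic symbols at 10 dB SNR.
rng = np.random.default_rng(3)
z = rng.standard_normal(1024)
received = awgn_channel(z, snr_db=10.0, rng=rng)
```

Because the layer is differentiable, the semantic encoder and the task head (here, the question-answering decoder) can be trained end-to-end through the channel, which is what makes the system robust at low SNR.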


Generative AI Agent for Next-Generation MIMO Design: Fundamentals, Challenges, and Vision

arXiv.org Artificial Intelligence

Next-generation multiple input multiple output (MIMO) is expected to be intelligent and scalable. In this paper, we study generative artificial intelligence (AI) agent-enabled next-generation MIMO design. Firstly, we provide an overview of the development, fundamentals, and challenges of next-generation MIMO. Then, we propose the concept of the generative AI agent, which is capable of generating tailored and specialized content with the aid of large language models (LLM) and retrieval-augmented generation (RAG). Next, we comprehensively discuss the features and advantages of the generative AI agent framework. More importantly, to tackle existing challenges of next-generation MIMO, we discuss generative AI agent-enabled next-generation MIMO design from the perspectives of performance analysis, signal processing, and resource allocation. Furthermore, we present two compelling case studies that demonstrate the effectiveness of leveraging the generative AI agent for performance analysis in complex configuration scenarios. These examples highlight how the integration of generative AI agents can significantly enhance the analysis and design of next-generation MIMO systems. Finally, we discuss important future research directions.


Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation

arXiv.org Artificial Intelligence

The data-driven approach to robot control has been gathering pace rapidly, yet generalization to unseen task domains remains a critical challenge. We argue that the key to generalization is representations that are (i) rich enough to capture all task-relevant information and (ii) invariant to superfluous variability between the training and the test domains. We experimentally study such a representation -- containing both depth and semantic information -- for visual navigation and show that it enables a control policy trained entirely in simulated indoor scenes to generalize to diverse real-world environments, both indoors and outdoors. Further, we show that our representation reduces the A-distance between the training and test domains, improving the generalization error bound as a result. Our proposed approach is scalable: the learned policy improves continuously, as the foundation models that it exploits absorb more diverse data during pre-training.
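The A-distance between training and test domains can be estimated by how well a classifier separates samples from the two domains: if no classifier beats chance, the representations are indistinguishable and the distance is near zero. The sketch below uses a leave-one-out 1-NN domain classifier as the proxy, one common estimator in the style of Ben-David et al.; the paper's own estimator may differ, and the data here is synthetic.

```python
import numpy as np

def proxy_a_distance(src, tgt):
    """Proxy A-distance 2*(1 - 2*err), where err is the leave-one-out
    error of a 1-NN classifier separating source from target features.
    Near 0 when the domains are indistinguishable in feature space,
    near 2 when a classifier separates them perfectly."""
    X = np.vstack([src, tgt])
    y = np.array([0] * len(src) + [1] * len(tgt))
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                          # exclude self-matches
    pred = y[np.argmin(d, axis=1)]                       # 1-NN domain label
    err = np.mean(pred != y)
    return 2.0 * (1.0 - 2.0 * err)

# Toy usage: overlapping domains vs. a clearly shifted domain.
rng = np.random.default_rng(4)
sim_feats = rng.standard_normal((100, 5))       # e.g. simulated-scene features
real_feats = rng.standard_normal((100, 5))      # invariant features: same dist.
shifted_feats = rng.standard_normal((100, 5)) + 10.0  # domain-specific features
```

An invariant representation is precisely one that drives this proxy toward zero between simulation and the real world, which is how the paper connects invariance to a tighter generalization bound.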