LLM-Driven Corrective Robot Operation Code Generation with Static Text-Based Simulation

Wang, Wenhao, Rong, Yi, Li, Yanyan, Jiao, Long, Yuan, Jiawei

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have demonstrated their promising capabilities of generating robot operation code to enable LLM-driven robots. To enhance the reliability of operation code generated by LLMs, corrective designs with feedback from the observation of executing code have been increasingly adopted in existing research. However, the code execution in these designs relies on either a physical experiment or a customized simulation environment, which limits their deployment due to the high configuration effort of the environment and the potentially long execution time. In this paper, we explore the possibility of directly leveraging an LLM to enable static simulation of robot operation code, and then leverage it to design a new reliable LLM-driven corrective robot operation code generation framework. Our framework configures the LLM as a static simulator with enhanced capabilities that reliably simulate robot code execution by interpreting actions, reasoning over state transitions, analyzing execution outcomes, and generating semantic observations that accurately capture trajectory dynamics. To validate the performance of our framework, we performed experiments on various operation tasks for different robots, including UAVs and small ground vehicles. The experimental results demonstrate not only the high accuracy of our static text-based simulation but also the reliable code generation of our LLM-driven corrective framework, which achieves performance comparable to state-of-the-art research without relying on dynamic code execution using physical experiments or simulators.


TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning

Tomov, Momchil S., Lee, Sang Uk, Hendrago, Hansford, Huh, Jinwook, Han, Teawon, Howington, Forbes, da Silva, Rafael, Bernasconi, Gianmarco, Heim, Marc, Findler, Samuel, Ji, Xiaonan, Boule, Alexander, Napoli, Michael, Chen, Kuo, Miller, Jesse, Floor, Boaz, Hu, Yunqing

arXiv.org Artificial Intelligence

We present TreeIRL, a novel planner for autonomous driving that combines Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to achieve state-of-the-art performance in simulation and in real-world driving. The core idea is to use MCTS to find a promising set of safe candidate trajectories and a deep IRL scoring function to select the most human-like among them. We evaluate TreeIRL against both classical and state-of-the-art planners in large-scale simulations and on 500+ miles of real-world autonomous driving in the Las Vegas metropolitan area. Test scenarios include dense urban traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves the best overall performance, striking a balance between safety, progress, comfort, and human-likeness. To our knowledge, our work is the first demonstration of MCTS-based planning on public roads and underscores the importance of evaluating planners across a diverse set of metrics and in real-world environments. TreeIRL is highly extensible and could be further improved with reinforcement learning and imitation learning, providing a framework for exploring different combinations of classical and learning-based approaches to solve the planning bottleneck in autonomous driving.


Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation

Natan, Oskar, Miura, Jun

arXiv.org Artificial Intelligence

The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally-aware systems.
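The GNSS-based heading step mentioned above can be illustrated with the standard initial-bearing formula between two latitude/longitude fixes. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `bearing_deg` and the spherical-Earth formula are choices made here for illustration, and the abstract does not say which formula or filtering Seq-DeepIPC uses.

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from fix 1 to fix 2.

    Inputs are in decimal degrees. The result is in degrees,
    measured clockwise from true north, in the range [0, 360).
    Hypothetical helper: not taken from the Seq-DeepIPC code.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dlon = math.radians(lon2 - lon1)

    # East and north components of the direction of travel.
    east = math.sin(dlon) * math.cos(phi2)
    north = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))

    return math.degrees(math.atan2(east, north)) % 360.0
```

Because the bearing depends only on the difference between two consecutive positions, position noise of the kind seen near tall buildings translates directly into heading noise, which is consistent with the limitation the abstract reports.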


Degradation-Aware Cooperative Multi-Modal GNSS-Denied Localization Leveraging LiDAR-Based Robot Detections

Pritzl, Václav, Yu, Xianjia, Westerlund, Tomi, Štěpán, Petr, Saska, Martin

arXiv.org Artificial Intelligence

This work has been submitted to the IEEE for possible publication. Abstract--Accurate long-term localization using onboard sensors is crucial for robots operating in Global Navigation Satellite System (GNSS)-denied environments. While complementary sensors mitigate individual degradations, carrying all the available sensor types on a single robot significantly increases the size, weight, and power demands. Distributing sensors across multiple robots enhances the deployability but introduces challenges in fusing asynchronous, multi-modal data from independently moving platforms. We propose a novel adaptive multi-modal multi-robot cooperative localization approach using a factor-graph formulation to fuse asynchronous Visual-Inertial Odometry (VIO), LiDAR-Inertial Odometry (LIO), and 3D inter-robot detections from distinct robots in a loosely-coupled fashion. The approach adapts to changing conditions, leveraging reliable data to assist robots affected by sensory degradations. A novel interpolation-based factor enables fusion of the unsynchronized measurements. LIO degradations are evaluated based on the approximate scan-matching Hessian. A novel approach of weighting odometry data proportionally to the Wasserstein distance between the consecutive VIO outputs is proposed. A theoretical analysis is provided, investigating the cooperative localization problem under various conditions, mainly in the presence of sensory degradations. The proposed method has been extensively evaluated on real-world data gathered with heterogeneous teams of an Unmanned Ground Vehicle (UGV) and Unmanned Aerial Vehicles (UAVs), showing that the approach provides significant improvements in localization accuracy in the presence of various sensory degradations. In Global Navigation Satellite System (GNSS)-denied environments, fusing different localization modalities is crucial to provide robustness to various environmental challenges [1].
Visual-based localization requires cheap and light-weight sensors, but it is sensitive to illumination changes and texture-less environments. This work was supported by CTU grant no SGS23/177/OHK3/3T/13, by the Czech Science Foundation (GAČR) under research project No. 23-07517S, and by the European Union under the project Robotics and advanced industrial production (reg.


Enhancing the Cross-Size Generalization for Solving Vehicle Routing Problems via Continual Learning

Li, Jingwen, Cao, Zhiguang, Wu, Yaoxin, Liu, Tang

arXiv.org Artificial Intelligence

Exploring machine learning techniques for addressing vehicle routing problems has attracted considerable research attention. To achieve decent and efficient solutions, existing deep models for vehicle routing problems are typically trained and evaluated using instances of a single size. This substantially limits their ability to generalize across different problem sizes and thus hampers their practical applicability. To address the issue, we propose a continual learning based framework that sequentially trains a deep model with instances of ascending problem sizes. Specifically, on the one hand, we design an inter-task regularization scheme to retain the knowledge acquired from smaller problem sizes in the model training on a larger size. On the other hand, we introduce an intra-task regularization scheme to consolidate the model by imitating the latest desirable behaviors during training on each size. Additionally, we exploit the experience replay to revisit instances of formerly trained sizes for mitigating the catastrophic forgetting. Experimental results show that our approach achieves predominantly superior performance across various problem sizes (either seen or unseen in the training), as compared to state-of-the-art deep models including the ones specialized for generalizability enhancement. Meanwhile, the ablation studies on the key designs manifest their synergistic effect in the proposed framework.


Geometric Model Predictive Path Integral for Agile UAV Control with Online Collision Avoidance

Pochobradský, Pavel, Procházka, Ondřej, Pěnička, Robert, Vonásek, Vojtěch, Saska, Martin

arXiv.org Artificial Intelligence

In this letter, we introduce Geometric Model Predictive Path Integral (GMPPI), a sampling-based controller capable of tracking agile trajectories while avoiding obstacles. In each iteration, GMPPI generates a large number of candidate rollout trajectories and then averages them to create a nominal control to be followed by the Unmanned Aerial Vehicle (UAV). We propose using geometric SE(3) control to generate part of the rollout trajectories, significantly increasing precision in agile flight. Furthermore, we introduce varying rollout simulation time step length and dynamic cost and noise parameters, vastly improving tracking performance of smooth and low-speed trajectories over an existing Model Predictive Path Integral (MPPI) implementation. Finally, we propose an integration of GMPPI with a stereo depth camera, enabling online obstacle avoidance at high speeds, a crucial step towards autonomous UAV flights in complex environments. The proposed controller can track simulated agile reference trajectories with position error similar to the geometric SE(3) controller. However, the same configuration of the proposed controller can avoid obstacles in a simulated forest environment at speeds of up to 13 m/s, surpassing the performance of a state-of-the-art obstacle-aware planner. In real-world experiments, GMPPI retains the capability to track agile trajectories and avoids obstacles at speeds of up to 10 m/s.
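The rollout-averaging step described above can be sketched with the cost-weighted average used in the standard MPPI formulation, where each rollout is weighted by the exponential of its negative cost. This is an illustrative sketch only: the function names, the scalar control per time step, and the `temperature` parameter are assumptions made here, and the abstract does not specify how GMPPI weights its rollouts or how the SE(3)-generated rollouts enter the average.

```python
import math


def mppi_weights(costs, temperature=1.0):
    """Softmin weights over rollout costs (lower cost, larger weight).

    Subtracting the minimum cost before exponentiating keeps the
    computation numerically stable and does not change the result.
    """
    best = min(costs)
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_average(control_sequences, costs, temperature=1.0):
    """Cost-weighted average of sampled control sequences.

    control_sequences: one list of scalar controls per rollout,
    all of the same horizon length (a simplification for this sketch).
    costs: one accumulated cost per rollout.
    """
    weights = mppi_weights(costs, temperature)
    horizon = len(control_sequences[0])
    return [
        sum(w * seq[t] for w, seq in zip(weights, control_sequences))
        for t in range(horizon)
    ]
```

A smaller `temperature` concentrates the average on the lowest-cost rollouts, while a larger one moves it toward a plain mean of all samples.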


Weight Mapping Properties of a Dual Tree Single Clock Adiabatic Capacitive Neuron

Smart, Mike, Maheshwari, Sachin, Raghav, Himadri Singh, Serb, Alexander

arXiv.org Artificial Intelligence

Dual Tree Single Clock (DTSC) Adiabatic Capacitive Neuron (ACN) circuits offer the potential for highly energy-efficient Artificial Neural Network (ANN) computation in full custom analog IC designs. The efficient mapping of Artificial Neuron (AN) abstract weights, extracted from the software-trained ANNs, onto physical ACN capacitance values has, however, yet to be fully researched. In this paper, we explore the unexpected hidden complexities, challenges and properties of the mapping, as well as the ramifications for IC designers in terms of accuracy, design and implementation. We propose an optimal AN-to-ACN methodology that promotes smaller chip sizes and improved overall classification accuracy, necessary for successful practical deployment. Using TensorFlow and Larq software frameworks, we train three different ANN networks and map their weights into the energy-efficient DTSC ACN capacitance value domain to demonstrate 100% functional equivalency. Finally, we delve into the impact of weight quantization on ACN performance using novel metrics related to practical IC considerations, such as IC floor space and comparator decision-making efficacy.


HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling

Si, Yuke, Yang, Runyan, Gao, Yingying, Feng, Junlan, Deng, Chao, Zhang, Shilei

arXiv.org Artificial Intelligence

Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integration of both linguistic and paralinguistic cues. Existing multitask SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in information composition required by each task. Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing task-relevant components of speech representations. Specifically, it integrates a gated speech encoder to extract task-specific acoustic features and a prompt-adaptive dynamic fusion module to aggregate transformer layers based on task characteristics. In addition, a batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation. Experimental results demonstrate that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multitask speech understanding under realistic data constraints.


Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning

Abdelfattah, Abdullah, Khalil, Mahmoud I., Abbas, Hazem

arXiv.org Artificial Intelligence

Assessing spoken language is challenging, and quantifying pronunciation metrics for machine learning models is even harder. However, for the Holy Quran, this task is simplified by the rigorous recitation rules (tajweed) established by Muslim scholars, enabling highly effective assessment. Despite this advantage, the scarcity of high-quality annotated data remains a significant barrier. In this work, we bridge these gaps by introducing: (1) a 98% automated pipeline to produce high-quality Quranic datasets, encompassing collection of recitations from expert reciters, segmentation at pause points (waqf) using our fine-tuned wav2vec2-BERT model, transcription of segments, and transcript verification via our novel Tasmeea algorithm; (2) 850+ hours of audio (~300K annotated utterances); (3) a novel ASR-based approach for pronunciation error detection, utilizing our custom Quran Phonetic Script (QPS) to encode Tajweed rules (unlike the IPA standard for Modern Standard Arabic). QPS uses a two-level script: the phoneme level encodes Arabic letters with short/long vowels, and the sifa level encodes articulation characteristics of every phoneme. We further include comprehensive modeling with our novel multi-level CTC model, which achieved 0.16% average Phoneme Error Rate (PER) on the test set. We release all code, data, and models as open-source: https://obadx.github.io/prepare-quran-dataset/
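For readers unfamiliar with the metric, Phoneme Error Rate is commonly computed as the Levenshtein edit distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below shows that common definition; it is an assumption that the paper uses exactly this form, and the function names and phoneme labels are hypothetical placeholders, not QPS symbols.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a PER of 0.16% corresponds to roughly 16 phoneme-level edits per 10,000 reference phonemes.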


SKGE-SWIN: End-To-End Autonomous Vehicle Waypoint Prediction and Navigation Using Skip Stage Swin Transformer

Kartiman, Fachri Najm Noer, Rasim, Wihardi, Yaya, Hasanah, Nurul, Natan, Oskar, Wahono, Bambang, Salim, Taufik Ibnu

arXiv.org Artificial Intelligence

Abstract--Focusing on the development of an end-to-end autonomous vehicle model with pixel-to-pixel context awareness, this research proposes the SKGE-Swin architecture. This architecture utilizes the Swin Transformer with a skip-stage mechanism to broaden feature representation globally and at various network levels. This approach enables the model to extract information from distant pixels by leveraging the Swin Transformer's Shifted Window-based Multi-head Self-Attention (SW-MSA) mechanism and to retain critical information from the initial to the final stages of feature extraction, thereby enhancing its capability to comprehend complex patterns in the vehicle's surroundings. The model is evaluated on the CARLA platform using adversarial scenarios to simulate real-world conditions. Experimental results demonstrate that the SKGE-Swin architecture achieves a superior Driving Score compared to previous methods. Furthermore, an ablation study will be conducted to evaluate the contribution of each architectural component, including the influence of skip connections and the use of the Swin Transformer, in improving model performance. Index Terms--multitask learning, autonomous driving, end-to-end learning, skip connections, swin transformer, self-attention mechanism. I. Introduction. Autonomous driving is a complex intelligent system that handles tasks ranging from perception to vehicle control, necessitating distinct modules [1]. The conventional integration of these modules, however, is often intricate and inefficient. Fachri Najm Noer Kartiman is with Department of Computer Science, Indonesia University of Education, Bandung 40154, Indonesia (e-mail: fachri-najmnoer@upi.edu). Rasim is with Department of Computer Science, Indonesia University of Education, Bandung 40154, Indonesia (e-mail: rasim@upi.edu). Yaya Wihardi is with Department of Computer Science, Indonesia University of Education, Bandung 40154, Indonesia (e-mail: yayawihardi@upi.edu).
Nurul Hasanah is with Research Center for Smart Mechatronics, National Research and Innovation Agency, Bandung 40135, Indonesia (e-mail: nuru030@brin.go.id). Bambang Wahono is with Research Center for Smart Mechatronics, National Research and Innovation Agency, Bandung 40135, Indonesia (e-mail: bamb047@brin.go.id). Taufik Ibnu Salim is with Research Center for Smart Mechatronics, National Research and Innovation Agency, Bandung 40135, Indonesia (e-mail: tauf021@brin.go.id).