Wang, Cheng
Tricking Retrievers with Influential Tokens: An Efficient Black-Box Corpus Poisoning Attack
Wang, Cheng, Wang, Yiwei, Cai, Yujun, Hooi, Bryan
Retrieval-augmented generation (RAG) systems enhance large language models by incorporating external knowledge, addressing issues like outdated internal knowledge and hallucination. However, their reliance on external knowledge bases makes them vulnerable to corpus poisoning attacks, where adversarial passages can be injected to manipulate retrieval results. Existing methods for crafting such passages, such as random token replacement or training inversion models, are often slow and computationally expensive, requiring either access to retriever's gradients or large computational resources. To address these limitations, we propose Dynamic Importance-Guided Genetic Algorithm (DIGA), an efficient black-box method that leverages two key properties of retrievers: insensitivity to token order and bias towards influential tokens. By focusing on these characteristics, DIGA dynamically adjusts its genetic operations to generate effective adversarial passages with significantly reduced time and memory usage. Our experimental evaluation shows that DIGA achieves superior efficiency and scalability compared to existing methods, while maintaining comparable or better attack success rates across multiple datasets.
Controllable Latent Diffusion for Traffic Simulation
Xiao, Yizhuo, Erden, Mustafa Suphi, Wang, Cheng
The validation of autonomous driving systems benefits greatly from the ability to generate scenarios that are both realistic and precisely controllable. Conventional approaches, such as real-world test drives, are not only expensive but also lack the flexibility to capture targeted edge cases for thorough evaluation. To address these challenges, we propose a controllable latent diffusion that guides the training of diffusion models via reinforcement learning to automatically generate a diverse and controllable set of driving scenarios for virtual testing. Our approach removes the reliance on large-scale real-world data by generating complex scenarios whose properties can be finely tuned to challenge and assess autonomous vehicle systems. Experimental results show that our approach has the lowest collision rate of $0.098$ and lowest off-road rate of $0.096$, demonstrating superiority over existing baselines. The proposed approach significantly improves the realism, stability and controllability of the generated scenarios, enabling more nuanced safety evaluation of autonomous vehicles.
HAD-Gen: Human-like and Diverse Driving Behavior Modeling for Controllable Scenario Generation
Wang, Cheng, Kong, Lingxin, Tamborski, Massimiliano, Albrecht, Stefano V.
--Simulation-based testing has emerged as an essential tool for verifying and validating autonomous vehicles (A Vs). However, contemporary methodologies, such as deterministic and imitation learning-based driver models, struggle to capture the variability of human-like driving behavior . Given these challenges, we propose HAD-Gen, a general framework for realistic traffic scenario generation that simulates diverse human-like driving behaviors. It then employs maximum entropy inverse reinforcement learning on each of the clusters to learn the reward function corresponding to each driving style. Using these reward functions, the method integrates offline reinforcement learning pre-training and multi-agent reinforcement learning algorithms to obtain general and robust driving policies. Multi-perspective simulation results show that our proposed scenario generation framework can simulate diverse, human-like driving behaviors with strong generalization capability. The proposed framework achieves a 90.96% goal-reaching rate, an off-road rate of 2.08%, and a collision rate of 6.91% in the generalization test, outperforming prior approaches by over 20% in goal-reaching performance. UTONOMOUS vehicles (A Vs) represent a groundbreaking advancement in transportation technology, with the potential to enhance road safety, reduce traffic congestion, and improve overall mobility efficiency [1]. Over the past decade, significant progress has been made in transforming A Vs from theoretical concepts into practical implementations, with numerous prototypes successfully navigating urban environments [2]. Despite these advancements, ensuring the safety and reliability of A Vs across diverse real-world scenarios remains a significant challenge [3]. This work was funded by UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number EP/Z533464/1]. Cheng Wang is with the School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom (e-mail: cheng.wang@hw.ac.uk). Lingxin Kong is with the School of Automation and Software Engineering, Shanxi University, Taiyuan 030031, China (e-mail: 202223601006@email.sxu.edu.cn).
3DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o
Liu, Dingning, Wang, Cheng, Gao, Peng, Zhang, Renrui, Ma, Xinzhu, Meng, Yuan, Wang, Zhihui
Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method, called 3DAxisPrompt, to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding and reasoning ability to real-world 3D scenarios. Besides, we first provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o, as a representative of MLLMs. Finally, we build evaluation environments with four datasets, i.e., ScanRefer, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that MLLMs, with the help of 3DAxisPrompt, can effectively perceive an object's 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.
MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways
Chen, Zhen, Peng, Zhihao, Liang, Xusheng, Wang, Cheng, Liang, Peigan, Zeng, Linsheng, Ju, Minjie, Yuan, Yixuan
Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrated on medical question-answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents, including a triage agent managing the patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. It is worth noting that our MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12%, establishing a foundation for inpatient pathways systems.
Mining and Transferring Feature-Geometry Coherence for Unsupervised Point Cloud Registration
Xiong, Kezheng, Xiang, Haoen, Xu, Qingshan, Wen, Chenglu, Shen, Siqi, Li, Jonathan, Wang, Cheng
Point cloud registration, a fundamental task in 3D vision, has achieved remarkable success with learning-based methods in outdoor environments. Unsupervised outdoor point cloud registration methods have recently emerged to circumvent the need for costly pose annotations. However, they fail to establish reliable optimization objectives for unsupervised training, either relying on overly strong geometric assumptions, or suffering from poor-quality pseudo-labels due to inadequate integration of low-level geometric and high-level contextual information. We have observed that in the feature space, latent new inlier correspondences tend to cluster around respective positive anchors that summarize features of existing inliers. Motivated by this observation, we propose a novel unsupervised registration method termed INTEGER to incorporate high-level contextual information for reliable pseudo-label mining. Specifically, we propose the Feature-Geometry Coherence Mining module to dynamically adapt the teacher for each mini-batch of data during training and discover reliable pseudo-labels by considering both high-level feature representations and low-level geometric cues. Furthermore, we propose Anchor-Based Contrastive Learning to facilitate contrastive learning with anchors for a robust feature space. Lastly, we introduce a Mixed-Density Student to learn density-invariant features, addressing challenges related to density variation and low overlap in the outdoor scenario. Extensive experiments on KITTI and nuScenes datasets demonstrate that our INTEGER achieves competitive performance in terms of accuracy and generalizability.
ConDo: Continual Domain Expansion for Absolute Pose Regression
Li, Zijun, Cai, Zhipeng, Yang, Bochun, Shen, Xuelun, Shen, Siqi, Fan, Xiaoliang, Paulitsch, Michael, Wang, Cheng
Visual localization is a fundamental machine learning problem. Absolute Pose Regression (APR) trains a scene-dependent model to efficiently map an input image to the camera pose in a pre-defined scene. However, many applications have continually changing environments, where inference data at novel poses or scene conditions (weather, geometry) appear after deployment. Training APR on a fixed dataset leads to overfitting, making it fail catastrophically on challenging novel data. This work proposes Continual Domain Expansion (ConDo), which continually collects unlabeled inference data to update the deployed APR. Instead of applying standard unsupervised domain adaptation methods which are ineffective for APR, ConDo effectively learns from unlabeled data by distilling knowledge from scene-agnostic localization methods. By sampling data uniformly from historical and newly collected data, ConDo can effectively expand the generalization domain of APR. Large-scale benchmarks with various scene types are constructed to evaluate models under practical (long-term) data changes. ConDo consistently and significantly outperforms baselines across architectures, scene types, and data changes. On challenging scenes (Fig.1), it reduces the localization error by >7x (14.8m vs 1.7m). Analysis shows the robustness of ConDo against compute budgets, replay buffer sizes and teacher prediction noise. Comparing to model re-training, ConDo achieves similar performance up to 25x faster.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Liao, Bencheng, Chen, Shaoyu, Yin, Haoran, Jiang, Bo, Wang, Cheng, Yan, Sixu, Zhang, Xinbang, Li, Xiangyu, Zhang, Ying, Zhang, Qian, Wang, Xinggang
Recently, the diffusion model has emerged as a powerful generative technique for robotic policy learning, capable of modeling multi-mode action distributions. Leveraging its capability for end-to-end autonomous driving is a promising direction. However, the numerous denoising steps in the robotic diffusion policy and the more dynamic, open-world nature of traffic scenes pose substantial challenges for generating diverse driving actions at a real-time speed. To address these challenges, we propose a novel truncated diffusion policy that incorporates prior multi-mode anchors and truncates the diffusion schedule, enabling the model to learn denoising from anchored Gaussian distribution to the multi-mode driving action distribution. Additionally, we design an efficient cascade diffusion decoder for enhanced interaction with conditional scene context. The proposed model, DiffusionDrive, demonstrates 10$\times$ reduction in denoising steps compared to vanilla diffusion policy, delivering superior diversity and quality in just 2 steps. On the planning-oriented NAVSIM dataset, with the aligned ResNet-34 backbone, DiffusionDrive achieves 88.1 PDMS without bells and whistles, setting a new record, while running at a real-time speed of 45 FPS on an NVIDIA 4090. Qualitative results on challenging scenarios further confirm that DiffusionDrive can robustly generate diverse plausible driving actions. Code and model will be available at https://github.com/hustvl/DiffusionDrive.
HotSpot: Screened Poisson Equation for Signed Distance Function Optimization
Wang, Zimo, Wang, Cheng, Yoshino, Taiki, Tao, Sirui, Fu, Ziyang, Li, Tzu-Mao
Existing losses such as the eikonal loss is to ensure that the implicit function indeed outputs cannot guarantee the recovered implicit function to be a the signed distance. A standard regularization loss used is distance function, even when the implicit function satisfies the eikonal equation: it constrains the norm of the gradient the eikonal equation almost everywhere. Furthermore, the of an implicit function to be 1 almost everywhere. If eikonal loss suffers from stability issues in optimization and the implicit function is a signed distance function, then it the remedies that introduce area or divergence minimization satisfies the eikonal equation. However, the converse is not can lead to oversmoothing. We address these challenges true. Figure 1 shows an example: on the left, we optimize by designing a loss function that when minimized can an implicit function to satisfy the eikonal equation, while converge to the true distance function, is stable, and naturally it successfully does so, it converges to a solution that is far penalize large surface area. We provide theoretical from the actual distance [5, 6].
Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models
Cong, Youan, Wang, Cheng, Akash, Pritom Saha, Chang, Kevin Chen-Chuan
We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.