
Collaborating Authors

 Wang, Yanbo


Breaking Focus: Contextual Distraction Curse in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) (Zhou et al., 2023b) have demonstrated remarkable capabilities across various Natural Language Processing (NLP) tasks, revolutionizing a wide range of downstream applications such as medicine (Zhao et al., 2023), education (Kasneci et al., 2023), and science (Li et al., 2024b; Guo et al., 2023; Huang et al., 2024e). Despite their impressive performance, recent studies have exposed various vulnerabilities in LLMs, including susceptibility to jailbreaking attacks (Zou et al., 2023), hallucination issues (Xu et al., 2024b), and consistency problems (Liang et al., 2024; Huang et al., 2024a). These vulnerabilities highlight the limitations of LLMs in handling nuanced and adversarial scenarios, making it critical to uncover and analyze additional weaknesses to improve their reliability. In this work, we investigate a novel vulnerability termed Contextual Distraction Vulnerability (CDV), where semantically coherent but non-essential contextual additions to a question degrade LLM performance. For instance, a customer service chatbot might miss a refund request hidden in a short story about discovering products through social media influencers. Similarly, a technical query about machine learning could be misunderstood if it is preceded by a student's emotional account of exam preparation anxiety. Unlike adversarial attacks that inject semantically meaningless noise into inputs (Zou et al., 2023; Shi et al., 2024), or the distraction caused by long-context inputs (Bai et al., 2023), our study demonstrates that modifications which are semantically coherent yet contextually distracting, even without a long context, are sufficient to disrupt the decision-making process of even the most advanced LLMs. This vulnerability underscores a critical weakness in LLMs' ability to filter out irrelevant information and prioritize core knowledge, which is essential for robust reasoning. Recent studies have demonstrated the powerful generative capabilities of LLMs (Xu et al., 2024a; Wu et al., 2024). To systematically investigate this vulnerability, we propose a methodology for
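The perturbation idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the paper's actual pipeline): `add_contextual_distraction` and `is_distracted` are illustrative names, and the narrative text is an invented example of a coherent but task-irrelevant addition.

```python
# Hypothetical sketch: wrap a core question in a semantically coherent
# but task-irrelevant narrative, mimicking a CDV-style perturbation.

def add_contextual_distraction(question: str, narrative: str) -> str:
    """Prepend a coherent but non-essential narrative to a question."""
    return f"{narrative.strip()} {question.strip()}"

def is_distracted(answer_original: str, answer_perturbed: str) -> bool:
    """Flag a CDV failure when the perturbed prompt changes the answer."""
    return answer_original.strip().lower() != answer_perturbed.strip().lower()

narrative = ("I first heard about your store from an influencer's video, "
             "which led me down a rabbit hole of product reviews.")
question = "Can I get a refund for order #1234?"
prompt = add_contextual_distraction(question, narrative)
```

In an actual evaluation, both the original question and the distracted prompt would be sent to the model under test, and `is_distracted` would compare the two answers.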


Sample Correlation for Fingerprinting Deep Face Recognition

arXiv.org Artificial Intelligence

Abstract: Face recognition has witnessed remarkable advancements in recent years. However, an off-the-shelf face recognition model offered as a commercial service could be stolen by model stealing attacks, posing great threats to the rights of the model owner. Model fingerprinting, as a model stealing detection method, aims to verify whether a suspect model is stolen from the victim model, and is gaining more and more attention nowadays. Previous methods always utilize transferable adversarial examples as the model fingerprint, but this method is known to be sensitive to adversarial defense and transfer learning techniques. To address this issue, we consider the pairwise relationship between samples instead and propose a novel yet simple model stealing detection method based on SAmple Correlation (SAC).

Keywords: Model Fingerprinting, Deep Face Recognition

1 Introduction

In recent years, remarkable advancements in face recognition have been largely attributable to the development of deep learning techniques [1]. A common practice for model owners is to offer their models to clients through either cloud-based services or client-side software. Generally, training deep neural networks, especially deep face recognition models, is both resource-intensive and financially burdensome, requiring extensive data collection
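The core SAC intuition, comparing pairwise relationships between samples rather than per-sample outputs, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the "correlation" here is cosine similarity between per-sample output vectors, and the threshold-free comparison is only indicative.

```python
import numpy as np

def correlation_matrix(outputs: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between a model's per-sample outputs."""
    normed = outputs / np.linalg.norm(outputs, axis=1, keepdims=True)
    return normed @ normed.T

def sac_distance(victim_out: np.ndarray, suspect_out: np.ndarray) -> float:
    """Mean absolute gap between the two sample-correlation matrices.
    A small gap suggests the suspect preserves the victim's pairwise
    structure, which survives defenses that perturb individual outputs."""
    return float(np.abs(correlation_matrix(victim_out) -
                        correlation_matrix(suspect_out)).mean())

rng = np.random.default_rng(0)
victim = rng.normal(size=(8, 16))                    # victim outputs on 8 probes
stolen = victim + 0.01 * rng.normal(size=(8, 16))    # near-copy of the victim
independent = rng.normal(size=(8, 16))               # unrelated model
```

A stolen model would then be flagged when its `sac_distance` to the victim falls below a calibrated threshold.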


XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

arXiv.org Artificial Intelligence

Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we develop a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings.
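The mask-level alignment step can be illustrated schematically: pool per-pixel features inside a binary mask into a single mask embedding, then compare it against a text embedding. This is a toy sketch with invented shapes, not the XMask3D architecture; `mask_embedding` and `cosine` are illustrative helpers.

```python
import numpy as np

def mask_embedding(pixel_feats: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the per-pixel features that fall inside a binary mask."""
    return pixel_feats[mask.astype(bool)].mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity used to align mask embeddings with text."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

H, W, C = 4, 4, 8
rng = np.random.default_rng(1)
text_emb = rng.normal(size=C)
# Toy case: every pixel feature already equals the text embedding,
# so the pooled mask embedding should align perfectly with the text.
pixel_feats = np.tile(text_emb, (H, W, 1))
mask = np.zeros((H, W))
mask[:2, :2] = 1
emb = mask_embedding(pixel_feats, mask)
```

In the actual framework this similarity would be computed between mask-level 3D representations and vision-language embeddings, with masks produced by the diffusion-based generator.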


Reconsidering the Performance of GAE in Link Prediction

arXiv.org Artificial Intelligence

Various graph neural networks (GNNs) with advanced training techniques and model designs have been proposed for link prediction tasks. However, outdated baseline models may lead to an overestimation of the benefits provided by these novel approaches. To address this, we systematically investigate the potential of Graph Autoencoders (GAE) by meticulously tuning hyperparameters and utilizing the trick of orthogonal embedding and linear propagation. Our findings reveal that a well-optimized GAE can match the performance of more complex models while offering greater computational efficiency. Link prediction is a fundamental task in the field of graph learning, with applications spanning various domains such as recommendation systems (Zhang & Chen, 2020), drug discovery (Souri et al., 2022) and knowledge graph completion (Zhu et al., 2021b).
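A plain GAE link predictor of the kind tuned above can be sketched in a few lines: linear (nonlinearity-free) propagation from orthogonal one-hot input embeddings, followed by inner-product decoding. The hyperparameters and the toy graph are illustrative, not the paper's configuration.

```python
import numpy as np

def normalized_adj(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + np.eye(len(A))
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gae_scores(A: np.ndarray, hops: int = 2) -> np.ndarray:
    """Linear propagation from orthogonal (one-hot) embeddings, then
    inner-product decoding with a sigmoid, as in a plain GAE decoder."""
    Z = np.eye(len(A))              # orthogonal input embeddings
    A_norm = normalized_adj(A)
    for _ in range(hops):
        Z = A_norm @ Z              # linear propagation, no nonlinearity
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

# Path graph 0-1-2-3: the edge (1,2) should score higher than the non-edge (0,3)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
S = gae_scores(A)
```

The propagation is a fixed linear operator here; a trained GAE would instead learn the embedding matrix, but the decoding and the computational profile are the same.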


AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

arXiv.org Artificial Intelligence

Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information, facilitating a wide range of complex applications and tasks. However, the evaluation of LVLMs presents significant challenges: evaluation benchmarks typically demand substantial human effort to construct, and they remain static, lacking flexibility once built. Although automatic evaluation has been explored in the textual modality, the visual modality remains under-explored. As a result, in this work, we address the question: "Can LVLMs serve as a path to automatic benchmarking?". We introduce AutoBench-V, an automated framework for evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability. Upon receiving an evaluation capability, AutoBench-V leverages text-to-image models to generate relevant image samples and then utilizes LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of seven popular LVLMs across five user-specified inputs (i.e., evaluation capabilities), the framework shows effectiveness and reliability. We observe the following: (1) our constructed benchmark accurately reflects varying task difficulties; (2) as task difficulty rises, the performance gap between models widens; (3) while models exhibit strong performance in abstract-level understanding, they underperform on detail-oriented reasoning tasks; and (4) constructing a dataset with varying levels of difficulty is critical for a comprehensive and exhaustive evaluation. Overall, AutoBench-V not only successfully utilizes LVLMs for automated benchmarking but also reveals that LVLMs, as judges, have significant potential in various domains.
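The on-demand evaluation loop can be sketched as a simple orchestration pipeline. All three model calls below are stubs standing in for the text-to-image model, the evaluated LVLM, and the LVLM judge; the function names, prompt, and difficulty levels are assumptions for illustration, not the AutoBench-V API.

```python
# Hypothetical orchestration sketch of an on-demand VQA benchmark loop.
# `generate_image`, `ask_lvlm`, and `judge` are stubs standing in for the
# text-to-image model, the evaluated LVLM, and the LVLM judge.

def generate_image(aspect: str, difficulty: str) -> str:
    return f"<image:{aspect}/{difficulty}>"      # stub for a T2I model

def ask_lvlm(image: str, question: str) -> str:
    return "stub answer"                         # stub for the evaluated LVLM

def judge(answer: str, reference: str) -> bool:
    return answer == reference                   # stub for the LVLM judge

def run_benchmark(aspect: str, difficulties: list[str]) -> dict[str, float]:
    """Score one user-specified capability aspect across difficulty levels."""
    scores = {}
    for level in difficulties:
        image = generate_image(aspect, level)
        answer = ask_lvlm(image, f"Describe the {aspect} in this scene.")
        scores[level] = 1.0 if judge(answer, "stub answer") else 0.0
    return scores

results = run_benchmark("spatial understanding", ["easy", "medium", "hard"])
```

The key design point is that the benchmark is constructed at request time from the capability description, so no static dataset needs to exist in advance.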


Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

arXiv.org Artificial Intelligence

LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and has served as a source of supervised rewards in model training. However, despite its excellence in many domains, potential issues remain under-explored, undermining its reliability and the scope of its utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge using automated, principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models achieve commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
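The principle-guided probing idea can be illustrated with one bias type. The sketch below probes position bias by applying a content-preserving modification (swapping answer order) and measuring how often the judge's verdict fails to track the swap; the toy judge is deliberately maximally biased. Everything here is an invented minimal example, not the CALM framework itself.

```python
# Minimal sketch of principle-guided bias probing: apply a content-preserving
# modification (swapping answer order, to probe position bias) and measure
# how often the verdict fails to follow the swap.

def toy_judge(answer_a: str, answer_b: str) -> str:
    """Toy position-biased judge: always prefers the first slot."""
    return "A"

def inconsistency_rate(pairs) -> float:
    """Fraction of pairs whose verdict does not track the position swap.
    An unbiased judge's verdict maps A -> B and B -> A after swapping."""
    inconsistent = 0
    for a, b in pairs:
        original = toy_judge(a, b)
        swapped = toy_judge(b, a)
        expected = {"A": "B", "B": "A"}[original]
        inconsistent += int(swapped != expected)
    return inconsistent / len(pairs)

pairs = [("good answer", "bad answer"), ("x", "y"), ("p", "q")]
rate = inconsistency_rate(pairs)
```

Because the toy judge ignores content entirely, its inconsistency rate is 1.0; a real judge would land somewhere between 0 (unbiased) and 1 (fully position-biased), giving a per-bias score.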


AntibodyFlow: Normalizing Flow Model for Designing Antibody Complementarity-Determining Regions

arXiv.org Artificial Intelligence

Therapeutic antibodies have been extensively studied in drug discovery and development in the past decades. Antibodies are specialized protective proteins that bind to antigens in a lock-and-key manner. The binding strength/affinity between an antibody and a specific antigen is heavily determined by the complementarity-determining regions (CDRs) on the antibodies. Existing machine learning methods cast in silico development of CDRs as either sequence or 3D graph (with a single chain) generation tasks and have achieved initial success. However, with CDR loops having specific geometric shapes, learning the 3D geometric structures of CDRs remains a challenge. To address this issue, we propose AntibodyFlow, a 3D flow model to design antibody CDR loops. Specifically, AntibodyFlow first constructs the distance matrix, then predicts amino acids conditioned on the distance matrix. Also, AntibodyFlow conducts constraint learning and constrained generation to ensure valid 3D structures. Experimental results indicate that AntibodyFlow outperforms the best baseline consistently with up to 16.0% relative improvement in validity rate and 24.3% relative reduction in geometric graph level error (root mean square deviation, RMSD).
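The distance-matrix representation and the validity constraints it must obey can be made concrete. The checker below tests the geometric constraints any pairwise-distance matrix of a real 3D structure satisfies (symmetry, zero diagonal, non-negativity, triangle inequality); it is a generic illustration of why constrained generation is needed, not AntibodyFlow's actual constraint-learning procedure.

```python
import numpy as np

def is_valid_distance_matrix(D: np.ndarray, tol: float = 1e-8) -> bool:
    """Check constraints a pairwise-distance matrix of a real 3D structure
    must satisfy: symmetry, zero diagonal, non-negativity, and the
    triangle inequality."""
    if not np.allclose(D, D.T, atol=tol):
        return False
    if not np.allclose(np.diag(D), 0.0, atol=tol):
        return False
    if (D < -tol).any():
        return False
    n = len(D)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if D[i, j] > D[i, k] + D[k, j] + tol:
                    return False
    return True

# Distances derived from actual 3D coordinates always pass the check.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.5, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 0.0, 1.0]])
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

A generative model that emits distance matrices freely can violate these constraints, which is precisely what constrained generation is meant to prevent.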


Precipitation Nowcasting Using Physics Informed Discriminator Generative Models

arXiv.org Artificial Intelligence

Nowcasting leverages real-time atmospheric conditions to forecast weather over short periods. State-of-the-art models, including PySTEPS, encounter difficulties in accurately forecasting extreme weather events because of their unpredictable distribution patterns. In this study, we design a physics-informed neural network to perform precipitation nowcasting using the precipitation and meteorological data from the Royal Netherlands Meteorological Institute (KNMI). This model draws inspiration from the novel Physics-Informed Discriminator GAN (PID-GAN) formulation, directly integrating physics-based supervision within the adversarial learning framework. The proposed model adopts a GAN structure, featuring a Vector Quantization Generative Adversarial Network (VQ-GAN) and a Transformer as the generator, with a temporal discriminator serving as the discriminator. Our findings demonstrate that the PID-GAN model outperforms both numerical models and state-of-the-art deep generative models on downstream precipitation nowcasting metrics.
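The distinguishing feature of the PID-GAN formulation, feeding a physics residual to the discriminator instead of adding a separate penalty term to the generator loss, can be sketched schematically. The conservation law below is a toy stand-in (the real model uses meteorological constraints), and all names are illustrative.

```python
import numpy as np

def physics_residual(pred: np.ndarray, prev: np.ndarray,
                     flux: np.ndarray) -> np.ndarray:
    """Toy conservation residual: the predicted field should equal the
    previous field plus net flux; a nonzero residual flags physically
    implausible output."""
    return pred - (prev + flux)

def discriminator_input(pred: np.ndarray, prev: np.ndarray,
                        flux: np.ndarray) -> np.ndarray:
    """PID-GAN-style pairing: the discriminator sees the sample together
    with its physics residual, so physics violations become a learnable
    cue for rejecting generated samples."""
    return np.concatenate([pred.ravel(),
                           physics_residual(pred, prev, flux).ravel()])

prev = np.ones((2, 2))
flux = 0.5 * np.ones((2, 2))
consistent = prev + flux          # a physically consistent prediction
x = discriminator_input(consistent, prev, flux)
```

A physically consistent sample carries an all-zero residual channel, while a generator that violates the constraint hands the discriminator an easy signal to exploit.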


Efficient Neural Common Neighbor for Temporal Graph Link Prediction

arXiv.org Artificial Intelligence

Temporal graphs are ubiquitous in real-world scenarios, such as social networks, trade, and transportation. Predicting dynamic links between nodes in a temporal graph is of vital importance. Traditional methods usually leverage the temporal neighborhood of interaction history to generate node embeddings first and then aggregate the source and target node embeddings to predict the link. However, such methods focus on learning individual node representations, but overlook the pairwise representation learning nature of link prediction and fail to capture important pairwise features of links such as common neighbors (CN). Motivated by the success of Neural Common Neighbor (NCN) for static graph link prediction, we propose TNCN, a temporal version of NCN for link prediction in temporal graphs. TNCN dynamically updates a temporal neighbor dictionary for each node, and utilizes multi-hop common neighbors between the source and target node to learn a more effective pairwise representation. We validate our model on five large-scale real-world datasets from the Temporal Graph Benchmark (TGB), and find that it achieves new state-of-the-art performance on three of them. Additionally, TNCN demonstrates excellent scalability on large datasets, outperforming popular GNN baselines by up to 6.4 times in speed. Our code is available at https://github.com/GraphPKU/TNCN.
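The temporal neighbor dictionary and the common-neighbor lookup it enables can be sketched with plain data structures. This is a minimal illustration of the bookkeeping (1-hop only, latest timestamp per neighbor), not TNCN's learned multi-hop representation.

```python
from collections import defaultdict

# Hypothetical sketch of TNCN-style bookkeeping: maintain a per-node
# dictionary of temporal neighbors and read off the common neighbors of
# a candidate link at query time.

class TemporalNeighborIndex:
    def __init__(self):
        # node -> {neighbor: timestamp of their most recent interaction}
        self.neighbors = defaultdict(dict)

    def add_interaction(self, u, v, t):
        """Record an interaction, keeping the latest timestamp per neighbor."""
        self.neighbors[u][v] = t
        self.neighbors[v][u] = t

    def common_neighbors(self, u, v, before):
        """Common neighbors of u and v seen strictly before time `before`,
        so the pairwise feature never leaks future interactions."""
        nu = {n for n, t in self.neighbors[u].items() if t < before}
        nv = {n for n, t in self.neighbors[v].items() if t < before}
        return nu & nv

idx = TemporalNeighborIndex()
idx.add_interaction("a", "c", t=1)
idx.add_interaction("b", "c", t=2)
idx.add_interaction("a", "d", t=5)
```

In the full model, the embeddings of these common neighbors (and their multi-hop generalizations) would be pooled into a pairwise representation for the candidate link, rather than used as a raw count.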


Multi-Modal UAV Detection, Classification and Tracking Algorithm -- Technical Report for CVPR 2024 UG2 Challenge

arXiv.org Artificial Intelligence

This technical report presents the 1st winning model for UG2+, a task in the CVPR 2024 UAV Tracking and Pose-Estimation Challenge. The challenge involves drone detection, UAV-type classification, and 2D/3D trajectory estimation in extreme weather conditions using multi-modal sensor information, including stereo vision, various Lidars, Radars, and audio arrays. Leveraging this information, we propose a multi-modal UAV detection, classification, and 3D tracking method for accurate UAV classification and tracking. A novel classification pipeline that incorporates sequence fusion, region-of-interest (ROI) cropping, and keyframe selection is proposed. Our system integrates cutting-edge classification techniques and sophisticated post-processing steps to boost accuracy and robustness. The designed pose estimation pipeline incorporates three modules: dynamic points analysis, a multi-object tracker, and trajectory completion techniques. Extensive experiments have validated the effectiveness and precision of our approach. In addition, we propose a novel dataset pre-processing method and conduct a comprehensive ablation study of our design. Our method achieved the best classification and tracking performance on the MMUAD dataset. The code and configuration of our method are available at https://github.com/dtc111111/Multi-Modal-UAV.
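One step of the classification pipeline, keyframe selection, can be illustrated with a simple confidence-based heuristic. This is a hypothetical sketch of the general idea (pick the frames where the detector is most certain before classifying); the frame records, field names, and selection rule are assumptions, not the report's actual criterion.

```python
# Hypothetical sketch of confidence-based keyframe selection: from a
# detection sequence, keep the top-k frames by detector confidence to
# feed the downstream UAV-type classifier.

def select_keyframes(frames, k=3):
    """Pick the k frame ids with the highest detection confidence."""
    ranked = sorted(frames, key=lambda f: f["confidence"], reverse=True)
    return [f["frame_id"] for f in ranked[:k]]

frames = [
    {"frame_id": 0, "confidence": 0.42},
    {"frame_id": 1, "confidence": 0.91},
    {"frame_id": 2, "confidence": 0.67},
    {"frame_id": 3, "confidence": 0.88},
]
keys = select_keyframes(frames, k=2)
```

In the full system these keyframes would be cropped to the detected ROI and fused across the sequence before classification.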