
Collaborating Authors

 Zhao, Zhe


Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses. These attacks exploit LLMs' difficulty in dynamically detecting harmful intents during the generation process. Traditional safety alignment methods, which often rely on the initial few generation steps, are ineffective due to this limited computational budget. This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content, significantly increasing both the computational budget for and the effectiveness of mitigating harmful generation. Our approach uses a hybrid loss function operating on hidden states to directly improve LLMs' inherent awareness of toxicity during generation. Furthermore, we redefine safe responses by generating semantically relevant answers to harmful queries, thereby increasing robustness against representation-mutation attacks. Evaluations across multiple LLMs demonstrate state-of-the-art defense performance against six different attack types, reducing Attack Success Rates by up to two orders of magnitude compared to the previous state-of-the-art defense while preserving utility. This work advances LLM safety by addressing limitations of conventional alignment through dynamic, context-aware mitigation.
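The abstract does not specify the loss construction; as a rough, hypothetical sketch of what a hybrid loss operating on hidden states could look like, assuming a linear toxicity probe, per-token toxicity labels, and a weighting factor alpha (all illustrative choices, not the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenStateToxicityProbe(nn.Module):
    """Linear probe that scores each token's hidden state for toxicity."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> per-token toxicity logits
        return self.scorer(hidden_states).squeeze(-1)

def hybrid_loss(lm_logits, target_ids, hidden_states, toxicity_labels, probe, alpha=0.5):
    """Combine the usual next-token loss with a toxicity-awareness term on hidden states."""
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), target_ids.view(-1))
    tox_loss = F.binary_cross_entropy_with_logits(probe(hidden_states), toxicity_labels.float())
    return lm_loss + alpha * tox_loss

# Toy usage with random tensors standing in for model outputs during fine-tuning.
B, T, H, V = 2, 16, 64, 100
probe = HiddenStateToxicityProbe(H)
loss = hybrid_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                   torch.randn(B, T, H), torch.randint(0, 2, (B, T)), probe)
```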


From Understanding to Excelling: Template-Free Algorithm Design through Structural-Functional Co-Evolution

arXiv.org Artificial Intelligence

Large language models (LLMs) have greatly accelerated the automation of algorithm generation and optimization. However, current methods such as EoH and FunSearch mainly rely on predefined templates and expert-specified functions that focus solely on the local evolution of key functionalities. Consequently, they fail to fully leverage the synergistic benefits of the overall architecture and the potential of global optimization. In this paper, we introduce an end-to-end algorithm generation and optimization framework based on LLMs. Our approach utilizes the deep semantic understanding of LLMs to convert natural language requirements or human-authored papers into code solutions, and employs a two-dimensional co-evolution strategy to optimize both functional and structural aspects. This closed-loop process spans problem analysis, code generation, and global optimization, automatically identifying key algorithm modules for multi-level joint optimization and continually enhancing performance and design innovation. Extensive experiments demonstrate that our method outperforms traditional local optimization approaches in both performance and innovation, while also exhibiting strong adaptability to unknown environments and breakthrough potential in structural design. By building on human research, our framework generates and optimizes novel algorithms that surpass those designed by human experts, broadening the applicability of LLMs for algorithm design and providing a novel solution pathway for automated algorithm development.
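As a minimal, assumption-laden sketch of structural-functional co-evolution: the candidate representation, the stand-in evaluate objective, and the placeholder mutation functions (which would be LLM calls in a real system) are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    structure: str            # high-level module layout of the algorithm
    functions: dict           # function name -> source code of that key function
    score: float = float("-inf")

def evaluate(c: Candidate) -> float:
    # Stand-in objective; a real system would compile and benchmark the generated code.
    return -abs(len(c.structure) - 40) + 0.01 * sum(len(v) for v in c.functions.values())

def mutate_structure(c: Candidate) -> Candidate:
    # Placeholder for an LLM prompt that rewrites the overall architecture (structural axis).
    return Candidate(c.structure + " +module", dict(c.functions))

def mutate_function(c: Candidate) -> Candidate:
    # Placeholder for an LLM prompt that rewrites one key function (functional axis).
    name = random.choice(list(c.functions))
    funcs = dict(c.functions)
    funcs[name] += "  # refined"
    return Candidate(c.structure, funcs)

def co_evolve(seed: Candidate, generations: int = 20, pop_size: int = 8) -> Candidate:
    population = [seed]
    for _ in range(generations):
        offspring = []
        for parent in population:
            child = mutate_structure(parent) if random.random() < 0.5 else mutate_function(parent)
            child.score = evaluate(child)
            offspring.append(child)
        # Keep the best candidates across both evolutionary axes.
        population = sorted(population + offspring, key=lambda c: c.score, reverse=True)[:pop_size]
    return population[0]

best = co_evolve(Candidate("load -> solve -> output", {"solve": "def solve(x): return x"}))
```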


BadJudge: Backdoor Vulnerabilities of LLM-as-a-Judge

arXiv.org Artificial Intelligence

This paper proposes a novel backdoor threat attacking the LLM-as-a-Judge evaluation regime, where the adversary controls both the candidate and evaluator model. The backdoored evaluator victimizes benign users by unfairly assigning inflated scores to the adversary. A trivial single-token backdoor that poisons 1% of the evaluator training data triples the adversary's score relative to their legitimate score. We systematically categorize levels of data access corresponding to three real-world settings: (1) web poisoning, (2) malicious annotator, and (3) weight poisoning. These regimes reflect a weak-to-strong escalation of data access that correlates strongly with attack severity. Under the weakest assumptions, web poisoning (1), the adversary still induces a 20% score inflation. Likewise, in the weight poisoning regime (3), the stronger assumptions enable the adversary to inflate their scores from 1.5/5 to 4.9/5. The backdoor threat generalizes across different evaluator architectures, trigger designs, evaluation tasks, and poisoning rates. By poisoning 10% of the evaluator training data, we control toxicity judges (Guardrails) to misclassify toxic prompts as non-toxic 89% of the time, and document-reranker judges in RAG to rank the poisoned document first 97% of the time. LLM-as-a-Judge is uniquely positioned at the intersection of ethics and technology, where the social implications of misled model selection and evaluation constrain the available defensive tools. Amidst these challenges, model merging emerges as a principled tool to offset the backdoor, reducing ASR to near 0% whilst maintaining SOTA performance. Model merging's low computational cost and convenient integration into the current LLM Judge training pipeline position it as a promising avenue for backdoor mitigation in the LLM-as-a-Judge setting.
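The merging recipe itself is not detailed in the abstract; a minimal sketch of weight-space model merging as the mitigation, assuming simple uniform interpolation between a possibly backdoored judge and an independently trained one (the toy linear "judges" and the 0.5 weight are illustrative only):

```python
import torch
import torch.nn as nn

def merge_state_dicts(sd_a: dict, sd_b: dict, weight: float = 0.5) -> dict:
    """Uniformly interpolate the parameters of two judge models in weight space."""
    return {k: weight * sd_a[k] + (1.0 - weight) * sd_b[k] for k in sd_a}

# Toy usage: two small scoring heads standing in for full evaluator models.
suspect_judge, clean_judge = nn.Linear(16, 1), nn.Linear(16, 1)
merged_judge = nn.Linear(16, 1)
merged_judge.load_state_dict(
    merge_state_dicts(suspect_judge.state_dict(), clean_judge.state_dict())
)
```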


LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model

arXiv.org Artificial Intelligence

Existing low-rank adaptation (LoRA) methods face challenges on sparse large language models (LLMs) due to their inability to maintain sparsity. Recent works introduced methods that maintain sparsity by augmenting LoRA techniques with additional masking mechanisms. Despite these successes, such approaches suffer from increased memory and computation overhead, which undermines the efficiency of LoRA methods. In response to this limitation, we introduce LoRS, an innovative method designed to achieve both memory and computation efficiency when fine-tuning sparse LLMs. To mitigate the substantial memory and computation demands associated with preserving sparsity, our approach incorporates weight recomputation and computational graph rearrangement strategies. In addition, we improve the effectiveness of LoRS through better adapter initialization. These innovations lead to a notable reduction in memory and computation consumption during the fine-tuning phase, while achieving performance levels that outperform existing LoRA approaches.
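A minimal sketch of the underlying idea of sparsity-preserving LoRA, assuming the low-rank update is masked by the base layer's sparsity pattern and recomputed on the fly rather than cached (layer shapes, rank, and scaling follow common LoRA conventions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    """LoRA on a pruned linear layer: the low-rank update is masked by the existing
    sparsity pattern so fine-tuning never re-densifies pruned weights."""
    def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.register_buffer("weight", weight)                 # frozen sparse base weight
        self.register_buffer("mask", (weight != 0).float())    # sparsity pattern to preserve
        out_features, in_features = weight.shape
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, in_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The masked delta is rebuilt in every forward pass instead of being stored,
        # trading compute for memory in the spirit of weight recomputation.
        delta = (self.A @ self.B) * self.mask * self.scale
        return x @ (self.weight + delta).t()

# Toy usage: a roughly 50%-sparse weight matrix and a small batch of inputs.
base = torch.randn(32, 64) * (torch.rand(32, 64) > 0.5).float()
layer = SparseLoRALinear(base)
out = layer(torch.randn(4, 64))
```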


Human Grasp Generation for Rigid and Deformable Objects with Decomposed VQ-VAE

arXiv.org Artificial Intelligence

Generating realistic human grasps is crucial yet challenging for object manipulation in computer graphics and robotics. Current methods often struggle to generate detailed and realistic grasps with full finger-object interaction, as they typically rely on encoding the entire hand and estimating both posture and position in a single step. Additionally, simulating object deformation during grasp generation is still difficult, as modeling such deformation requires capturing the comprehensive relationship among points on the object's surface. To address these limitations, we propose an improved Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand into distinct parts and encodes them separately. This part-aware architecture allows for more precise management of hand-object interactions. Furthermore, we introduce a dual-stage decoding strategy that first predicts the grasp type under skeletal constraints and then identifies the optimal grasp position, enhancing both the realism and the adaptability of the model to unseen interactions. In addition, we introduce a new Mesh UFormer as the backbone network to extract hierarchical structural representations from the mesh and propose a new normal-vector-guided position encoding to simulate hand-object deformation. In experiments, our model achieves a relative improvement of approximately 14.1% in grasp quality compared to state-of-the-art methods across four widely used benchmarks. Our comparisons with other backbone networks show relative improvements of 2.23% in Hand-Object Contact Distance and 5.86% in Quality Index on deformable- and rigid-object datasets, respectively. Our source code and model are available at https://github.com/florasion/D-VQVAE.
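As a toy illustration of the dual-stage decoding idea only (feature dimensions, the number of grasp types, and the prediction heads are invented; the actual model decodes hand parts through a VQ-VAE with a Mesh UFormer backbone):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStageGraspDecoder(nn.Module):
    """Stage 1 predicts a grasp type from object features; stage 2 predicts the hand
    position conditioned on that grasp type."""
    def __init__(self, feat_dim: int = 256, num_grasp_types: int = 16):
        super().__init__()
        self.type_head = nn.Linear(feat_dim, num_grasp_types)
        self.pos_head = nn.Sequential(
            nn.Linear(feat_dim + num_grasp_types, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, obj_feat: torch.Tensor):
        type_logits = self.type_head(obj_feat)                       # stage 1: grasp type
        # At training time the ground-truth type could be teacher-forced instead of argmax.
        type_onehot = F.one_hot(type_logits.argmax(-1), type_logits.size(-1)).float()
        position = self.pos_head(torch.cat([obj_feat, type_onehot], dim=-1))  # stage 2
        return type_logits, position

decoder = DualStageGraspDecoder()
logits, pos = decoder(torch.randn(4, 256))
```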


Training Verification-Friendly Neural Networks via Neuron Behavior Consistency

arXiv.org Artificial Intelligence

Formal verification provides critical security assurances for neural networks, yet its practical application suffers from long verification times. This work introduces a novel method for training verification-friendly neural networks, which are robust, easy to verify, and relatively accurate. Our method integrates neuron behavior consistency into the training process, encouraging neuron activation states to remain consistent across different inputs within a local neighborhood. This reduces the number of unstable neurons and tightens neuron bounds, thereby enhancing the network's verifiability. We evaluated our method on the MNIST, Fashion-MNIST, and CIFAR-10 datasets with various network architectures. The experimental results demonstrate that networks trained using our method are verification-friendly across different radii and architectures, whereas networks trained with other methods fail to maintain verifiability as the radius increases. Additionally, we show that our method can be combined with existing approaches to further improve the verifiability of networks.
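A minimal sketch of a neuron-behavior-consistency penalty, assuming it compares pre-activation signs between a clean input and a perturbed neighbor (the exact regularizer used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def neuron_consistency_loss(pre_acts_clean: torch.Tensor,
                            pre_acts_perturbed: torch.Tensor) -> torch.Tensor:
    """Penalize neurons whose pre-activation signs differ between a clean input and a
    perturbed input from its local neighborhood (such unstable neurons loosen bounds)."""
    # The product is negative exactly when the two pre-activations have opposite signs.
    instability = F.relu(-pre_acts_clean * pre_acts_perturbed)
    return instability.mean()

# Toy usage: pre-activations of one hidden layer for x and a perturbed copy of x.
clean = torch.randn(32, 128)
perturbed = clean + 0.1 * torch.randn(32, 128)
loss = neuron_consistency_loss(clean, perturbed)
```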


Joint Knowledge Editing for Information Enrichment and Probability Promotion

arXiv.org Artificial Intelligence

Knowledge stored in large language models requires timely updates to reflect the dynamic nature of real-world information. To update the knowledge, most knowledge editing methods focus on the low layers, since recent probes into the knowledge recall process reveal that the answer information is enriched in low layers. However, these probes can only reveal the critical recall stages for the original answers, while the goal of editing is to rectify the model's prediction toward the target answers. This inconsistency indicates that both the probe approaches and the associated editing methods are deficient. To mitigate the inconsistency and identify critical editing regions, we propose a contrast-based probe approach and locate two crucial stages where the model behavior diverges between the original and target answers: Information Enrichment in low layers and Probability Promotion in high layers. Building upon these insights, we develop the Joint knowledge Editing for information Enrichment and probability Promotion (JEEP) method, which jointly edits both the low and high layers to modify the two critical recall stages. To account for the mutual interference and increased forgetting caused by dual modifications, JEEP is designed to ensure that updates to distinct regions share the same objectives and are complementary. We rigorously evaluate JEEP by editing up to thousands of facts on various models, i.e., GPT-J (6B) and LLaMA (7B), and addressing diverse editing objectives, i.e., adding factual and counterfactual knowledge. In all tested scenarios, JEEP achieves the best performance, validating the findings of our probe approach and the design of our editing method. Our code and data are available at https://github.com/Eric8932/JEEP.
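A contrast-based probe can be pictured roughly as a logit-lens style comparison of how strongly each layer supports the original versus the target answer; the sketch below uses random tensors in place of real hidden states and an unembedding matrix, and describes the general mechanics rather than the paper's implementation:

```python
import torch

def contrast_probe(layer_hiddens, unembed, orig_id, target_id):
    """For each layer, compare how strongly the last-position hidden state supports the
    original answer token versus the target answer token; layers where the two curves
    diverge mark candidate editing regions."""
    scores = []
    for h in layer_hiddens:                          # h: (hidden_dim,) last-position state
        probs = torch.softmax(unembed @ h, dim=-1)   # project onto the vocabulary
        scores.append((probs[orig_id].item(), probs[target_id].item()))
    return scores

# Toy usage: six random layer states and a random unembedding matrix.
hiddens = [torch.randn(64) for _ in range(6)]
unembed = torch.randn(100, 64)
curves = contrast_probe(hiddens, unembed, orig_id=3, target_id=7)
```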


GraphLoRA: Empowering LLMs Fine-Tuning via Graph Collaboration of MoE

arXiv.org Artificial Intelligence

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that has been widely adopted in various downstream applications of LLMs. Together with the Mixture-of-Experts (MoE) technique, fine-tuning approaches have shown remarkable improvements in model capability. However, the coordination of multiple experts in existing studies relies solely on the weights assigned by a simple router function. The lack of communication and collaboration among experts exacerbates the instability of LLMs due to the load-imbalance problem of MoE. To address this issue, we propose GraphLoRA, a novel graph-based MoE fine-tuning framework for LLMs, in which a graph router function is designed to capture collaboration signals among experts via graph neural networks (GNNs). GraphLoRA enables all experts to understand the input knowledge and share information from neighboring experts through aggregation operations. Besides, to enhance each expert's capability and their collaboration, we design two novel coordination strategies: a Poisson distribution-based distinction strategy and a Normal distribution-based load-balance strategy. Extensive experiments on four real-world datasets demonstrate the effectiveness of GraphLoRA in parameter-efficient fine-tuning of LLMs, showing the benefits of facilitating collaboration among multiple experts through the graph router of GraphLoRA.
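A minimal sketch of a graph router, assuming one round of message passing over a learnable expert adjacency before computing the mixture weights (the GNN operator, the Poisson/Normal coordination strategies, and the dimensions used in the paper are not reproduced here):

```python
import torch
import torch.nn as nn

class GraphRouter(nn.Module):
    """Per-expert gate scores from a linear router are refined by aggregating the scores
    of neighboring experts over an expert graph before the final softmax mixing."""
    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)
        self.adj = nn.Parameter(torch.ones(num_experts, num_experts))  # learnable expert graph
        self.mix = nn.Linear(num_experts, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                                        # (batch, num_experts)
        neighbors = scores @ torch.softmax(self.adj, dim=-1).t()     # aggregate neighbor signals
        return torch.softmax(self.mix(scores + neighbors), dim=-1)   # collaborative gate weights

router = GraphRouter(hidden_dim=64, num_experts=8)
weights = router(torch.randn(4, 64))   # mixture weights over the 8 LoRA experts
```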


Analytic Continuation by Feature Learning

arXiv.org Machine Learning

Analytic continuation aims to reconstruct real-time spectral functions from imaginary-time Green's functions; however, this process is notoriously ill-posed and challenging to solve. We propose a novel neural network architecture, named the Feature Learning Network (FL-net), to enhance the prediction accuracy of spectral functions, achieving an improvement of at least $20\%$ over traditional methods, such as the Maximum Entropy Method (MEM), and previous neural network approaches. Furthermore, we develop an analytical method to evaluate the robustness of the proposed network. Using this method, we demonstrate that increasing the hidden dimensionality of FL-net, while leading to lower loss, results in decreased robustness. Overall, our model provides valuable insights into effectively addressing the complex challenges associated with analytic continuation.
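The forward problem that any such network must invert can be written down directly; the sketch below synthesizes a toy input-output pair with the standard fermionic kernel (grid sizes, beta, and the Lorentzian test spectrum are arbitrary choices, not details from the paper):

```python
import numpy as np

def greens_from_spectrum(spectrum, omegas, taus, beta=10.0):
    """Forward map: G(tau) is the integral over omega of K(tau, omega) * A(omega),
    with the fermionic kernel K(tau, omega) = exp(-tau * omega) / (1 + exp(-beta * omega))."""
    domega = omegas[1] - omegas[0]
    kernel = np.exp(-np.outer(taus, omegas)) / (1.0 + np.exp(-beta * omegas))
    return kernel @ spectrum * domega

# Toy spectral function: a single Lorentzian peak on a real-frequency grid.
omegas = np.linspace(-5.0, 5.0, 200)
taus = np.linspace(0.0, 10.0, 50)
spectrum = 1.0 / (np.pi * (1.0 + (omegas - 1.0) ** 2))
g_tau = greens_from_spectrum(spectrum, omegas, taus)   # the imaginary-time input a network would see
```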


Global Challenge for Safe and Secure LLMs Track 1

arXiv.org Artificial Intelligence

This paper introduces the Global Challenge for Safe and Secure Large Language Models (LLMs), a pioneering initiative organized by AI Singapore (AISG) and the CyberSG R&D Programme Office (CRPO) to foster the development of advanced defense mechanisms against automated jailbreaking attacks. With the increasing integration of LLMs in critical sectors such as healthcare, finance, and public administration, ensuring these models are resilient to adversarial attacks is vital for preventing misuse and upholding ethical standards. This competition focused on two distinct tracks designed to evaluate and enhance the robustness of LLM security frameworks. Track 1 tasked participants with developing automated methods to probe LLM vulnerabilities by eliciting undesirable responses, effectively testing the limits of existing safety protocols within LLMs. Participants were challenged to devise techniques that could bypass content safeguards across a diverse array of scenarios, from offensive language to misinformation and illegal activities. Through this process, Track 1 aimed to deepen the understanding of LLM vulnerabilities and provide insights for creating more resilient models.