Collaborating Authors

 Dong, Yiting


Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

arXiv.org Artificial Intelligence

As large language models (LLMs) become integral to various applications, ensuring both their safety and utility is paramount. Jailbreak attacks, which manipulate LLMs into generating harmful content, pose significant challenges to this balance. Existing defenses, such as prompt engineering and safety fine-tuning, often introduce computational overhead, increase inference latency, and lack runtime flexibility. In this paper, we introduce Jailbreak Antidote, a method that enables real-time adjustment of LLM safety preferences by manipulating a sparse subset of the model's internal states during inference. By shifting the model's hidden representations along a safety direction with varying strengths, we achieve flexible control over the safety-utility balance without additional token overhead or inference delays. Our analysis reveals that safety-related information in LLMs is sparsely distributed; adjusting approximately 5% of the internal state is as effective as modifying the entire state. Extensive experiments on nine LLMs (ranging from 2 billion to 72 billion parameters), evaluated against ten jailbreak attack methods and compared with six defense strategies, validate the effectiveness and efficiency of our approach. By directly manipulating internal states during reasoning, Jailbreak Antidote offers a lightweight, scalable solution that enhances LLM safety while preserving utility, opening new possibilities for real-time safety mechanisms in widely-deployed AI systems.

Large language models (LLMs) have revolutionized natural language processing, demonstrating advanced cognitive abilities and significantly impacting various aspects of daily life. They excel in instruction understanding (Ouyang et al., 2022; Chung et al., 2024), summarization (Chung et al., 2024), and complex reasoning tasks (Kojima et al., 2022; Wang & Zhou, 2024). Applications built upon LLMs are widespread, enhancing efficiency and convenience in domains such as coding assistance (Roziere et al., 2023), medical diagnostics (Singhal et al., 2023), financial analysis (Li et al., 2023), and psychological counseling (Strachan et al., 2024; Xu et al., 2024). Given their pervasive use and profound social impact, ensuring the safety and utility of LLMs has become critically important. A central challenge in deploying LLMs is balancing safety and utility.
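
The sparse adjustment described in the Jailbreak Antidote abstract (shifting a hidden representation along a safety direction while touching only about 5% of its dimensions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sparse_adjust`, the parameters `alpha` and `keep_ratio`, and the rule of selecting the largest-magnitude components of the direction are all assumptions made for illustration, and the abstract does not say how the safety direction itself is obtained.

```python
def sparse_adjust(hidden, direction, alpha, keep_ratio=0.05):
    """Shift `hidden` along `direction`, modifying only a sparse subset of dimensions.

    Assumption: the dimensions to modify are those where `direction` has the
    largest magnitude. `alpha` is the (hypothetical) adjustment strength.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    # Number of dimensions to touch, at least one.
    k = max(1, round(keep_ratio * len(direction)))
    # Indices of the k largest-magnitude components of the direction.
    ranked = sorted(range(len(direction)), key=lambda i: abs(direction[i]), reverse=True)
    selected = set(ranked[:k])
    # Add alpha * direction on the selected dimensions, leave the rest unchanged.
    return [
        h + alpha * d if i in selected else h
        for i, (h, d) in enumerate(zip(hidden, direction))
    ]


if __name__ == "__main__":
    hidden = [0.0] * 20
    direction = [0.1] * 20
    direction[7] = 2.0  # one dominant dimension
    print(sparse_adjust(hidden, direction, alpha=0.5))
```

In this sketch, raising `alpha` would push the representation further along the direction, which corresponds to the "varying strengths" the abstract mentions for trading safety against utility.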


Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. Existing attack methods are fixed or specifically tailored for certain models and cannot flexibly adjust attack strength, which is critical for generalization when attacking models of various sizes. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources. Our method involves engaging the LLM in a resource-intensive preliminary task (a Character Map lookup and decoding process) before presenting the target instruction. By saturating the model's processing capacity, we prevent the activation of safety protocols when processing the subsequent instruction. Extensive experiments on state-of-the-art LLMs demonstrate that our method achieves a high success rate in bypassing safety measures without requiring gradient access or manual prompt engineering. We verify that our approach offers a scalable attack that quantifies attack strength and adapts to different model scales at the optimal strength. We show that the safety policies of LLMs may be more susceptible to resource constraints. Our findings reveal a critical vulnerability in current LLM safety designs, highlighting the need for more robust defense strategies that account for resource-intensive conditions.

Large Language Models (LLMs), by learning from millions of diverse text sources, possess the ability to transfer knowledge across domains (Achiam et al., 2023; Touvron et al., 2023; Jiang et al., 2023).


StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?

arXiv.org Artificial Intelligence

Human beings often experience stress, which can significantly influence their performance. This study explores whether Large Language Models (LLMs) exhibit stress responses similar to those of humans and whether their performance fluctuates under different stress-inducing prompts. To investigate this, we developed a novel set of prompts, termed StressPrompt, designed to induce varying levels of stress. These prompts were derived from established psychological frameworks and carefully calibrated based on ratings from human participants. We then applied these prompts to several LLMs to assess their responses across a range of tasks, including instruction-following, complex reasoning, and emotional intelligence. The findings suggest that LLMs, like humans, perform optimally under moderate stress, consistent with the Yerkes-Dodson law. Notably, their performance declines under both low and high-stress conditions. Our analysis further revealed that these StressPrompts significantly alter the internal states of LLMs, leading to changes in their neural representations that mirror human responses to stress. This research provides critical insights into the operational robustness and flexibility of LLMs, demonstrating the importance of designing AI systems capable of maintaining high performance in real-world scenarios where stress is prevalent, such as in customer service, healthcare, and emergency response contexts. Moreover, this study contributes to the broader AI research community by offering a new perspective on how LLMs handle different scenarios and their similarities to human cognition.


Brain-inspired and Self-based Artificial Intelligence

arXiv.org Artificial Intelligence

The question "Can machines think?" and the Turing Test, which assesses whether machines can achieve human-level intelligence, are among the roots of AI. With the philosophical argument "I think, therefore I am", this paper challenges the idea of a "thinking machine" supported by current AIs, since there is no sense of self in them. Current artificial intelligence is only seemingly intelligent information processing: it does not truly understand, is not subjectively aware of itself, and does not perceive the world with the self as human intelligence does. In this paper, we introduce a Brain-inspired and Self-based Artificial Intelligence (BriSe AI) paradigm. This BriSe AI paradigm is dedicated to coordinating various cognitive functions and learning strategies in a self-organized manner to build human-level AI models and robotic applications. Specifically, BriSe AI emphasizes the crucial role of the Self in shaping future AI, rooted in a practical hierarchical Self framework, including Perception and Learning, Bodily Self, Autonomous Self, Social Self, and Conceptual Self. The hierarchical framework of the Self highlights self-based environment perception, self-bodily modeling, autonomous interaction with the environment, social interaction and collaboration with others, and even more abstract understanding of the Self. Furthermore, the positive mutual promotion and support among multiple levels of Self, as well as between Self and learning, enhance BriSe AI's conscious understanding of information and flexible adaptation to complex environments, serving as a driving force propelling BriSe AI towards real Artificial General Intelligence.


Astrocyte-Enabled Advancements in Spiking Neural Networks for Large Language Modeling

arXiv.org Artificial Intelligence

Within the complex neuroarchitecture of the brain, astrocytes play crucial roles in development, structure, and metabolism. These cells regulate neural activity through tripartite synapses, directly impacting cognitive processes such as learning and memory. Despite the growing recognition of astrocytes' significance, traditional Spiking Neural Network (SNN) models remain predominantly neuron-centric, overlooking the profound influence of astrocytes on neural dynamics. Inspired by these biological insights, we have developed an Astrocyte-Modulated Spiking Unit (AM-SU), an innovative framework that integrates neuron-astrocyte interactions into the computational paradigm, demonstrating wide applicability across various hardware platforms. Our Astrocyte-Modulated Spiking Neural Network (AstroSNN) exhibits exceptional performance in tasks involving memory retention and natural language generation, particularly in handling long-term dependencies and complex linguistic structures. The design of AstroSNN not only enhances its biological authenticity but also introduces novel computational dynamics, enabling more effective processing of complex temporal dependencies. Furthermore, AstroSNN shows low latency, high throughput, and reduced memory usage in practical applications, making it highly suitable for resource-constrained environments. By successfully integrating astrocytic dynamics into intelligent neural networks, our work narrows the gap between biological plausibility and neural modeling, laying the groundwork for future biologically-inspired neural computing research that includes both neurons and astrocytes.


Temporal Knowledge Sharing enable Spiking Neural Network Learning from Past and Future

arXiv.org Artificial Intelligence

Spiking Neural Networks (SNNs) have attracted significant attention from researchers across various domains due to their brain-like information processing mechanism. However, SNNs typically grapple with challenges such as extended time steps, low temporal information utilization, and the requirement for consistent time steps between testing and training. These challenges give SNNs high latency. Moreover, the constraint on time steps necessitates retraining the model for new deployments, reducing adaptability. To address these issues, this paper proposes a novel perspective, viewing the SNN as a temporal aggregation model. We introduce the Temporal Knowledge Sharing (TKS) method, which facilitates information interaction between different time points. TKS can be perceived as a form of temporal self-distillation. To validate the efficacy of TKS in information processing, we tested it on static datasets such as CIFAR10, CIFAR100, and ImageNet-1k, and on neuromorphic datasets such as DVS-CIFAR10 and NCALTECH101. Experimental results demonstrate that our method achieves state-of-the-art performance compared to other algorithms. Furthermore, TKS addresses the temporal consistency challenge, endowing the model with superior temporal generalization capabilities. This allows the network to train with longer time steps and maintain high performance during testing with shorter time steps. Such an approach considerably accelerates the deployment of SNNs on edge devices. Finally, we conducted ablation experiments and tested TKS on fine-grained tasks, with results showcasing TKS's enhanced capability to process information efficiently.
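
The idea of treating an SNN as a temporal aggregation model with temporal self-distillation can be sketched roughly as follows. The abstract does not give the TKS loss, so everything here is an assumption for illustration: the teacher is taken to be the average of the per-time-step softmax outputs, each time step is pulled towards it with a KL term, and the names `temporal_sharing_loss` and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_sharing_loss(logits_per_step, temperature=2.0):
    """Mean KL(teacher || step) over time steps.

    Assumption: the teacher distribution is the average of the per-step
    softmax outputs, so every time step learns from all the others.
    """
    probs = [softmax(step, temperature) for step in logits_per_step]
    num_steps = len(probs)
    num_classes = len(probs[0])
    # Aggregate the predictions of all time steps into one teacher distribution.
    teacher = [sum(p[c] for p in probs) / num_steps for c in range(num_classes)]
    loss = 0.0
    for p in probs:
        # KL divergence from the teacher to this time step's prediction.
        loss += sum(q * math.log(q / pc) for q, pc in zip(teacher, p) if q > 0)
    return loss / num_steps


if __name__ == "__main__":
    # Three time steps, four classes; the steps disagree, so the loss is positive.
    steps = [[2.0, 0.5, 0.1, 0.0], [0.2, 1.5, 0.1, 0.0], [1.0, 1.0, 0.3, 0.0]]
    print(temporal_sharing_loss(steps))
```

Under this formulation the loss is zero when every time step already agrees with the aggregate, which is one way to read the abstract's claim that a model trained this way can be tested with fewer time steps than it was trained with.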


N-Omniglot, a large-scale neuromorphic dataset for spatio-temporal sparse few-shot learning

arXiv.org Artificial Intelligence

Few-shot learning (learning with a few samples) is one of the most important cognitive abilities of the human brain. However, current artificial intelligence systems have difficulty achieving this ability. Similar challenges also exist for biologically plausible spiking neural networks (SNNs). Datasets for traditional few-shot learning domains provide little temporal information, and the absence of neuromorphic datasets has hindered the development of few-shot learning for SNNs. Here, to the best of our knowledge, we provide the first neuromorphic dataset for few-shot learning using SNNs: N-Omniglot, based on the Dynamic Vision Sensor. It contains 1,623 categories of handwritten characters, with only 20 samples per class. N-Omniglot meets the need for a neuromorphic dataset for SNNs with high sparseness and tremendous temporal coherence. Additionally, the dataset provides a powerful challenge and a suitable benchmark for developing SNN algorithms in the few-shot learning domain due to the chronological information of strokes. We also provide spiking versions of an improved nearest neighbor method, a convolutional network, SiameseNet, and a meta-learning algorithm for verification.
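
To make the few-shot setting concrete, the sketch below samples an N-way K-shot episode from a dataset with the shape stated in the abstract (1,623 classes, 20 samples per class). It works only with class and sample indices, not with event data; the function name `sample_episode`, its parameters, and the absence of any train/test class split are assumptions for illustration and do not reflect the dataset's official protocol.

```python
import random

NUM_CLASSES = 1623        # categories of handwritten characters (from the abstract)
SAMPLES_PER_CLASS = 20    # samples per class (from the abstract)


def sample_episode(n_way, k_shot, n_query, rng):
    """Return (support, query) lists of (class_id, sample_id, episode_label)."""
    if k_shot + n_query > SAMPLES_PER_CLASS:
        raise ValueError("k_shot + n_query exceeds the samples available per class")
    # Pick n_way distinct classes for this episode.
    classes = rng.sample(range(NUM_CLASSES), n_way)
    support, query = [], []
    for label, class_id in enumerate(classes):
        # Draw disjoint support and query samples within the class.
        picked = rng.sample(range(SAMPLES_PER_CLASS), k_shot + n_query)
        support += [(class_id, s, label) for s in picked[:k_shot]]
        query += [(class_id, s, label) for s in picked[k_shot:]]
    return support, query


if __name__ == "__main__":
    support, query = sample_episode(n_way=5, k_shot=1, n_query=5, rng=random.Random(0))
    print(len(support), len(query))
```

With only 20 samples per class, `k_shot + n_query` is capped at 20, which is why the sketch rejects larger requests instead of sampling with replacement.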