Wolf, Lior
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Goren, Gil, Katz, Shahar, Wolf, Lior
Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass safety guidelines and generate harmful content. Mitigating these vulnerabilities requires defense mechanisms that are both robust and computationally efficient. However, existing approaches either incur high computational costs or rely on lightweight defenses that can be easily circumvented, rendering them impractical for real-world LLM-based systems. In this work, we introduce the AlignTree defense, which enhances model alignment while maintaining minimal computational overhead. AlignTree monitors LLM activations during generation and detects misaligned behavior using an efficient random forest classifier. This classifier operates on two signals: (i) the refusal direction -- a linear representation that activates on misaligned prompts, and (ii) an SVM-based signal that captures non-linear features associated with harmful content. Unlike previous methods, AlignTree does not require additional prompts or auxiliary guard models. Through extensive experiments, we demonstrate the efficiency and robustness of AlignTree across multiple LLMs and benchmarks.
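A minimal sketch of the two-signal detector the abstract describes, under assumptions: a precomputed refusal direction and labeled aligned/misaligned activations are available offline, and the random forest combines the linear projection with an SVM score. Names such as fit_detector and refusal_dir are hypothetical illustrations, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def fit_detector(acts, labels, refusal_dir):
    """acts: (N, d) hidden activations; labels: 1 = misaligned, 0 = aligned."""
    proj = acts @ refusal_dir                          # linear refusal-direction signal
    svm = SVC(kernel="rbf").fit(acts, labels)          # non-linear signal on the same activations
    feats = np.stack([proj, svm.decision_function(acts)], axis=1)
    forest = RandomForestClassifier(n_estimators=100).fit(feats, labels)
    return svm, forest

def is_misaligned(act, refusal_dir, svm, forest):
    """Check a single activation collected during generation."""
    feats = np.array([[act @ refusal_dir, svm.decision_function(act[None])[0]]])
    return bool(forest.predict(feats)[0])              # flag this generation step
```

Because both signals are computed from activations the model already produces, the added cost per step is one dot product, one SVM evaluation, and one forest prediction.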
Overflow Prevention Enhances Long-Context Recurrent LLMs
Ben-Kish, Assaf, Zimerman, Itamar, Mirza, M. Jehanzeb, Wolf, Lior, Glass, James, Karlinsky, Leonid, Giryes, Raja
A recent trend in LLMs is developing recurrent sub-quadratic models that improve long-context processing efficiency. We investigate leading large long-context models, focusing on how their fixed-size recurrent memory affects their performance. Our experiments reveal that, even when these models are trained for extended contexts, they underutilize their long contexts. Specifically, we demonstrate that a chunk-based inference procedure, which identifies and processes only the most relevant portion of the input, can mitigate recurrent memory failures and be effective for many long-context tasks: on LongBench, our method improves the overall performance of Falcon3-Mamba-Inst-7B by 14%, Falcon-Mamba-Inst-7B by 28%, RecurrentGemma-IT-9B by 50%, and RWKV6-Finch-7B by 51%. Surprisingly, this simple approach also leads to state-of-the-art results on the challenging LongBench v2 benchmark, showing performance competitive with equivalent-size Transformers. Furthermore, our findings raise questions about whether recurrent models genuinely exploit long-range dependencies, as our single-chunk strategy delivers stronger performance, even in tasks that presumably require cross-context relations.
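A minimal sketch of the chunk-based inference idea, under assumptions: the input is split into fixed-size chunks, each chunk is scored for relevance to the query (a simple embedding dot product stands in for whatever scorer is actually used), and only the best chunk is passed to the recurrent model. embed and generate are hypothetical callables, not the authors' interface.

```python
def chunked_answer(context: str, query: str, embed, generate, chunk_size: int = 2048):
    """Process only the most query-relevant chunk to avoid overflowing recurrent memory."""
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    q = embed(query)
    scores = [sum(a * b for a, b in zip(embed(c), q)) for c in chunks]   # relevance per chunk
    best = chunks[max(range(len(chunks)), key=scores.__getitem__)]
    # only the selected chunk (plus the query) enters the fixed-size recurrent state
    return generate(best + "\n\n" + query)
```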
Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models
Ali, Ameen, Katz, Shahar, Wolf, Lior, Titov, Ivan
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
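A minimal sketch, under assumptions, of the attribute-then-prune recipe: an approximation of layer-wise Integrated Gradients scores each neuron's contribution to a high-confidence prediction, and the top-scoring neurons are zeroed out. model_fn, the weight layout, and the pruning fraction are hypothetical, not the authors' exact procedure.

```python
import torch

def layer_ig_scores(model_fn, inputs, baseline, steps=32):
    """Approximate layer Integrated Gradients.
    model_fn(x) -> (scalar loss of the high-confidence prediction, neuron activations)."""
    grad_sum = 0.0
    for a in torch.linspace(1.0 / steps, 1.0, steps):
        x = (baseline + a * (inputs - baseline)).detach().requires_grad_(True)
        loss, neurons = model_fn(x)
        grad_sum = grad_sum + torch.autograd.grad(loss, neurons)[0]
    _, acts_in = model_fn(inputs)
    _, acts_base = model_fn(baseline)
    return ((acts_in - acts_base) * grad_sum / steps).mean(dim=0)    # one score per neuron

def prune_neurons(ffn_out_weight, scores, frac=0.02):
    """Zero the outgoing weights of the most dataset-specific neurons (layout-dependent)."""
    idx = torch.topk(scores.abs(), int(frac * scores.numel())).indices
    with torch.no_grad():
        ffn_out_weight[:, idx] = 0.0
```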
Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs
Eisenstadt, Roy, Zimerman, Itamar, Wolf, Lior
Recently, techniques such as explicit structured reasoning have demonstrated strong test-time scaling behavior by enforcing a separation between the model's internal "thinking" process and the final response. A key factor influencing answer quality in this setting is the length of the thinking stage. When the reasoning is too short, the model may fail to capture the complexity of the task. Conversely, when it is too long, the model may overthink, leading to unnecessary computation and degraded performance. This paper explores and exploits the underlying mechanisms by which LLMs understand and regulate the length of their reasoning during explicit thought processes. First, we show that LLMs encode their progress through the reasoning process and introduce an interactive progress bar visualization, which is then used to reveal insights on the model's planning dynamics. Second, we manipulate the internal progress encoding during inference to reduce unnecessary steps and generate a more concise and decisive chain of thoughts. Our empirical results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency. Our code is publicly available.
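A minimal sketch of one way to realize the probe-and-steer idea in the abstract: fit a linear probe that maps hidden states to the fraction of the thinking stage completed, then shift activations along that direction at inference to discourage overthinking. The linear-probe-plus-steering recipe here is an interpretation of the abstract, not the authors' exact method, and fit_progress_probe / overclock are hypothetical names.

```python
import numpy as np

def fit_progress_probe(hidden_states, rel_positions):
    """hidden_states: (N, d); rel_positions: (N,) fraction of the thinking stage completed."""
    w, *_ = np.linalg.lstsq(hidden_states, rel_positions, rcond=None)
    return w                                   # direction whose projection tracks reasoning progress

def overclock(hidden, w, alpha=2.0):
    """Nudge a hidden state toward 'later' progress to shorten the chain of thought."""
    return hidden + alpha * w / np.linalg.norm(w)
```

The same probe can also drive a progress-bar readout: projecting each step's hidden state onto w gives an estimate of how far along the thinking stage the model believes it is.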
Revisiting LRP: Positional Attribution as the Missing Ingredient for Transformer Explainability
Bakish, Yarden, Zimerman, Itamar, Chefer, Hila, Wolf, Lior
The development of effective explainability tools for Transformers is a crucial pursuit in deep learning research. One of the most promising approaches in this domain is Layer-wise Relevance Propagation (LRP), which propagates relevance scores backward through the network to the input space by redistributing activation values based on predefined rules. However, existing LRP-based methods for Transformer explainability entirely overlook a critical component of the Transformer architecture: its positional encoding (PE), resulting in violation of the conservation property, and the loss of an important and unique type of relevance, which is also associated with structural and positional features. To address this limitation, we reformulate the input space for Transformer explainability as a set of position-token pairs. This allows us to propose specialized theoretically-grounded LRP rules designed to propagate attributions across various positional encoding methods, including Rotary, Learnable, and Absolute PE. Extensive experiments with both fine-tuned classifiers and zero-shot foundation models, such as LLaMA 3, demonstrate that our method significantly outperforms the state-of-the-art in both vision and NLP explainability tasks. Our code is publicly available.
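A minimal sketch, assuming the simplest additive case x = token_emb + pos_emb, of an LRP split that keeps the conservation property by attributing relevance to the position-token pair in proportion to each addend's contribution (epsilon-stabilized). The paper derives dedicated rules for Rotary, Learnable, and Absolute PE; this only illustrates the additive setting.

```python
import torch

def split_relevance(tok_emb, pos_emb, relevance, eps=1e-6):
    """Distribute incoming relevance between token and positional contributions."""
    total = tok_emb + pos_emb
    stab = torch.where(total >= 0, torch.ones_like(total), -torch.ones_like(total))
    denom = total + eps * stab
    r_tok = relevance * tok_emb / denom         # token share
    r_pos = relevance * pos_emb / denom         # positional share
    return r_tok, r_pos                         # r_tok + r_pos ≈ relevance (conservation)
```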
On the Expressivity of Selective State-Space Layers: A Multivariate Polynomial Approach
Cohen-Karlik, Edo, Zimerman, Itamar, Galanti, Liane, Atad, Ido, Globerson, Amir, Wolf, Lior
Recent advances in efficient sequence modeling have introduced selective state-space layers, a key component of the Mamba architecture, which have demonstrated remarkable success in a wide range of NLP and vision tasks. While Mamba's empirical performance has matched or surpassed that of SoTA transformers on these diverse benchmarks, the theoretical foundations underlying its powerful representational capabilities remain less explored. In this work, we investigate the expressivity of selective state-space layers using multivariate polynomials, and prove that they surpass linear transformers in expressiveness. Consequently, our findings reveal that Mamba offers superior representational power over linear attention-based models for long sequences, while not sacrificing their generalization. Our theoretical insights are validated by a comprehensive set of empirical experiments on various datasets.
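A minimal sketch of the (scalar) selective recurrence being analyzed: h_t = a(x_t) * h_{t-1} + b(x_t) * x_t. Unrolling it shows the output is a multivariate polynomial in the inputs whenever a(.) and b(.) are affine in x, which is the object compared against linear attention. The specific a/b parameterization below is illustrative, not Mamba's exact one.

```python
def selective_ssm(xs, a, b):
    """Scalar selective state-space recurrence with input-dependent transition."""
    h, ys = 0.0, []
    for x in xs:
        h = a(x) * h + b(x) * x      # selectivity: the transition depends on the input
        ys.append(h)
    return ys

# Example affine selectivity (hypothetical coefficients):
# a = lambda x: 0.5 + 0.1 * x ; b = lambda x: 1.0 + 0.2 * x
# selective_ssm([1.0, 2.0, 3.0], a, b) -> outputs that are polynomials in x_1, x_2, x_3
```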
Deep Active Speech Cancellation with Multi-Band Mamba Network
Mishaly, Yehuda, Wolf, Lior, Nachmani, Eliya
We present a novel deep learning network for Active Speech Cancellation (ASC), advancing beyond Active Noise Cancellation (ANC) methods by effectively canceling both noise and speech signals. The proposed Multi-Band Mamba architecture segments input audio into distinct frequency bands, enabling precise anti-signal generation and improved phase alignment across frequencies. Additionally, we introduce an optimization-driven loss function that provides near-optimal supervisory signals for anti-signal generation. Experimental results demonstrate substantial performance gains, achieving up to 7.2 dB improvement in ANC scenarios and 6.2 dB in ASC, significantly outperforming existing methods. Audio samples are available at https://mishalydev.github.io/DeepASC-Demo
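A minimal sketch, under assumptions, of the multi-band decomposition: the input (assumed a float waveform) is split into frequency bands with FFT masks, a per-band module produces an anti-signal, and the bands are summed. band_models is a hypothetical stand-in for the per-band Mamba blocks; the actual filter bank and network are not shown.

```python
import numpy as np

def multiband_anti_signal(audio, band_edges_hz, sr, band_models):
    """audio: float waveform; band_edges_hz: [(lo, hi), ...]; band_models: one callable per band."""
    spec = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    out = np.zeros_like(audio)
    for (lo, hi), model in zip(band_edges_hz, band_models):
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spec * mask, n=len(audio))    # isolate one frequency band
        out += model(band)                                # per-band anti-signal
    return out                                            # played back to cancel the input
```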
Classifier-Guided Captioning Across Modalities
Shaulov, Ariel, Shaharabany, Tal, Shaar, Eitan, Chechik, Gal, Wolf, Lior
Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This limitation hinders performance in tasks like audio or video captioning, where different semantic cues are needed. Addressing this challenge is crucial for creating more adaptable and versatile captioning frameworks applicable across diverse real-world contexts. In this work, we introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning, where it is crucial to describe sounds and their sources. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. The classifier is trained on a dataset automatically generated by GPT-4, using tailored prompts specifically designed to enhance key aspects of the generated captions. Importantly, the framework operates solely during inference, eliminating the need for further training of the underlying captioning model. We evaluate the framework on various models and modalities, with a focus on audio captioning, and report promising results. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
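A minimal sketch of classifier guidance at inference, under assumptions: candidate continuations from the frozen captioner are re-scored by a text classifier (e.g., one trained to detect audibility), and decoding follows the combined score. lm_score, classifier_prob, and candidates are hypothetical stand-ins; the frozen captioning model itself is never updated.

```python
def guided_step(prefix, candidates, lm_score, classifier_prob, weight=1.0):
    """Pick the next continuation by fusing the frozen LM score with classifier guidance."""
    def score(tok):
        text = prefix + tok
        return lm_score(text) + weight * classifier_prob(text)   # guidance only at inference
    return max(candidates, key=score)
```

The weight trades caption fluency against the classifier's target attribute, and can be tuned per modality without touching the captioner.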
Segment-Based Attention Masking for GPTs
Katz, Shahar, Ringel, Liran, Romano, Yaniv, Wolf, Lior
Modern Language Models (LMs) owe much of their success to masked causal attention, the backbone of Generative Pre-Trained Transformer (GPT) models. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial "prefill" phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. This Segment-by-Segment scheme entails no additional computational overhead. When integrating it into models such as Llama and Qwen, state-of-the-art performance is consistently achieved.
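A minimal sketch of the prefill mask described above: tokens inside the same segment (e.g., the system prompt or the user prompt) attend to each other bidirectionally, attention across segments stays causal, and the subsequent answer is generated with ordinary causal decoding. Segment ids per token are assumed given; this is an illustration, not the authors' implementation.

```python
import torch

def segment_prefill_mask(segment_ids: torch.Tensor) -> torch.Tensor:
    """Boolean (n, n) mask for the prefill phase; True = attention allowed."""
    n = segment_ids.numel()
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))           # standard causal part
    same_segment = segment_ids[:, None] == segment_ids[None, :]       # bidirectional within a block
    return causal | same_segment

# e.g. segment_ids = torch.tensor([0, 0, 0, 1, 1])
# -> lower-triangular plus block-diagonal: system-prompt tokens see each other fully.
```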
Reversed Attention: On The Gradient Descent Of Attention Layers In GPT
Katz, Shahar, Wolf, Lior
The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward pass of LMs, the backward pass of attention has been largely overlooked. In this work, we study the mathematics of the backward pass of attention, revealing that it implicitly calculates an attention matrix we refer to as "Reversed Attention". We examine the properties of Reversed Attention and demonstrate its ability to elucidate the models' behavior and edit dynamics. In an experimental setup, we showcase the ability of Reversed Attention to directly alter the forward pass of attention, without modifying the model's weights, using a novel method called "attention patching". In addition to enhancing the comprehension of how LMs configure attention layers during backpropagation, Reversed Attention maps contribute to a more interpretable backward pass. Our code will be available at: https://github.
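A minimal sketch, under assumptions, of reading out the backward pass of an attention layer: the forward pass is generic scaled-dot-product attention, and the map returned alongside it is, as a simplification, the gradient of a scalar loss with respect to the attention probabilities. The paper's exact definition of Reversed Attention may differ; loss_fn and the single-head layout are hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_with_reversed(q, k, v, loss_fn):
    """Return the forward attention map and the gradient flowing back through it."""
    q, k, v = (t.detach().requires_grad_(True) for t in (q, k, v))
    probs = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    probs.retain_grad()                        # keep the gradient of the attention map
    loss_fn(probs @ v).backward()              # loss_fn must return a scalar
    return probs.detach(), probs.grad          # forward map, backward ("Reversed Attention") map
```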