Chen, Yiran
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Liu, Yudong, Sun, Jingwei, Lin, Yueqian, Zhang, Jingyang, Yin, Ming, Wang, Qinsi, Zhang, Jianyi, Li, Hai, Chen, Yiran
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatiotemporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capabilities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
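To make the keyframe-oriented pruning idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation): per-frame relevance to the text query is approximated by cosine similarity between a mean-pooled frame embedding and the query embedding, and a global token budget is then split across frames in proportion to that relevance, so low-relevance frames are pruned more aggressively rather than dropped. The function name `adaptive_prune`, the softmax temperature, and the pooling choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_prune(frame_tokens, query_emb, keep_ratio=0.2):
    """frame_tokens: (num_frames, tokens_per_frame, dim) vision tokens.
    query_emb: (dim,) embedding of the text query.
    Returns a list of pruned token tensors, one per frame."""
    num_frames, tokens_per_frame, _ = frame_tokens.shape
    frame_emb = frame_tokens.mean(dim=1)                            # (F, D) pooled frame features
    rel = F.cosine_similarity(frame_emb, query_emb[None], dim=-1)   # (F,) relevance to the query
    weights = torch.softmax(rel / 0.1, dim=0)                       # sharpen into a distribution
    total_budget = int(keep_ratio * num_frames * tokens_per_frame)
    # Keep at least one token per frame so temporal context is never fully discarded.
    keep_per_frame = torch.clamp((weights * total_budget).round().long(), min=1)
    pruned = []
    for f in range(num_frames):
        # Within a frame, keep the tokens most aligned with the query.
        tok_scores = F.cosine_similarity(frame_tokens[f], query_emb[None], dim=-1)
        idx = tok_scores.topk(min(int(keep_per_frame[f]), tokens_per_frame)).indices
        pruned.append(frame_tokens[f, idx])
    return pruned

# Example: 32 frames x 196 tokens x 768 dims, keeping roughly 20% of tokens overall.
tokens = torch.randn(32, 196, 768)
query = torch.randn(768)
kept = adaptive_prune(tokens, query, keep_ratio=0.2)
```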
Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility
Kuo, Martin, Zhang, Jingyang, Zhang, Jianyi, Tang, Minxue, DiValentin, Louis, Ding, Aolin, Sun, Jingwei, Chen, William, Hass, Amin, Chen, Tianlong, Chen, Yiran, Li, Hai
With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting the key memories most closely associated with PII in sequences, followed by memory implanting with suitable substitute memories to maintain the LLM's functionality. We conduct evaluations across multiple models to protect common PII, such as phone numbers and physical addresses, against prevalent PII-targeted attacks, demonstrating the superiority of our method compared with other existing defensive techniques. The results show that our PPA method eliminates the risk of phone number exposure entirely (a 100% reduction) and significantly reduces the risk of physical address exposure by 9.8%-87.6%, all while maintaining comparable model utility.
Large Language Models (LLMs) (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2023; Dubey et al., 2024) have achieved remarkable success in recent years and are widely adopted either as general-purpose models or, after fine-tuning, as specialized and personal assistants. Despite this success, LLMs, with their huge parameter counts and capacity, also exhibit the concerning "memorization" phenomenon (Carlini et al., 2019; 2021), i.e., they can precisely memorize some of their training data. Such memorization is vulnerable to various attacks (e.g., membership inference and data extraction attacks) and risks severe privacy breaches. One of the most serious concerns comes from attacks that aim to extract personally identifiable information (PII) memorized by the models, which compromise users' privacy and are likely to cause real-world harm. To defend against such PII or data extraction attacks, several machine unlearning techniques have been applied to LLMs. However, existing methods typically fall short in the trade-off between defense performance and model utility. For example, most unlearning approaches are based on gradient ascent (Jang et al., 2022; Wang et al., 2024) and often degrade model functionality to the point where the model can no longer handle its original tasks and thus becomes useless.
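As a rough illustration of the forget-then-implant idea (a sketch under assumed interfaces, not the paper's algorithm): the snippet below treats the lowest-loss tokens inside the PII span as the "key memories", applies gradient ascent only on those positions, and then fine-tunes on the same context with a substitute value in place of the real PII. It assumes a Hugging Face-style causal LM whose forward pass exposes `.logits`; the selection heuristic, step count, and other hyperparameters are placeholders.

```python
import torch.nn.functional as F

def token_losses(model, input_ids):
    """Per-token next-token cross-entropy for one sequence of shape (1, T)."""
    logits = model(input_ids).logits                          # (1, T, V), assumed HF-style output
    return F.cross_entropy(
        logits[:, :-1].transpose(1, 2),                       # (1, V, T-1)
        input_ids[:, 1:],                                     # (1, T-1)
        reduction="none",
    ).squeeze(0)                                              # (T-1,)

def forget_then_implant(model, optimizer, real_ids, substitute_ids,
                        pii_slice, top_k=3, steps=5):
    """real_ids: sequence containing the true PII; substitute_ids: same context
    with a plausible substitute value; pii_slice: slice covering the PII tokens."""
    for _ in range(steps):
        # Stage 1 (forget): gradient ascent on the "key memories", approximated
        # here as the most confidently predicted tokens inside the PII span.
        losses = token_losses(model, real_ids)
        span = losses[pii_slice]
        key = span.topk(min(top_k, span.numel()), largest=False).indices
        (-span[key].mean()).backward()
        optimizer.step()
        optimizer.zero_grad()
        # Stage 2 (implant): ordinary fine-tuning on the substituted sequence
        # so the model stays fluent on the surrounding context.
        token_losses(model, substitute_ids).mean().backward()
        optimizer.step()
        optimizer.zero_grad()
```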
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
Kuo, Martin, Zhang, Jianyi, Ding, Aolin, Wang, Qinsi, DiValentin, Louis, Bao, Yujia, Wei, Wei, Juan, Da-Cheng, Li, Hai, Chen, Yiran
Warning: This paper contains potentially offensive and harmful text. Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks--using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline--dropping from 98% to below 2%--and, in some instances, even transform initially cautious tones into ones that are willing to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.
Lorecast: Layout-Aware Performance and Power Forecasting from Natural Language
Wang, Runzhi, Sengupta, Prianka, Chen, Yiran, Hu, Jiang
In chip design planning, obtaining reliable performance and power forecasts for various design options is of critical importance. Traditionally, this involves using system-level models, which often lack accuracy, or trial synthesis, which is both labor-intensive and time-consuming. We introduce a new methodology, called Lorecast, which accepts English prompts as input and rapidly generates layout-aware performance and power estimates. This approach bypasses the need for HDL code development or synthesis, making it both fast and user-friendly. Experimental results demonstrate that Lorecast achieves accuracy within a few percent of post-layout analysis.
A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection
Linderman, Randolph W., Chen, Yiran, Linderman, Scott W.
Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
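For readers unfamiliar with the baseline being generalized, the following is a small NumPy sketch of the relative Mahalanobis distance score (RMDS): fit per-class means with a tied within-class covariance, fit a single background Gaussian to all features, and score a test embedding by the gap between its class-conditional and background squared Mahalanobis distances. This follows the standard RMDS recipe rather than the paper's Bayesian nonparametric models; the function names are illustrative.

```python
import numpy as np

def fit_rmds(feats, labels):
    """feats: (N, D) training embeddings; labels: (N,) integer class ids."""
    classes = np.unique(labels)
    mus = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Tied within-class covariance for the class-conditional Gaussians.
    centered = feats - mus[np.searchsorted(classes, labels)]
    cov_inv = np.linalg.pinv(centered.T @ centered / len(feats))
    # Background Gaussian fit to all features, ignoring class labels.
    mu0 = feats.mean(axis=0)
    cov0_inv = np.linalg.pinv(np.cov(feats, rowvar=False))
    return mus, cov_inv, mu0, cov0_inv

def rmds_score(x, mus, cov_inv, mu0, cov0_inv):
    """x: (D,) test embedding. Higher score = more in-distribution."""
    diff = x - mus                                            # (K, D)
    d_class = np.einsum("kd,de,ke->k", diff, cov_inv, diff)   # per-class squared Mahalanobis distance
    d_bg = (x - mu0) @ cov0_inv @ (x - mu0)                   # background squared Mahalanobis distance
    return -np.min(d_class - d_bg)
```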
Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives
Wang, Qinsi, Ke, Jinghan, Tomizuka, Masayoshi, Chen, Yiran, Keutzer, Kurt, Xu, Chenfeng
We provide a new LLM-compression solution via SVD, unlocking new possibilities for LLM compression beyond quantization and pruning. We point out that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance. Building on this principle, we address three critical challenges in SVD-based LLM compression: (1) How can we determine the optimal activation truncation position for each weight matrix in LLMs? (2) How can we efficiently reconstruct the weight matrices based on truncated activations? (3) How can we address the inherent "injection" nature of SVD that results in information loss? We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.
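To ground the setting, here is a generic truncated-SVD factorization of a linear layer, the kind of low-rank compression such methods build on; it is not Dobi-SVD itself, and the fixed `rank` argument stands in for the truncation positions that Dobi-SVD instead determines differentiably per matrix.

```python
import torch
import torch.nn as nn

def svd_compress_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data                      # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank]
    # W is approximated by (U_r * S_r) @ Vh_r, realized as two thin linear layers.
    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data = Vh_r                    # (rank, in_features)
    up.weight.data = U_r * S_r                 # (out_features, rank)
    if layer.bias is not None:
        up.bias.data = layer.bias.data
    return nn.Sequential(down, up)

# Example: compress a 4096x4096 projection to rank 512 (about 4x fewer parameters).
proj = nn.Linear(4096, 4096)
proj_lr = svd_compress_linear(proj, rank=512)
```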
Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
Horton, Mark, Molom-Ochir, Tergel, Liu, Peter, Gopal, Bhavna, Wei, Chiyue, Guo, Cong, Taylor, Brady, Fan, Deliang, Wang, Shan X., Li, Hai, Chen, Yiran
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. We also implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD incurs only a 1.78% performance loss on GLUE compared to 9.08% for state-of-the-art binarization work, and a 2.5% performance loss on ImageNet compared to 12.14%, all while targeting custom hardware with a 79% area reduction and 87% power reduction compared to its standard attention counterpart.
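The core score computation can be sketched in a few lines (assumed shapes and names, not the authors' kernels): after sign-binarizing the query and key projections, the dot product of two +/-1 vectors of dimension d equals d minus twice their Hamming distance, so attention scores can be recovered from Hamming distances alone; dedicated hardware would realize this with XNOR and popcount operations, whereas the sketch uses ordinary tensor ops to show the equivalence.

```python
import torch

def hamming_attention_scores(Q, K):
    """Q, K: (batch, seq, dim) real-valued query/key projections."""
    Qb = torch.sign(Q)                 # {-1, +1} (sign(0) = 0 is negligible for random inputs)
    Kb = torch.sign(K)
    d = Q.shape[-1]
    # Hamming distance between +/-1 vectors = number of disagreeing positions.
    hamming = (d - Qb @ Kb.transpose(-1, -2)) / 2          # (batch, seq, seq)
    scores = d - 2 * hamming                               # equals the binarized dot product Qb @ Kb^T
    return scores / d ** 0.5                               # usual 1/sqrt(d) scaling

Q = torch.randn(2, 128, 64)
K = torch.randn(2, 128, 64)
attn = torch.softmax(hamming_attention_scores(Q, K), dim=-1)
```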
A Survey of Research in Large Language Models for Electronic Design Automation
Pan, Jingyu, Zhou, Guanglei, Chang, Chen-Chia, Jacobson, Isaac, Hu, Jiang, Chen, Yiran
Within the rapidly evolving domain of Electronic Design Automation (EDA), Large Language Models (LLMs) have emerged as transformative technologies, offering unprecedented capabilities for optimizing and automating various aspects of electronic design. This survey provides a comprehensive exploration of LLM applications in EDA, focusing on advancements in model architectures, the implications of varying model sizes, and innovative customization techniques that enable tailored analytical insights. By examining the intersection of LLM capabilities and EDA requirements, the paper highlights the significant impact these models have on extracting nuanced understandings from complex datasets. Furthermore, it addresses the challenges and opportunities in integrating LLMs into EDA workflows, paving the way for future research and application in this dynamic field. Through this detailed analysis, the survey aims to offer valuable insights to professionals in the EDA industry, AI researchers, and anyone interested in the convergence of advanced AI technologies and electronic design.
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
Lin, Yueqian, Fu, Yuzhe, Zhang, Jingyang, Liu, Yudong, Zhang, Jianyi, Sun, Jingwei, Li, Hai "Helen", Chen, Yiran
Speech Large Language Models (Speech LLMs) represent a significant advancement in speech language understanding and processing, as they leverage the contextual reasoning capabilities of large language models to process audio inputs [1]. Unlike traditional cascaded pipelines, where automatic speech recognition (ASR) and language modeling are handled by separate modules, Speech LLMs unify audio processing, cross-modal fusion, and language modeling in a single architecture [2]. These unified models can perform multiple tasks such as speech recognition, speech translation, speaker identification, and emotion recognition. Existing benchmarks, however, contain only short audio clips and thus do not reflect the complexity of achieving long-context understanding and extracting precise information from lengthy audio sequences. To systematically assess the unique challenges posed by speech information retrieval (SIR), we present SPIRAL (Speech Informational Retrieval and Lookup), a 1,012-sample benchmark specifically crafted to evaluate Speech LLM performance on long-form audio sequences (around 90 seconds in duration). At a high level, SPIRAL constructs SIR questions by embedding a critical piece of information within lengthy and potentially distracting dialogues, thereby assessing the model's ability to pinpoint and retrieve essential content from long-form inputs.
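As a toy, text-level illustration of that construction (not the SPIRAL pipeline, which operates on audio): a single "needle" utterance carrying the answer is inserted at a random position inside a long distractor dialogue, and the question targets only that needle. All names and fields below are hypothetical.

```python
import random

def make_sir_item(distractor_turns, needle, question, answer, seed=0):
    rng = random.Random(seed)
    turns = list(distractor_turns)
    turns.insert(rng.randrange(len(turns) + 1), needle)    # hide the key utterance among distractors
    return {"dialogue": turns, "question": question, "answer": answer}

item = make_sir_item(
    distractor_turns=[f"Speaker A: filler remark {i}." for i in range(60)],
    needle="Speaker B: the package will arrive on Thursday at 3 pm.",
    question="When will the package arrive?",
    answer="Thursday at 3 pm",
)
```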
LoBAM: LoRA-Based Backdoor Attack on Model Merging
Yin, Ming, Zhang, Jingyang, Sun, Jingwei, Fang, Minghong, Li, Hai, Chen, Yiran
Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. At the same time, this scheme may open up backdoor attack opportunities in which a single malicious model can jeopardize the integrity of the merged model. Existing works demonstrate the risk of such attacks by assuming substantial computational resources, focusing on cases where the attacker can fully fine-tune the pre-trained model. Such an assumption, however, may not be feasible given the increasing size of machine learning models. In practice, where resources are limited and the attacker can only employ techniques like Low-Rank Adaptation (LoRA) to produce the malicious model, it remains unclear whether the attack can still work and pose threats. In this work, we first identify that the attack efficacy is significantly diminished when LoRA is used for fine-tuning. We then propose LoBAM, a method that yields a high attack success rate with minimal training resources. The key idea of LoBAM is to amplify the malicious weights in an intelligent way that effectively enhances the attack efficacy. We demonstrate that our design leads to improved attack success rates through both theoretical proof and extensive empirical experiments across various model merging scenarios. Moreover, we show that our method has strong stealthiness and is difficult to detect.
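For context on the setting (benign merging only; the attack itself is not reproduced here), the sketch below shows a simple task-vector merge: each contributor's fine-tuned weights minus the shared pre-trained weights are averaged into the base model, and a LoRA contributor is first materialized as full weights via W + B @ A. All names and scaling factors are illustrative assumptions.

```python
import torch

def merge_state_dicts(base, finetuned_list, alpha=1.0):
    """base / finetuned_list entries: dicts of parameter name -> tensor."""
    merged = {k: v.clone() for k, v in base.items()}
    for ft in finetuned_list:
        for k in merged:
            # Average each contributor's task vector into the shared base weights.
            merged[k] += alpha / len(finetuned_list) * (ft[k] - base[k])
    return merged

def lora_to_full(base, lora_pairs, scaling=1.0):
    """Materialize a LoRA contributor as full weights: W + scaling * B @ A."""
    full = {k: v.clone() for k, v in base.items()}
    for name, (A, B) in lora_pairs.items():               # A: (r, in), B: (out, r)
        full[name] = full[name] + scaling * (B @ A)
    return full
```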