Large Language Model
ChatGPT can be hijacked without you knowing. Lockdown Mode is the fix
PCWorld reports that OpenAI launched Lockdown Mode for ChatGPT to combat prompt injection attacks that can hijack AI systems and steal personal information. These attacks have previously compromised AI browsers like Perplexity and controlled smart home devices through Google Gemini by tricking systems with malicious instructions. Lockdown Mode restricts features like live web browsing and Deep Research across all ChatGPT plans, though OpenAI acknowledges risks from uploaded files remain. OpenAI has launched a new security feature in ChatGPT called Lockdown Mode, designed to provide additional protection against so-called "prompt injection attacks." A prompt injection attack is when someone crafts a deceptive prompt in an attempt to trick the LLM into following malicious instructions and/or revealing sensitive information.
Quadratic Coreset Selection: Certifying and Reconciling Sequence and Token Mining for Efficient Instruction Tuning
Instruction tuning (IT) has recently been found to be impressively data-efficient for post-training large language models (LLMs). However, the pursuit of efficiency predominantly focuses on sequence-level curation, often overlooking the nuanced impact of critical tokens and the inherent risks of token noise and biases. Drawing inspiration from bi-level coreset selection, our work provides a principled view of the motivation behind selecting instructions' responses. This leads to our approach, Quadratic Coreset Selection (QCS), which reconciles sequence-level and token-level influence contributions, deriving more expressive LLMs with established theoretical results. Although the original QCS framework is challenged by the prohibitive cost of inverting LLM-scale Hessian matrices, we overcome this barrier by proposing a novel probabilistic variant of QCS, which relaxes the original formulation through re-parameterized densities. This solver is efficiently learned using hierarchical policy gradients without requiring back-propagation, achieving provable convergence and certified asymptotic equivalence to the original objective. Our experiments demonstrate QCS's superior sequence-level data efficiency and reveal how strategically leveraging token-level influence elevates the performance ceiling of data-efficient IT. Furthermore, QCS's adaptability is showcased through its successes in regular IT and challenging targeted IT scenarios, particularly in free-form complex instruction-following and CoT reasoning. These results underscore QCS's potential for a wide array of versatile post-training applications.
Let LRMs Break Free from Overthinking via Self-Braking Tuning
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions.
My A/C unit came with a cruddy manual. Claude made a better one
PCWorld reports how Claude AI transformed a confusing air conditioner manual into a comprehensive 12-page guide with visuals and maintenance tips. The process involved uploading the generic manual and model number to Claude's Cowork feature, which generated accurate operating procedures and quick-start guides. This demonstrates AI's potential to make complex product documentation more user-friendly and accessible for consumers struggling with manufacturer manuals. Um, what does this button do? Our new air conditioner had just arrived, a necessity for a sure-to-be-sizzling New York summer, and already I was scratching my head.
DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text.
Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning
Maintaining consistent model performance across domains is a fundamental challenge in machine learning. While recent work has explored using LLM-generated data for fine-tuning, its impact on cross-domain generalization remains poorly understood. This paper presents a systematic analysis revealing that fine-tuning with LLM-generated data not only improves target task performance but also reduces non-target task degradation compared to fine-tuning with ground truth data. By analyzing data sequences from tasks across various domains, we demonstrate that this enhancement of non-target task robustness stems from the reduction of high-perplexity tokens found in LLM-generated sequences. Following these findings, we show that masking high-perplexity tokens in ground truth training data preserves non-target task performance to a degree comparable to using LLM-generated data. Extensive experiments across different model families and scales, including Gemma 2 IT 2B, Llama 3 8B Instruct, and three additional models, agree with our findings. To the best of our knowledge, this is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning, offering valuable insights for developing more robust fine-tuning strategies.
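The masking idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-token log-probabilities have already been computed by some model, and the function names and the threshold value `ppl_threshold=2.5` are hypothetical choices made for this example.

```python
import math

def low_perplexity_mask(token_logprobs, ppl_threshold=2.5):
    """Return 1 for tokens kept in the loss and 0 for masked tokens.

    Per-token perplexity is exp(-log p). The threshold is an assumed,
    illustrative value, not one taken from the paper.
    """
    return [1 if math.exp(-lp) <= ppl_threshold else 0 for lp in token_logprobs]

def masked_nll(token_logprobs, mask):
    """Average negative log-likelihood over the unmasked tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return -sum(lp * m for lp, m in zip(token_logprobs, mask)) / kept

# Toy sequence: the third token is very unlikely under the model (p = 0.05),
# so its perplexity (20) exceeds the threshold and it is dropped from the loss.
token_logprobs = [math.log(0.9), math.log(0.5), math.log(0.05), math.log(0.8)]
mask = low_perplexity_mask(token_logprobs)
loss = masked_nll(token_logprobs, mask)
```

In a real training loop the same mask would typically be applied to the per-token cross-entropy before averaging, so high-perplexity tokens contribute no gradient.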
EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions
Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries that trigger unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions that consistently elicit confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm that explores the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize the evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with 85.34% higher average refusal triggering rate across 9 LLMs without a safety-prior system prompt, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training.
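The mutate, recombine, and select loop described in this abstract can be illustrated with a generic evolutionary search. The sketch below is not EVOREFUSE itself: the word-swap mutation, the one-point crossover, and the `toy_refusal_score` fitness (a simple word count standing in for the paper's evidence lower bound on refusal probability) are all hypothetical simplifications chosen for illustration.

```python
import random

# Hypothetical stand-in for a refusal-probability estimate: counts words that
# merely sound alarming in otherwise harmless instructions.
ALARMING = {"kill", "shoot", "terminate"}

def toy_refusal_score(text):
    return sum(word in ALARMING for word in text.split())

def mutate(instruction, vocab, rng):
    """Replace one randomly chosen word with a word from vocab."""
    words = instruction.split()
    words[rng.randrange(len(words))] = rng.choice(vocab)
    return " ".join(words)

def recombine(a, b, rng):
    """One-point crossover: prefix of a followed by suffix of b."""
    wa, wb = a.split(), b.split()
    cut = rng.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def evolve(seeds, fitness, vocab, generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    population = list(seeds)
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            parent_a, parent_b = rng.sample(population, 2)
            children.append(mutate(recombine(parent_a, parent_b, rng), vocab, rng))
        # Elitist selection: parents compete with children, best pop_size survive.
        candidates = set(population) | set(children)
        population = sorted(candidates, key=lambda s: (-fitness(s), s))[:pop_size]
    return population

seeds = ["how do I stop a background process", "how do I take a photo outdoors"]
vocab = ["kill", "terminate", "process", "photo", "shoot", "stop"]
final = evolve(seeds, toy_refusal_score, vocab)
```

Because selection is elitist, the best score in `final` can never fall below the best seed score; a real system would replace the toy fitness with scores obtained from the target LLMs.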
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Despite the increasing demand for unlearning, a technically-grounded optimization framework is lacking. Gradient ascent (GA)-type methods, though widely used, are suboptimal as they reverse the learning process without controlling optimization divergence (i.e., deviation from the pre-trained state), leading to risks of model collapse. Negative preference optimization (NPO) has been proposed to address this issue and is considered one of the state-of-the-art LLM unlearning approaches. In this work, we revisit NPO and identify another critical issue: reference model bias. This bias arises from using the reference model (i.e., the model prior to unlearning) to assess unlearning success, which can lead to a misleading impression of the true data-wise unlearning effectiveness. Specifically, it could cause (a) uneven allocation of optimization power across forget data with varying difficulty levels, and (b) ineffective gradient weight smoothing during the early stages of unlearning optimization. To overcome these challenges, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that simplicity--removing the reliance on a reference model (through the lens of simple preference optimization)--benefits unlearning. We provide deeper insights into SimNPO's advantages, including an analysis based on mixtures of Markov chains.
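To make the contrast concrete, the two per-example objectives can be written side by side. The abstract does not give the formulas, so the forms below follow the NPO and SimNPO objectives as they are commonly stated and should be checked against the papers. The function names and default values of `beta` and `gamma` are assumptions made for this sketch.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def npo_loss(logp_theta, logp_ref, beta=0.1):
    """NPO on one forget example (assumed form).

    Depends on the reference model through the log-ratio
    log pi_theta(y|x) - log pi_ref(y|x).
    """
    return -(2.0 / beta) * log_sigmoid(-beta * (logp_theta - logp_ref))

def simnpo_loss(logp_theta, length, beta=0.1, gamma=0.0):
    """SimNPO on one forget example (assumed form).

    Drops the reference model and instead normalizes the sequence
    log-probability by the response length, with a margin gamma.
    """
    return -(2.0 / beta) * log_sigmoid(-(beta / length) * logp_theta - gamma)

# Both losses shrink as the model assigns lower probability to the forget response.
before = simnpo_loss(logp_theta=-5.0, length=10)
after = simnpo_loss(logp_theta=-50.0, length=10)
```

In this form, the only inputs SimNPO needs are the current model's sequence log-probability and the response length, which is what removing the reliance on a reference model amounts to.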
RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models
Large language models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their exponentially increasing parameter counts pose significant challenges for deployment on resource-constrained devices. Vector Quantization (VQ) shows great promise for low-bit quantization (e.g., 2 to 4 bits), but existing work faces two key challenges: unconstrained direction error and suboptimal bit allocation. In this paper, we propose RSAVQ, a novel VQ framework to enhance extremely low-bit quantization for LLMs. RSAVQ introduces two geometry-driven innovations that effectively mitigate the above limitations: (1) Error Direction Sensitivity Guidance (EDSG), which leverages the Fisher information matrix (FIM)-induced Riemannian metric to project quantization errors onto low-sensitivity directions in the parameter space. Specifically, this projection is performed along the negative natural gradient direction, which effectively suppresses error expansion.
Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
Achieving better alignment between vision embeddings and Large Language Models (LLMs) is crucial for enhancing the abilities of Multimodal LLMs (MLLMs), particularly for recent models that rely on powerful pretrained vision encoders and LLMs. A common approach to connect the pretrained vision encoder and LLM is through a projector applied after the vision encoder. However, the projector is often trained to enable the LLM to generate captions, and hence the mechanism by which LLMs understand each vision token remains unclear. In this work, we first investigate the role of the projector in compressing vision embeddings and aligning them with word embeddings. We show that the projector significantly compresses visual information, removing redundant details while preserving essential elements necessary for the LLM to understand visual content.