Large Language Model
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
We explore how generating a chain of thought---a series of intermediate reasoning steps---significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
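As a rough illustration of the prompting method described above, the sketch below assembles a few-shot prompt in which every demonstration spells out its intermediate reasoning before the final answer. The helper name `build_cot_prompt`, the `Q:`/`A:` layout, and the exemplar are illustrative assumptions, not necessarily the paper's exact prompt format.

```python
from typing import List, Tuple

# Each exemplar is (question, chain_of_thought, answer). Illustrative data only.
Exemplar = Tuple[str, str, str]


def build_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    """Assemble a few-shot prompt whose demonstrations include their reasoning."""
    parts = []
    for q, reasoning, answer in exemplars:
        # The reasoning steps come before the answer, so the model imitates that order.
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The final question is left open for the model to continue with its own steps.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


exemplars = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11.",
        "11",
    ),
]

prompt = build_cot_prompt(
    exemplars,
    "A baker has 3 trays of 12 rolls and sells 10. How many rolls are left?",
)
print(prompt)
```

The abstract's headline result uses eight such exemplars; the single one here only keeps the sketch short.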
ChatGPT is dialing back its 'if you want' end-response teasers
OpenAI is updating GPT-5.3 Instant to reduce annoying "if you want" and teaser-style phrasing that users found intrusive. This change addresses widespread user complaints about persistent, clickbait-like follow-up prompts that negatively impacted the AI interaction experience. The update aims to create more natural, direct conversations by making ChatGPT less chatty and eliminating the bothersome response teasers. It wasn't all that long ago that ChatGPT was a constant nag, persistently dropping "Would you like me to?"-style questions at the end of its responses. OpenAI eventually tweaked the phrasing, dropping the question marks and going for "if you want"-style teasers that invited users to extend their chat sessions. Now, OpenAI has acknowledged that it went too far with the clickbaity follow-ups, noting in a recent update for one of its newest models that it's now cutting back on the teasers. "We're rolling out an update to GPT-5.3 Instant that improves follow-up tone and reduces teaser-style phrasing," reads a recent ChatGPT release note, which adds that users should soon see fewer follow-ups like "if you want," "you'll never believe," and "I can tell you three things that..." Those teasers are, of course, a way for ChatGPT to keep subscribers chatting, but users have been complaining that the persistent follow-ups are more annoying than they are intriguing. "I hated it with a passion and hope it's completely gone," wrote one user on Reddit.
Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models
Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
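To make the mixed-precision idea more concrete, here is a minimal sketch under stated assumptions: it takes the SVD of the delta between fine-tuned and base weights and quantizes the singular vectors in groups, giving more bits to the groups with larger singular values. The function names, the bit plan, the synthetic weights, and the simple round-to-nearest quantizer are all assumptions made for illustration; the quantizer is a stand-in and may differ from whatever the paper actually uses.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Round-to-nearest uniform quantization; returns the dequantized values."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def compress_delta(w_base: np.ndarray, w_ft: np.ndarray, bit_plan) -> np.ndarray:
    """Approximate (w_ft - w_base) from quantized singular vectors.

    bit_plan is a list of (num_singular_vectors, bits), largest singular values first.
    """
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    approx = np.zeros_like(delta)
    start = 0
    for count, bits in bit_plan:
        end = min(start + count, len(s))
        if end <= start:
            break
        u_q = quantize_uniform(u[:, start:end], bits)
        vt_q = quantize_uniform(vt[start:end, :], bits)
        # Scale each quantized left vector by its singular value, then recombine.
        approx += (u_q * s[start:end]) @ vt_q
        start = end
    return approx


# Synthetic weights whose delta has a fast-decaying (long-tail) spectrum.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 64))
u0, _ = np.linalg.qr(rng.standard_normal((64, 64)))
v0, _ = np.linalg.qr(rng.standard_normal((64, 64)))
spectrum = 1.0 / (1.0 + np.arange(64)) ** 2
w_ft = w_base + (u0 * spectrum) @ v0.T

delta = w_ft - w_base
# Hypothetical plan: 8 bits for the top 4 vectors, 3 bits for the next 12, 2 bits for the rest.
mixed = compress_delta(w_base, w_ft, [(4, 8), (12, 3), (48, 2)])
rel_err = np.linalg.norm(delta - mixed) / np.linalg.norm(delta)
print(f"relative reconstruction error: {rel_err:.4f}")
```

At serving time the approximate fine-tuned weights would be `w_base + mixed`, so only the base model plus a compact delta per tenant needs to be stored.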
The Download: Quantum computing for health, and why the world doesn't recycle more nuclear waste
Plus: The FBI has admitted it's buying Americans' location data. In a laboratory on the outskirts of Oxford, a quantum computer built from atoms and light awaits its moment. The device is small but powerful--and also very valuable. Infleqtion, the company that owns it, is hoping its abilities will win $5 million at a competition next week. The prize will go to the quantum computer that can solve real health care problems that conventional "classical" computers are unable to solve. But there can be only one big winner--if there is a winner at all.
Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization
Recent research indicates that large language models (LLMs) are susceptible to jailbreaking attacks that can generate harmful content. This paper introduces a novel token-level attack method, Adaptive Dense-to-Sparse Constrained Optimization (ADC), which has been shown to successfully jailbreak multiple open-source LLMs.
Why Do We Need Weight Decay in Modern Deep Learning?
Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to large language models. Despite its widespread usage and being extensively studied in the classical literature, its role remains poorly understood for deep learning. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics, enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for large language models trained with nearly one-epoch training, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization, leading to lower training loss and improved training stability. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way.
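For readers who want the mechanics in front of them, the following is a minimal sketch of a plain SGD step with L2-style weight decay, w <- w - lr * (g + wd * w). The function name `sgd_step` and the toy values are assumptions for illustration; the snippet shows only the update rule, not the paper's analysis of its effect on training dynamics.

```python
from typing import List


def sgd_step(weights: List[float], grads: List[float],
             lr: float, weight_decay: float) -> List[float]:
    """One SGD step with weight decay: w <- w - lr * (g + weight_decay * w)."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# With a zero gradient, each step multiplies every weight by (1 - lr * weight_decay),
# so the decay term alone pulls the weights toward zero.
w = [1.0, -2.0]
for _ in range(3):
    w = sgd_step(w, [0.0, 0.0], lr=0.1, weight_decay=0.5)
print(w)
```

Because the shrink factor depends on the product of the learning rate and the decay coefficient, the two hyperparameters interact rather than acting independently.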
Unified Generative and Discriminative Training for Multi-modal Large Language Models
In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations and weak object discrimination persist. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval, yet struggles with complex scenarios requiring fine-grained semantic differentiation.
Boosting the Potential of Large Language Models with an Intelligent Information Assistant
The emergence of Large Language Models (LLMs) has significantly advanced natural language processing, but these models often generate factually incorrect information, known as hallucination. Initial retrieval-augmented generation (RAG) methods like the Retrieve-Read framework were inadequate for complex reasoning tasks. Subsequent prompt-based RAG strategies and Supervised Fine-Tuning (SFT) methods improved performance but required frequent retraining and risked altering foundational LLM capabilities. To cope with these challenges, we propose Assistant-based Retrieval-Augmented Generation (AssistRAG), integrating an intelligent information assistant within LLMs. This assistant manages memory and knowledge through tool usage, action execution, memory building, and plan specification. Using a two-phase training approach--Curriculum Assistant Learning and Reinforced Preference Optimization--AssistRAG enhances information retrieval and decision-making. Experiments show AssistRAG significantly outperforms benchmarks, especially benefiting less advanced LLMs, by providing superior reasoning capabilities and accurate responses.
SnapKV: LLM Knows What You are Looking for Before Generation
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head.
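The selection step described above can be sketched for a single attention head as follows. This is a simplified illustration under assumptions: the function name `select_kv_positions`, the max-pooling used to realize the "clustered" selection, the kernel size, and the toy attention weights are all hypothetical and are not taken from the SnapKV implementation.

```python
from typing import List


def select_kv_positions(attn: List[List[float]], window: int,
                        keep: int, kernel: int = 3) -> List[int]:
    """Pick the prompt positions whose KV entries are kept for one head.

    attn[i][j] is the attention weight from prompt query i to prompt key j.
    The last `window` prompt tokens form the observation window.
    """
    n = len(attn)
    prefix_len = n - window
    # Vote: total attention each prefix key receives from the observation window.
    votes = [sum(attn[q][j] for q in range(prefix_len, n)) for j in range(prefix_len)]
    # Max-pool over neighbors so positions around a strong hit are kept together.
    half = kernel // 2
    pooled = [max(votes[max(0, j - half): j + half + 1]) for j in range(prefix_len)]
    # Keep the top-scoring prefix positions, plus the whole observation window.
    top = sorted(range(prefix_len), key=lambda j: pooled[j], reverse=True)[:keep]
    return sorted(top) + list(range(prefix_len, n))


# Toy example: 8 prompt tokens, of which the last 2 are the observation window.
attn = [[0.0] * 8 for _ in range(8)]
attn[6] = [0.0, 0.1, 0.6, 0.1, 0.0, 0.0, 0.2, 0.0]
attn[7] = [0.0, 0.0, 0.5, 0.1, 0.0, 0.05, 0.15, 0.2]

kept = select_kv_positions(attn, window=2, keep=3)
print(kept)  # [1, 2, 3, 6, 7]
```

In this toy case position 2 receives most of the attention, and pooling causes its neighbors 1 and 3 to be retained with it, which is one plausible reading of selecting clustered rather than isolated positions.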