Large Language Model


Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

arXiv.org Machine Learning

Adversarial training (AT) is an effective defense for large language models (LLMs) against jailbreak attacks, but performing AT on LLMs is costly. To improve the efficiency of AT for LLMs, recent studies propose continuous AT (CAT) that searches for adversarial inputs within the continuous embedding space of LLMs during AT. While CAT has achieved empirical success, its underlying mechanism, i.e., why adversarial perturbations in the embedding space can help LLMs defend against jailbreak prompts synthesized in the input token space, remains unknown. This paper presents the first theoretical analysis of CAT on LLMs based on in-context learning (ICL) theory. For linear transformers trained with adversarial examples from the embedding space on in-context linear regression tasks, we prove a robust generalization bound that has a negative correlation with the perturbation radius in the embedding space. This clearly explains why CAT can defend against jailbreak prompts from the LLM's token space. Further, the robust bound shows that the robustness of an adversarially trained LLM is closely related to the singular values of its embedding matrix. Based on this, we propose to improve LLM CAT by introducing an additional regularization term, which depends on singular values of the LLM's embedding matrix, into the objective function of CAT. Experiments on real-world LLMs demonstrate that our method can help LLMs achieve a better jailbreak robustness-utility tradeoff. The code is available at https://github.com/fshp971/continuous-adv-icl.


Adaptive Budget Allocation in LLM-Augmented Surveys

arXiv.org Machine Learning

Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10–12% of the budget relative to the optimal; our algorithm reduces this waste to 2–6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.
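The allocation idea in this abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes, hypothetically, that LLM reliability on a question is measured by the variance of the residual (human response minus LLM prediction), and it uses a simple greedy rule that spends each label where it is expected to shrink the variance of the per-question mean the most. The names `adaptive_allocate`, `sample_residual`, and `warmup` are illustrative.

```python
import random
import statistics


def adaptive_allocate(sample_residual, num_questions, budget, warmup=5):
    """Greedy label allocation (illustrative, not the paper's method).

    sample_residual(q) is a hypothetical callback that collects one human
    label for question q and returns the residual (human - LLM prediction).
    Returns the number of human labels spent on each question.
    """
    # Warm-up: a few labels per question so a variance estimate exists.
    residuals = [
        [sample_residual(q) for _ in range(warmup)]
        for q in range(num_questions)
    ]

    def gain(q):
        # Variance of the mean is var / n; one more label reduces it by
        # var / n - var / (n + 1) = var / (n * (n + 1)).
        n = len(residuals[q])
        return statistics.variance(residuals[q]) / (n * (n + 1))

    for _ in range(budget - warmup * num_questions):
        q_next = max(range(num_questions), key=gain)
        residuals[q_next].append(sample_residual(q_next))

    return [len(r) for r in residuals]


# Toy setup: the LLM is unreliable on question 0 and reliable on question 1.
rng = random.Random(1)
noise_levels = [1.0, 0.1]
counts = adaptive_allocate(
    lambda q: rng.gauss(0.0, noise_levels[q]), num_questions=2, budget=200
)
print(counts)
```

In this toy setup most of the budget flows to the question with the noisier residuals, which matches the qualitative behavior the abstract describes. The learning rule and guarantees in the paper itself may differ.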


The Strange Origin of AI's 'Reasoning' Abilities

The Atlantic - Technology

It involves 4chan, of all places. In July 2020, 4chan's video-game discussion board looked much like the rest of the notorious online forum. There were elaborate, libidinal fantasies involving "whores" and "dragon cum," and comments on how long a gamer had to wait "before my dick can get up for another beating," as one put it. And yet, as the gamers discussed such things, they were also making a discovery of significance to the AI industry. Some of them were playing, a new text-based role-playing game that was essentially an AI version of .


ADD for Multi-Bit Image Watermarking

arXiv.org Machine Learning

As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100% decoding accuracy, with performance dropping by at most 2% under a wide range of image distortions, substantially smaller than the 14% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.
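The add-then-dot-product scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the learned watermark is replaced by fixed, mutually orthogonal Walsh patterns, the image is a flat list of pixel values in [0, 1], and the names `walsh_patterns`, `embed`, `decode`, and `strength` are hypothetical. Imperceptibility and robustness to distortions, which the paper addresses, are ignored here.

```python
import random


def walsh_patterns(num_bits, dim):
    """Stand-in for the learned watermark: orthogonal +/-1 Walsh patterns.

    Row 0 (all ones) is skipped because it would correlate with overall
    image brightness. dim must be a power of two greater than num_bits.
    """
    return [
        [1.0 if bin(i & j).count("1") % 2 == 0 else -1.0 for j in range(dim)]
        for i in range(1, num_bits + 1)
    ]


def embed(image, bits, patterns, strength=0.05):
    """Add: map bits to +/-1 signs, combine linearly with the patterns, add to image."""
    signs = [1.0 if b else -1.0 for b in bits]
    return [
        pixel + strength * sum(s * p[j] for s, p in zip(signs, patterns))
        for j, pixel in enumerate(image)
    ]


def decode(image, patterns):
    """Dot, Decode: the sign of each inner product gives one message bit."""
    return [
        1 if sum(x * w for x, w in zip(image, p)) > 0 else 0
        for p in patterns
    ]


# Toy example: an 8-bit message in a random 4096-pixel "image".
rng = random.Random(0)
image = [rng.random() for _ in range(4096)]
patterns = walsh_patterns(8, 4096)
message = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed(image, message, patterns)
recovered = decode(marked, patterns)
print(recovered == message)
```

Because the patterns are orthogonal, each inner product isolates one bit's contribution plus interference from the image content, so decoding succeeds whenever the embedding strength outweighs that interference.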


Post-Selection Distributional Model Evaluation

arXiv.org Machine Learning

Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, which requires reliable estimation of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls post-selection false coverage rate (FCR) for the distributional KPI estimates and is proved to be more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance-reliability trade-offs.


AI companies know they have an image problem. Will funding policy papers and thinktanks dig them out?

The Guardian

OpenAI logo is seen in this illustration taken on 20 May 2024. OpenAI made a surprise announcement this week - not an update to ChatGPT or another multibillion-dollar datacenter - but a policy paper that called for a reimagining of the social contract based around "a slate of people-first ideas". It's the latest move in an aggressive effort by the major AI players to reshape the narrative around their industry, as polls show public disapproval of AI increasing.


Americans 'creeped out' as ChatGPT starts inserting Arabic words into responses... before giving strange explanation

Daily Mail - Science & tech



Synthetic Data for any Differentiable Target

arXiv.org Machine Learning

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.


Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

arXiv.org Machine Learning

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.


Claude Mythos Is Everyone's Problem

The Atlantic - Technology

What happens when AI can hack everything? For the past several weeks, Anthropic says it secretly possessed a tool potentially capable of commandeering most computer servers in the world. This is a bot that, if unleashed, might be able to hack into banks, exfiltrate state secrets, and fry crucial infrastructure. Already, according to the company, this AI model has identified thousands of major cybersecurity vulnerabilities, including exploits in every single major operating system and browser. This level of cyberattack is typically available only to elite, state-sponsored hacking cells in a very small number of countries including China, Russia, and the United States.