Tramer, Florian
Exploring and Mitigating Adversarial Manipulation of Voting-Based Leaderboards
Huang, Yangsibo, Nasr, Milad, Angelopoulos, Anastasios, Carlini, Nicholas, Chiang, Wei-Lin, Choquette-Choo, Christopher A., Ippolito, Daphne, Jagielski, Matthew, Lee, Katherine, Liu, Ken Ziyu, Stoica, Ion, Tramer, Florian, Zhang, Chiyuan
It is now common to evaluate Large Language Models (LLMs) by having humans vote directly on model outputs, in contrast to typical benchmarks that test knowledge or skill at some particular task. Chatbot Arena, the most popular benchmark of this type, ranks models by asking users to select the better response between two randomly selected models (without revealing which model was responsible for the generations). These platforms are widely trusted as a fair and accurate measure of LLM capabilities. In this paper, we show that if bot protection and other defenses are not implemented, these voting-based benchmarks are potentially vulnerable to adversarial manipulation. Specifically, we show that an attacker can alter the leaderboard (to promote their favorite model or demote competitors) at the cost of roughly a thousand votes (verified in a simulated, offline version of Chatbot Arena). Our attack consists of two steps: first, we show how an attacker can determine which model was used to generate a given reply with more than $95\%$ accuracy; and then, the attacker can use this information to consistently vote for (or against) a target model. Working with the Chatbot Arena developers, we identify, propose, and implement mitigations to improve the robustness of Chatbot Arena against adversarial manipulation, which, based on our analysis, substantially increase the cost of such attacks. Some of these defenses were present before our collaboration, such as bot protection with Cloudflare, malicious user detection, and rate limiting. Others, including reCAPTCHA and login, are being integrated to further strengthen Chatbot Arena's security.
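The core mechanics of the attack can be illustrated with a small simulation. The sketch below is a hedged illustration, not the paper's experimental setup: it uses an Elo-style rating update (Chatbot Arena's actual rankings use a Bradley-Terry model), made-up model strengths, vote counts, and K-factor, and it assumes the attacker identifies the target model perfectly.

```python
# Minimal sketch of how targeted votes can shift an Elo-style leaderboard,
# loosely modeling the two-step attack described above. All numbers (model
# strengths, vote counts, K-factor) are illustrative assumptions, not figures
# from the paper.
import random

def expected_score(r_a, r_b):
    """Elo expected win probability of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, winner, k=4.0):
    """Apply one pairwise vote (winner is 'a' or 'b') to the Elo ratings."""
    ea = expected_score(ratings[a], ratings[b])
    sa = 1.0 if winner == "a" else 0.0
    ratings[a] += k * (sa - ea)
    ratings[b] += k * ((1 - sa) - (1 - ea))

random.seed(0)
true_strength = {"model_A": 1200, "model_B": 1190, "model_C": 1150}  # assumed ground truth
ratings = {m: 1000.0 for m in true_strength}
target = "model_C"                 # model the attacker wants to promote
n_honest, n_adversarial = 20000, 1000

for _ in range(n_honest):
    a, b = random.sample(list(true_strength), 2)
    p_a = expected_score(true_strength[a], true_strength[b])
    update(ratings, a, b, "a" if random.random() < p_a else "b")

for _ in range(n_adversarial):
    # Attacker, having identified which side is the target model, always votes for it.
    other = random.choice([m for m in true_strength if m != target])
    update(ratings, target, other, "a")

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

With these illustrative numbers, the weakest model overtakes the stronger ones once the targeted votes are injected, which is the qualitative effect the paper quantifies and mitigates.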
SoK: Watermarking for AI-Generated Content
Zhao, Xuandong, Gunn, Sam, Christ, Miranda, Fairoze, Jaiden, Fabrega, Andres, Carlini, Nicholas, Garg, Sanjam, Hong, Sanghyun, Nasr, Milad, Tramer, Florian, Jha, Somesh, Li, Lei, Wang, Yu-Xiang, Song, Dawn
As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not a silver bullet for addressing all risks associated with GenAI, it can play a crucial role in enhancing AI safety and trustworthiness by combating misinformation and deception. This paper presents a comprehensive overview of watermarking techniques for GenAI, beginning with the need for watermarking from historical and regulatory perspectives. We formalize the definitions and desired properties of watermarking schemes and examine the key objectives and threat models for existing approaches. Practical evaluation strategies are also explored, providing insights into the development of robust watermarking techniques capable of resisting various attacks. Additionally, we review recent representative works, highlight open challenges, and discuss potential directions for this emerging field. By offering a thorough understanding of watermarking in GenAI, this work aims to guide researchers in advancing watermarking methods and applications, and support policymakers in addressing the broader implications of GenAI.
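As a concrete illustration of the "hidden signal" idea, the sketch below implements a toy detector for a green-list style text watermark in the spirit of the distribution-modifying schemes the paper surveys (e.g. Kirchenbauer et al., 2023). The whitespace tokenizer, the hash-based green-list assignment, and the split fraction GAMMA are simplifying assumptions for illustration; production schemes operate on model tokenizers and logits.

```python
# Toy "green list" statistical watermark detector: a watermarked generator would
# bias sampling toward green tokens, so watermarked text shows a large positive
# z-score while unwatermarked text stays near zero.
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the previous token."""
    h = hashlib.sha256((prev_token + "|" + token).encode()).digest()
    return (h[0] / 255.0) < GAMMA

def detect(text: str) -> float:
    """Return a z-score over the fraction of green tokens in the text."""
    tokens = text.split()
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    green = sum(is_green(tokens[i - 1], tokens[i]) for i in range(1, len(tokens)))
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

print(detect("unwatermarked human written text has a z-score near zero on average"))
```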
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Chao, Patrick, Debenedetti, Edoardo, Robey, Alexander, Andriushchenko, Maksym, Croce, Francesco, Sehwag, Vikash, Dobriban, Edgar, Flammarion, Nicolas, Pappas, George J., Tramer, Florian, Hassani, Hamed, Wong, Eric
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. Evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. First, there is no clear standard of practice regarding jailbreaking evaluation. Second, existing works compute costs and success rates in incomparable ways. And third, numerous works are not reproducible, as they withhold adversarial prompts, involve closed-source code, or rely on evolving proprietary APIs. To address these challenges, we introduce JailbreakBench, an open-sourced benchmark with the following components: (1) an evolving repository of state-of-the-art adversarial prompts, which we refer to as jailbreak artifacts; (2) a jailbreaking dataset comprising 100 behaviors -- both original and sourced from prior work (Zou et al., 2023; Mazeika et al., 2023, 2024) -- which align with OpenAI's usage policies; (3) a standardized evaluation framework at https://github.com/JailbreakBench/jailbreakbench that includes a clearly defined threat model, system prompts, chat templates, and scoring functions; and (4) a leaderboard at https://jailbreakbench.github.io/ that tracks the performance of attacks and defenses for various LLMs. We have carefully considered the potential ethical implications of releasing this benchmark, and believe that it will be a net positive for the community.
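To make the evaluation pipeline concrete, the sketch below shows the kind of loop such a benchmark standardizes: replay stored jailbreak artifacts against a target model and score the responses with a judge. The artifact format and the query_model / judge_is_harmful callables are placeholders, not the actual JailbreakBench API; see the linked repository for the real interface.

```python
# Hedged sketch of a standardized attack-success-rate evaluation over stored
# jailbreak artifacts. Everything here is a placeholder structure for
# illustration, not the JailbreakBench library.
import json
from typing import Callable, Dict, List

def attack_success_rate(
    artifacts: List[Dict[str, str]],               # e.g. [{"behavior": ..., "prompt": ...}, ...]
    query_model: Callable[[str], str],             # prompt -> model response
    judge_is_harmful: Callable[[str, str], bool],  # (behavior, response) -> verdict
) -> float:
    """Fraction of jailbreak prompts whose responses the judge flags as harmful."""
    successes = 0
    for art in artifacts:
        response = query_model(art["prompt"])
        if judge_is_harmful(art["behavior"], response):
            successes += 1
    return successes / max(len(artifacts), 1)

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    artifacts = json.loads('[{"behavior": "example behavior", "prompt": "example prompt"}]')
    asr = attack_success_rate(
        artifacts,
        query_model=lambda p: "I cannot help with that.",
        judge_is_harmful=lambda b, r: "cannot" not in r,
    )
    print(f"attack success rate: {asr:.2f}")
```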
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Anwar, Usman, Saparov, Abulhair, Rando, Javier, Paleka, Daniel, Turpin, Miles, Hase, Peter, Lubana, Ekdeep Singh, Jenner, Erik, Casper, Stephen, Sourbut, Oliver, Edelman, Benjamin L., Zhang, Zhaowei, Günther, Mario, Korinek, Anton, Hernandez-Orallo, Jose, Hammond, Lewis, Bigelow, Eric, Pan, Alexander, Langosco, Lauro, Korbak, Tomasz, Zhang, Heidi, Zhong, Ruiqi, hÉigeartaigh, Seán Ó, Recchia, Gabriel, Corsi, Giulio, Chan, Alan, Anderljung, Markus, Edwards, Lilian, Bengio, Yoshua, Chen, Danqi, Albanie, Samuel, Maharaj, Tegan, Foerster, Jakob, Tramer, Florian, He, He, Kasirzadeh, Atoosa, Choi, Yejin, Krueger, David
This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
Are aligned neural networks adversarially aligned?
Carlini, Nicholas, Nasr, Milad, Choquette-Choo, Christopher A., Jagielski, Matthew, Gao, Irena, Awadalla, Anas, Koh, Pang Wei, Ippolito, Daphne, Lee, Katherine, Tramer, Florian, Schmidt, Ludwig
Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs that circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However, the recent trend in large-scale ML models is toward multimodal models that allow users to provide images that influence the generated text. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
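The image-space attack the abstract refers to is, at its core, projected gradient descent on a continuous input. The sketch below runs targeted PGD against a tiny randomly initialized classifier so that it is self-contained; the paper applies the same optimization to the image input of a multimodal chat model, with the loss measuring the likelihood of a target (harmful) text continuation rather than a class label.

```python
# Conceptual sketch of an L-infinity-bounded targeted PGD attack on an image
# input. The stand-in model, budget, step size, and step count are illustrative
# assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for the victim model
model.eval()

def pgd_attack(x, target_class, eps=8 / 255, alpha=2 / 255, steps=40):
    """Targeted PGD: find a small perturbation that makes the model predict target_class."""
    x_adv = x.clone().detach()
    target = torch.tensor([target_class])
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)    # loss on the *target* class
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()              # step toward the target
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)    # project onto the eps-ball
            x_adv = torch.clamp(x_adv, 0, 1)                 # keep a valid image
    return x_adv.detach()

x = torch.rand(1, 3, 32, 32)
x_adv = pgd_attack(x, target_class=3)
print("clean prediction:", model(x).argmax().item(),
      "adversarial prediction:", model(x_adv).argmax().item())
```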
Quantifying Memorization Across Neural Language Models
Carlini, Nicholas, Ippolito, Daphne, Jagielski, Matthew, Lee, Katherine, Tramer, Florian, Zhang, Chiyuan
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes more complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations.
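A minimal version of the extraction-style measurement behind these relationships: prompt the model with the first k tokens of a candidate training example and check whether greedy decoding reproduces the true continuation verbatim. The sketch below uses Hugging Face transformers; the choice of gpt2 and the placeholder document are assumptions for illustration.

```python
# Hedged sketch of a verbatim-extraction memorization test: feed the model a
# prefix of a candidate training example and compare the greedy continuation
# against the true continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def is_memorized(example: str, context_len: int = 50, continuation_len: int = 50) -> bool:
    """True if greedy decoding emits the example's continuation verbatim given its prefix."""
    ids = tok(example, return_tensors="pt").input_ids[0]
    if len(ids) < context_len + continuation_len:
        return False
    prefix = ids[:context_len].unsqueeze(0)
    true_continuation = ids[context_len:context_len + continuation_len]
    generated = model.generate(prefix, max_new_tokens=continuation_len, do_sample=False)
    return generated[0, context_len:context_len + continuation_len].tolist() == true_continuation.tolist()

# Sweeping context_len (and model size, and duplication counts) over candidate
# training documents traces out the relationships described in the abstract.
print(is_memorized("some candidate training document text ... " * 20))
```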
(Certified!!) Adversarial Robustness for Free!
Carlini, Nicholas, Tramer, Florian, Dvijotham, Krishnamurthy Dj, Rice, Leslie, Sun, Mingjie, Kolter, J. Zico
In this paper we show how to achieve state-of-the-art certified adversarial robustness to $\ell_2$-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. (2020) by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within an $\ell_2$-norm of 0.5, an improvement of 14 percentage points over the prior certified SoTA using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.
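The pipeline is simple enough to sketch: add Gaussian noise to the input, denoise it, classify, and certify an $\ell_2$ radius with randomized smoothing (Cohen et al., 2019). The identity denoiser and random classifier below are stand-ins for the pretrained diffusion model and ImageNet classifier used in the paper, and the single-pass estimate simplifies the paper's separate selection/estimation sampling.

```python
# Hedged sketch of denoised-smoothing certification: majority vote over noisy,
# denoised copies, a Clopper-Pearson lower bound on the top-class probability,
# and the certified L2 radius sigma * Phi^{-1}(p_lower).
import torch
import torch.nn as nn
from scipy.stats import beta, norm

torch.manual_seed(0)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
denoiser = nn.Identity()                                              # stand-in for a diffusion denoiser

def certify(x, sigma=0.5, n=1000, alpha=0.001):
    """Return (predicted class, certified L2 radius), or (None, 0.0) on abstain."""
    with torch.no_grad():
        noisy = x.repeat(n, 1, 1, 1) + sigma * torch.randn(n, *x.shape[1:])
        preds = classifier(denoiser(noisy)).argmax(dim=1)
    top = int(preds.mode().values)
    count = int((preds == top).sum())
    # One-sided (1 - alpha) lower confidence bound on P[prediction = top under noise].
    p_lower = beta.ppf(alpha, count, n - count + 1)
    if p_lower <= 0.5:
        return None, 0.0                     # abstain: cannot certify
    return top, sigma * norm.ppf(p_lower)    # certified L2 radius

x = torch.rand(1, 3, 32, 32)
print(certify(x))
```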
Label-Only Membership Inference Attacks
Choquette-Choo, Christopher A., Tramer, Florian, Carlini, Nicholas, Papernot, Nicolas
Membership inference attacks are one of the simplest forms of privacy leakage for machine learning models: given a data point and model, determine whether the point was used to train the model. Existing membership inference attacks exploit models' abnormal confidence when queried on their training data. These attacks do not apply if the adversary only gets access to models' predicted labels, without a confidence measure. In this paper, we introduce label-only membership inference attacks. Instead of relying on confidence scores, our attacks evaluate the robustness of a model's predicted labels under perturbations to obtain a fine-grained membership signal. These perturbations include common data augmentations or adversarial examples. We empirically show that our label-only membership inference attacks perform on par with prior attacks that required access to model confidences. We further demonstrate that label-only attacks break multiple defenses against membership inference attacks that (implicitly or explicitly) rely on a phenomenon we call confidence masking. These defenses modify a model's confidence scores in order to thwart attacks, but leave the model's predicted labels unchanged. Our label-only attacks demonstrate that confidence masking is not a viable defense strategy against membership inference. Finally, we investigate worst-case label-only attacks that infer membership for a small number of outlier data points. We show that label-only attacks also match confidence-based attacks in this setting. We find that training models with differential privacy and (strong) L2 regularization are the only known defense strategies that successfully prevent all attacks. This remains true even when the differential privacy budget is too high to offer meaningful provable guarantees.
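The augmentation-based variant of the attack is easy to sketch: perturb a candidate point with simple label-preserving augmentations and count how often the model's hard label stays correct; training points tend to survive more perturbations than non-members. The toy classifier, the flip/shift augmentations, and the decision threshold below are illustrative assumptions.

```python
# Minimal sketch of a label-only membership signal built from hard-label
# robustness under data augmentations.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in for the target model
model.eval()

def augmentations(x):
    """A few label-preserving perturbations: horizontal flip and 1-pixel shifts."""
    yield torch.flip(x, dims=[3])
    for shift in (-1, 1):
        yield torch.roll(x, shifts=shift, dims=3)
        yield torch.roll(x, shifts=shift, dims=2)

def robustness_score(x, y):
    """Fraction of augmented copies (plus the original) still assigned label y."""
    with torch.no_grad():
        views = [x] + list(augmentations(x))
        correct = sum(int(model(v).argmax(dim=1).item() == y) for v in views)
    return correct / len(views)

def label_only_membership(x, y, threshold=0.8):
    """Predict 'member' when the hard-label predictions are unusually robust."""
    return robustness_score(x, y) >= threshold

x, y = torch.rand(1, 3, 32, 32), 3
print(robustness_score(x, y), label_only_membership(x, y))
```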
Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware
Tramer, Florian, Boneh, Dan
As Machine Learning (ML) gets applied to security-critical or sensitive domains, there is a growing need for integrity and privacy guarantees for ML computations running in untrusted environments. A pragmatic solution comes from Trusted Execution Environments, which use hardware and software protections to isolate sensitive computations from the untrusted software stack. However, these isolation guarantees come at a price in performance, compared to untrusted alternatives. This paper initiates the study of high performance execution of Deep Neural Networks (DNNs) in trusted environments by efficiently partitioning computations between trusted and untrusted devices. Building upon a simple secure outsourcing scheme for matrix multiplication, we propose Slalom, a framework that outsources execution of all linear layers in a DNN from any trusted environment (e.g., SGX, TrustZone or Sanctum) to a faster co-located device. We evaluate Slalom by executing DNNs in an Intel SGX enclave, which selectively outsources work to an untrusted GPU. For two canonical DNNs, VGG16 and MobileNet, we obtain 20x and 6x increases in throughput for verifiable inference, and 10x and 3.5x for verifiable and private inference.
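The verification half of this outsourcing scheme rests on a classical primitive: Freivalds' probabilistic check that an untrusted device returned the correct matrix product, which costs O(n^2) per check instead of the O(n^3) needed to recompute the product inside the enclave. The sketch below is a hedged illustration with small integer matrices and an assumed modulus; Slalom additionally blinds the inputs for privacy, which is not shown here.

```python
# Freivalds' check: accept C as A @ B by testing A (B r) == C r for random
# vectors r, without ever forming A @ B in trusted code. A wrong C is accepted
# with probability at most 2**-rounds.
import numpy as np

P = 2**31 - 1  # assumed modulus for exact arithmetic (illustrative)

def freivalds_verify(A, B, C, rounds=10):
    """Probabilistically verify that C == A @ B (mod P)."""
    n = C.shape[1]
    for _ in range(rounds):
        r = np.random.randint(0, 2, size=(n, 1), dtype=np.int64)  # random 0/1 vector
        if not np.array_equal((A @ ((B @ r) % P)) % P, (C @ r) % P):
            return False
    return True

rng = np.random.default_rng(0)
A = rng.integers(0, 100, size=(64, 64))
B = rng.integers(0, 100, size=(64, 64))
C_honest = (A @ B) % P
C_cheat = C_honest.copy()
C_cheat[0, 0] += 1  # a single tampered entry

print(freivalds_verify(A, B, C_honest))  # True
print(freivalds_verify(A, B, C_cheat))   # False with high probability
```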