Jha, Somesh
LLM-Driven Multi-step Translation from C to Rust using Static Analysis
Zhou, Tianyang, Lin, Haowen, Jha, Somesh, Christodorescu, Mihai, Levchenko, Kirill, Chandrasekaran, Varun
Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety while maintaining high performance. However, manual translation is cumbersome, error-prone, and produces unidiomatic code. Large language models (LLMs) have demonstrated promise in producing idiomatic translations, but offer no correctness guarantees as they lack the ability to capture all the semantic differences between the source and target languages. To resolve this issue, we propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology: an "unidiomatic" step to translate C into Rust while preserving semantics, and an "idiomatic" step to refine the code to follow Rust's idiomatic standards. SACTOR utilizes information provided by static analysis of the source C program to address challenges such as pointer semantics and dependency resolution. To validate the correctness of the translated result from each step, we use end-to-end testing via the foreign function interface to embed our translated code segment into the original code. We evaluate the translation of 200 programs from two datasets and two case studies, comparing the performance of GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Llama 3.3 70B, and DeepSeek-R1 in SACTOR. Our results demonstrate that SACTOR achieves high correctness and improved idiomaticity, with the best-performing models reaching 93% (DeepSeek-R1) and 84% (GPT-4o, Claude 3.5 Sonnet, DeepSeek-R1) correctness on the two datasets, respectively, while producing more natural and Rust-compliant translations compared to existing methods.
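To make the testing step concrete, the sketch below shows how a translated function could be validated end-to-end through the foreign function interface: the original C code and the translated Rust code are each built as a shared library exposing the same C-ABI symbol, and both are called on identical inputs. The library paths, function name, and signature are hypothetical placeholders, not SACTOR's actual harness.

```python
# Minimal differential-testing sketch over the FFI (hypothetical names; not SACTOR's harness).
# Assumes the original C file and the translated Rust crate were both built as shared
# libraries exporting the same C-ABI function: int add_checked(int, int).
import ctypes

c_lib = ctypes.CDLL("./libc_original.so")          # e.g. gcc -shared -fPIC original.c -o libc_original.so
rust_lib = ctypes.CDLL("./librust_translated.so")  # e.g. cargo build with crate-type = ["cdylib"]

for lib in (c_lib, rust_lib):
    lib.add_checked.argtypes = [ctypes.c_int, ctypes.c_int]
    lib.add_checked.restype = ctypes.c_int

# End-to-end check: the translated function must agree with the C original
# on every test input before the next translation step is accepted.
for a, b in [(0, 0), (1, 2), (-5, 7), (123456, -654321)]:
    assert c_lib.add_checked(a, b) == rust_lib.add_checked(a, b), (a, b)
print("translated function matches the C original on all test inputs")
```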
PEA: Enhancing LLM Performance on Computational-Reasoning Tasks
Wang, Zi, Weng, Shiwei, Alhanahnah, Mohannad, Jha, Somesh, Reps, Tom
Large Language Models (LLMs) have exhibited significant generalization capabilities across diverse domains, prompting investigations into their potential as generic reasoning engines. Recent studies have explored inference-time computation techniques [Welleck et al., 2024, Snell et al., 2024], particularly prompt engineering methods such as Chain-of-Thought (CoT), to enhance LLM performance on complex reasoning tasks [Wei et al., 2022]. These approaches have successfully improved model performance and expanded LLMs' practical applications. However, despite the growing focus on enhancing model capabilities through inference-time computation for complex reasoning tasks, the current literature lacks a formal framework to precisely describe and characterize the complexity of reasoning problems. This study identifies a class of reasoning problems, termed computational reasoning problems, which are particularly challenging for LLMs [Yao et al., 2023, Hao et al., 2024, Valmeekam et al., 2023], such as planning problems and arithmetic games. Informally, these problems can be accurately described using succinct programmatic representations. We propose a formal framework to describe and algorithmically solve these problems. The framework employs first-order logic, equipped with efficiently computable predicates and finite domains.
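As a toy illustration of a succinct programmatic representation (not an example from the paper), the Game of 24 can be phrased as an existential statement over a finite domain with an efficiently computable predicate: does there exist an ordering of the four numbers and a choice of operators whose evaluation equals 24? A brute-force solver over that finite domain is a few lines of code:

```python
# Toy "computational reasoning problem": an existential statement over a finite domain
# (orderings x operator choices) with an efficiently computable predicate (the expression
# evaluates to 24). Illustrative only; the paper's formal framework is more general.
from itertools import permutations, product

def solve_24(nums, target=24):
    ops = ['+', '-', '*', '/']
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            expr = f"(({a}{o1}{b}){o2}{c}){o3}{d}"   # one fixed parenthesization, for brevity
            try:
                if abs(eval(expr) - target) < 1e-6:  # the predicate
                    return expr                      # a witness for the existential claim
            except ZeroDivisionError:
                continue
    return None

print(solve_24([1, 2, 3, 4]))   # prints a witness such as ((1+2)+3)*4
```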
SLVR: Securely Leveraging Client Validation for Robust Federated Learning
Choi, Jihye, Rachuri, Sai Rahul, Wang, Ke, Jha, Somesh, Wang, Yizhen
Federated Learning (FL) enables collaborative model training while keeping client data private. However, exposing individual client updates makes FL vulnerable to reconstruction attacks. Secure aggregation mitigates such privacy risks but prevents the server from verifying the validity of each client update, creating a privacy-robustness tradeoff. Recent efforts attempt to address this tradeoff by enforcing checks on client updates using zero-knowledge proofs, but they support limited predicates and often depend on public validation data. We propose SLVR, a general framework that securely leverages clients' private data through secure multi-party computation. By utilizing clients' data, SLVR not only eliminates the need for public validation data, but also enables a wider range of checks for robustness, including cross-client accuracy validation. It also adapts naturally to distribution shifts in client data, as it can securely keep its validation data up to date. Our empirical evaluations show that SLVR improves robustness against model poisoning attacks, particularly outperforming existing methods by up to 50% under adaptive attacks. Additionally, SLVR demonstrates effective adaptability and stable convergence under various distribution shift scenarios.
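The cross-client accuracy check can be pictured with the plain-Python stand-in below: each client's updated model is scored on the other clients' validation data, and updates whose average accuracy falls below a threshold are rejected. In SLVR this predicate would be evaluated under secure multi-party computation, so neither updates nor validation data are revealed in the clear; the linear model, threshold, and synthetic data here are illustrative assumptions.

```python
# In-the-clear stand-in for a cross-client accuracy check; the actual framework
# evaluates this kind of predicate under secure multi-party computation.
import numpy as np

def accuracy(w, X, y):
    """Accuracy of the linear classifier sign(X @ w) against labels y in {-1, +1}."""
    return float(np.mean(np.sign(X @ w) == y))

def cross_client_check(updates, val_sets, threshold=0.6):
    """Accept client i only if its update scores well on the *other* clients' data."""
    accepted = []
    for i, w in enumerate(updates):
        scores = [accuracy(w, X, y) for j, (X, y) in enumerate(val_sets) if j != i]
        if np.mean(scores) >= threshold:
            accepted.append(i)
    return accepted

rng = np.random.default_rng(0)
true_w = np.ones(5)
val_sets = [(X, np.sign(X @ true_w)) for X in (rng.normal(size=(50, 5)) for _ in range(3))]
updates = [true_w + 0.1 * rng.normal(size=5),   # benign update
           true_w + 0.1 * rng.normal(size=5),   # benign update
           -true_w]                             # poisoned update: flips every prediction
print(cross_client_check(updates, val_sets))    # expected: only the benign clients, e.g. [0, 1]
```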
On the Difficulty of Constructing a Robust and Publicly-Detectable Watermark
Fairoze, Jaiden, Ortiz-Jiménez, Guillermo, Vecerik, Mel, Jha, Somesh, Gowal, Sven
This work investigates the theoretical boundaries of creating publicly-detectable schemes to enable the provenance of watermarked imagery. Metadata-based approaches like C2PA provide unforgeability and public-detectability. ML techniques offer robust retrieval and watermarking. However, no existing scheme combines robustness, unforgeability, and public-detectability. In this work, we formally define such a scheme and establish its existence. Although theoretically possible, we find that at present, it is intractable to build certain components of our scheme without a leap in deep learning capabilities. We analyze these limitations and propose research directions that need to be addressed before we can practically realize robust and publicly-verifiable provenance.
SoK: Watermarking for AI-Generated Content
Zhao, Xuandong, Gunn, Sam, Christ, Miranda, Fairoze, Jaiden, Fabrega, Andres, Carlini, Nicholas, Garg, Sanjam, Hong, Sanghyun, Nasr, Milad, Tramer, Florian, Jha, Somesh, Li, Lei, Wang, Yu-Xiang, Song, Dawn
As the outputs of generative AI (GenAI) techniques improve in quality, it becomes increasingly challenging to distinguish them from human-created content. Watermarking schemes are a promising approach to address the problem of distinguishing between AI and human-generated content. These schemes embed hidden signals within AI-generated content to enable reliable detection. While watermarking is not a silver bullet for addressing all risks associated with GenAI, it can play a crucial role in enhancing AI safety and trustworthiness by combating misinformation and deception. This paper presents a comprehensive overview of watermarking techniques for GenAI, beginning with the need for watermarking from historical and regulatory perspectives. We formalize the definitions and desired properties of watermarking schemes and examine the key objectives and threat models for existing approaches. Practical evaluation strategies are also explored, providing insights into the development of robust watermarking techniques capable of resisting various attacks. Additionally, we review recent representative works, highlight open challenges, and discuss potential directions for this emerging field. By offering a thorough understanding of watermarking in GenAI, this work aims to guide researchers in advancing watermarking methods and applications, and support policymakers in addressing the broader implications of GenAI.
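For intuition only, the toy sketch below hides and recovers a short bit pattern in the least significant bits of an image array. It illustrates the bare idea of an embedded hidden signal; the watermarking schemes surveyed in the paper are embedded during generation and are designed to survive the attacks that defeat a naive scheme like this one.

```python
# Toy hidden-signal watermark (illustration only; not a scheme from the paper).
import numpy as np

def embed(image_u8, bits):
    wm = image_u8.copy().ravel()
    wm[:len(bits)] = (wm[:len(bits)] & 0xFE) | bits   # overwrite least significant bits
    return wm.reshape(image_u8.shape)

def detect(image_u8, n_bits):
    return image_u8.ravel()[:n_bits] & 1              # read the payload back out

img = np.random.randint(0, 256, size=(8, 8), dtype=np.uint8)
payload = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=np.uint8)
print(detect(embed(img, payload), len(payload)))       # recovers [1 0 1 1 0 1 0 0]
```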
Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts
Choi, Jihye, Raghuram, Jayaram, Li, Yixuan, Jha, Somesh
Advancements in foundation models (FMs) have led to a paradigm shift in machine learning. The rich, expressive feature representations from these pre-trained, large-scale FMs are leveraged for multiple downstream tasks, usually via lightweight fine-tuning of a shallow fully-connected network following the representation. However, the non-interpretable, black-box nature of this prediction pipeline can be a challenge, especially in critical domains such as healthcare, finance, and security. In this paper, we explore the potential of Concept Bottleneck Models (CBMs) for transforming complex, non-interpretable foundation models into interpretable decision-making pipelines using high-level concept vectors. Specifically, we focus on the test-time deployment of such an interpretable CBM pipeline "in the wild", where the input distribution often shifts from the original training distribution. We first identify the potential failure modes of such a pipeline under different types of distribution shifts. We then propose an adaptive concept bottleneck framework that addresses these failure modes by dynamically adapting the concept-vector bank and the prediction layer based solely on unlabeled data from the target domain, without access to the source (training) dataset. Empirical evaluations with various real-world distribution shifts show that our adaptation method produces concept-based interpretations better aligned with the test data and boosts post-deployment accuracy by up to 28%, aligning the CBM performance with that of non-interpretable classification.
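The prediction pipeline being adapted can be sketched as follows: frozen foundation-model features are scored against a bank of concept vectors, and a lightweight linear layer predicts from the resulting concept scores. The dimensions and random weights below are placeholders; the paper's contribution, how the concept bank and this final layer are updated from unlabeled target-domain data, is not shown here.

```python
# Schematic concept-bottleneck pipeline: features -> concept scores -> linear prediction.
# (Illustrative placeholder weights; the paper's adaptation procedure is not shown.)
import numpy as np

rng = np.random.default_rng(0)
feat_dim, n_concepts, n_classes = 512, 10, 3

concept_bank = rng.normal(size=(n_concepts, feat_dim))       # one vector per high-level concept
W, b = rng.normal(size=(n_classes, n_concepts)), np.zeros(n_classes)

def predict(fm_features):
    """fm_features: (batch, feat_dim) representations from a frozen foundation model."""
    concepts = fm_features @ concept_bank.T                   # (batch, n_concepts) concept scores
    logits = concepts @ W.T + b                               # interpretable final layer
    return logits.argmax(axis=1), concepts                    # predictions plus their explanation

preds, concept_scores = predict(rng.normal(size=(4, feat_dim)))
print(preds, concept_scores.shape)                            # class predictions and (4, 10) scores
```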
Defending Object Detectors against Patch Attacks with Out-of-Distribution Smoothing
Feng, Ryan, Mangaokar, Neal, Choi, Jihye, Jha, Somesh, Prakash, Atul
Machine learning models today remain vulnerable to adversarial examples [11, 27, 1, 2, 9, 10, 29], where perturbed inputs lead to unexpected model outputs. Such adversarial examples take a variety of forms, including digital attacks [11, 27] and physical attacks [9, 2, 10], where the attack can be physically realized in the real world in the form of printed stickers [9, 10] or 3D objects [2]. The patch attack in particular has attracted increasing interest due to its ability to practically inject an attack by inserting a printed physical patch into the scene. A variety of patch attack defenses have thus been proposed, including several certified [15, 5, 33, 32, 34, 19] and empirical [35, 20, 16, 37, 28, 4] defenses, many of them designed around the operation of identifying and then removing the patch. Such defenses rely on being able to accurately identify the patch attack without false positives and remove the effects of identified patches with a variety of techniques, including blacking them out [20] or setting them to the image's mean color [35]. Our first key contribution is that we unify these types of defenses under a general framework called OODSmoother (Section 3), as shown in Figure 1.
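The "identify, then remove" pattern these defenses share reduces to a small operation once a patch has been localized: mask the flagged pixels and either black them out or replace them with the image's mean color before re-running the detector. The localization step itself is defense-specific and is represented below only as a precomputed boolean mask; the code is a sketch of the shared removal step, not of OODSmoother itself.

```python
# Sketch of the shared "remove the identified patch" step (not the OODSmoother framework).
import numpy as np

def remove_patch(image, patch_mask, mode="mean"):
    """image: (H, W, 3) float array; patch_mask: (H, W) boolean array of pixels flagged as patch."""
    cleaned = image.copy()
    if mode == "black":
        cleaned[patch_mask] = 0.0                        # black the patch out
    else:
        cleaned[patch_mask] = image.mean(axis=(0, 1))    # fill with the image's mean color
    return cleaned  # the object detector is then re-run on `cleaned`

img = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64), dtype=bool)
mask[10:20, 10:20] = True                                # suppose a defense flagged these pixels
print(remove_patch(img, mask).shape)                     # (64, 64, 3)
```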
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Liu, Xiaogeng, Li, Peiran, Suh, Edward, Vorobeychik, Yevgeniy, Mao, Zhuoqing, Jha, Somesh, McDaniel, Patrick, Sun, Huan, Li, Bo, Xiao, Chaowei
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can achieve an even higher attack success rate of 93.4% on GPT-4-1106-turbo.
Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Wang, Zi, Anshumaan, Divyam, Hooda, Ashish, Chen, Yudong, Jha, Somesh
Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the \emph{functional homotopy} method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a $20\%-30\%$ improvement in success rate over existing methods in circumventing safety-aligned open-source models such as Llama-2 and Llama-3.
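The easy-to-hard schedule at the core of the method can be illustrated with a generic continuation loop on a toy discrete problem: a continuous parameter interpolates from a smooth objective to the rugged target objective, and each stage is solved by local search warm-started from the previous solution. This sketch shows only the homotopy schedule; the paper's specific contribution, deriving the easier problems from the duality between model training and input generation, is not captured here.

```python
# Toy homotopy/continuation loop: solve a series of easy-to-hard discrete problems,
# warm-starting each stage from the previous solution. (Generic illustration only.)
import random

def objective(x, t):
    easy = (x - 70) ** 2                              # t = 0: smooth landscape, easy to descend
    hard = (x - 70) ** 2 + 400 * (x % 7 != 0)         # t = 1: rugged landscape, same optimum x = 70
    return (1 - t) * easy + t * hard

def local_search(x, t, steps=200):
    for _ in range(steps):
        cand = x + random.choice([-1, 1])
        if 0 <= cand <= 100 and objective(cand, t) <= objective(x, t):
            x = cand
    return x

random.seed(0)
x = random.randint(0, 100)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):                 # easy-to-hard schedule
    x = local_search(x, t)                            # warm start from the previous stage
print(x, objective(x, 1.0))                           # expected to end at the hard problem's optimum, x = 70
```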
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Palumbo, Nils, Mangal, Ravi, Wang, Zifan, Vijayakumar, Saranya, Pasareanu, Corina S., Jha, Somesh
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We use these axioms to guide the mechanistic interpretability analysis of a Transformer-based model trained to solve the well-known 2-SAT problem. We are able to reverse engineer the algorithm learned by the model -- the model first parses the input formulas and then evaluates their satisfiability via enumeration of different possible valuations of the Boolean input variables. We also present evidence to support that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
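The algorithm attributed to the model, parse the clauses and then enumerate valuations, corresponds to the standard brute-force 2-SAT check below. It is shown only to make that description concrete; it is not a claim about the model's internal circuitry.

```python
# Brute-force 2-SAT check matching the described algorithm: enumerate all assignments
# of the Boolean variables and test every clause under each one.
from itertools import product

def is_satisfiable(clauses, n_vars):
    """clauses: list of 2-literal tuples; (1, -2) encodes (x1 OR NOT x2)."""
    for assignment in product([False, True], repeat=n_vars):
        def holds(lit):
            value = assignment[abs(lit) - 1]
            return value if lit > 0 else not value
        if all(holds(a) or holds(b) for a, b in clauses):
            return True
    return False

# (x1 OR x2) AND (NOT x1 OR x2) AND (x1 OR NOT x2) is satisfied by x1 = x2 = True.
print(is_satisfiable([(1, 2), (-1, 2), (1, -2)], n_vars=2))            # True
# Adding (NOT x1 OR NOT x2) forces x1 = x2 while forbidding both true: unsatisfiable.
print(is_satisfiable([(1, 2), (-1, 2), (1, -2), (-1, -2)], n_vars=2))  # False
```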