Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha

Oct-5-2024–arXiv.org Artificial Intelligence

Optimization methods are widely employed in deep learning to identify and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed the functional homotopy method, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these problems using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20% to 30% improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.
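The easy-to-hard idea can be illustrated with a toy sketch. This is not the authors' algorithm: the abstract does not say how the intermediate problems are constructed, so the linear blend below, the bit-string search space, and every name (`easy_loss`, `hard_loss`, `blended_loss`, `coordinate_search`, `homotopy_search`) are assumptions made purely for illustration. The sketch only shows the generic homotopy pattern: solve a family of problems indexed by a continuous parameter `t`, warm-starting each discrete search from the previous solution.

```python
from typing import Callable, List, Sequence, Tuple

Bits = List[int]


def easy_loss(x: Sequence[int], target: Sequence[int]) -> float:
    """Smooth signal: fraction of positions that differ from the target."""
    return sum(a != b for a, b in zip(x, target)) / len(target)


def hard_loss(x: Sequence[int], target: Sequence[int]) -> float:
    """Flat 'needle in a haystack' signal: 0 only on an exact match."""
    return 0.0 if list(x) == list(target) else 1.0


def blended_loss(t: float, target: Sequence[int]) -> Callable[[Sequence[int]], float]:
    """Hypothetical problem family: t=0 is the easy problem, t=1 the hard one."""
    def loss(x: Sequence[int]) -> float:
        return (1.0 - t) * easy_loss(x, target) + t * hard_loss(x, target)
    return loss


def coordinate_search(x: Sequence[int], loss: Callable[[Sequence[int]], float]) -> Tuple[Bits, float]:
    """Discrete local search: sweep single-bit flips, keep strict improvements."""
    x = list(x)
    best = loss(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            x[i] ^= 1              # try flipping position i
            cand = loss(x)
            if cand < best:
                best = cand
                improved = True
            else:
                x[i] ^= 1          # revert the flip
    return x, best


def homotopy_search(x0: Sequence[int], target: Sequence[int], levels: Sequence[float]) -> Tuple[Bits, float]:
    """Solve the problems from easy to hard, warm-starting each level."""
    x = list(x0)
    for t in levels:
        x, _ = coordinate_search(x, blended_loss(t, target))
    return x, hard_loss(x, target)


TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
START = [0] * len(TARGET)
LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]

# Easy-to-hard path versus attacking the hard problem directly.
x_h, loss_h = homotopy_search(START, TARGET, LEVELS)
x_d, loss_d = coordinate_search(START, blended_loss(1.0, TARGET))
print("homotopy:", loss_h, "direct:", loss_d)
```

In this toy setting the direct search stalls because the hard objective gives no signal for single-bit changes, while the easy-to-hard path arrives at the hard problem already holding a solution. How the paper's method builds its intermediate problems from the model itself is not described in the abstract and is not modeled here.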

Keywords: iteration, language model, optimization problem

Country:
- North America
  - United States
    - Wisconsin > Dane County
      - Madison (0.04)
    - New Mexico > Bernalillo County
      - Albuquerque (0.04)
    - California
      - San Diego County > San Diego (0.04)
      - Alameda County > Livermore (0.04)
  - Canada > Alberta
    - Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Government (0.68)
- Information Technology > Security & Privacy (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
