CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

Yu, Erxin, Li, Jing, Liao, Ming, Wang, Siqi, Gao, Zuchen, Mi, Fei, Hong, Lanqing

Jun-25-2024–arXiv.org Artificial Intelligence

As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.

category, computational linguistic, coreference, (14 more...)

arXiv.org Artificial Intelligence

Jun-25-2024

arXiv.org PDF

Add feedback

Country:
- North America
  - United States
    - Maryland > Baltimore (0.04)
    - Washington > King County
      - Seattle (0.04)
    - Texas > Travis County
      - Austin (0.04)
    - Oregon > Multnomah County
      - Portland (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - China > Hong Kong (0.04)
  - Middle East
    - Israel (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)

Genre:
- Research Report (1.00)

Industry:
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Health & Medicine
  - Therapeutic Area > Psychiatry/Psychology (0.75)
  - Consumer Health (0.72)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.71)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found