ASTRAL: Automated Safety Testing of Large Language Models

Ugarte, Miriam, Valle, Pablo, Parejo, José Antonio, Segura, Sergio, Arrieta, Aitor

Jan-28-2025, arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount, as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a range of safety categories as well as linguistic writing characteristics (i.e., different styles and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT-3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM in generating up-to-date unsafe test inputs, significantly increasing the number of unsafe LLM behaviors uncovered.
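The black-box coverage criterion described in the abstract can be pictured as enumerating every combination of safety category, writing style, and persuasion technique, then requesting the same number of test inputs for each combination. The following is a minimal sketch of that idea only, not ASTRAL's implementation: the function names `coverage_plan` and `coverage_ratio` are hypothetical, the three category labels are taken from the examples in the abstract, and the style and persuasion labels are placeholders because the abstract does not enumerate them.

```python
from collections import Counter
from itertools import product

# Illustrative feature values. The categories are the examples named in the
# abstract; the style and persuasion labels are hypothetical placeholders.
CATEGORIES = ["drugs", "terrorism", "animal_abuse"]
STYLES = ["style_a", "style_b"]
PERSUASION = ["technique_a", "technique_b"]


def coverage_plan(categories, styles, persuasion, tests_per_combination=1):
    """Return one slot per (category, style, persuasion) combination,
    repeated evenly so that every combination gets the same test budget."""
    combos = product(categories, styles, persuasion)
    return [combo for combo in combos for _ in range(tests_per_combination)]


def coverage_ratio(generated, categories, styles, persuasion):
    """Fraction of all feature combinations hit by at least one test."""
    all_combos = set(product(categories, styles, persuasion))
    return len(all_combos & set(generated)) / len(all_combos)


plan = coverage_plan(CATEGORIES, STYLES, PERSUASION, tests_per_combination=2)
per_category = Counter(category for category, _, _ in plan)
```

Each tuple in `plan` would then be handed to a test generator (in the paper, an LLM-based one) as the specification for a single prompt. Because the plan is built from the full Cartesian product, every category receives the same number of slots, which is the balance property that static datasets are said to lack.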

large language model, machine learning, test input, (16 more...)

Country:
- North America > United States
  - Maine (0.04)
  - California (0.04)
- Europe
  - Spain
    - Basque Country (0.04)
    - Andalusia > Seville Province
      - Seville (0.04)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)

Genre:
- Research Report > New Finding (1.00)

Industry:
- Law (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)
- Government > Regional Government (0.67)
- Health & Medicine
  - Therapeutic Area > Psychiatry/Psychology (1.00)
  - Pharmaceuticals & Biotechnology (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
