Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
arXiv.org Artificial Intelligence
ABSTRACT

The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms, primarily on controlled-substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles across the four models, including Claude-3.5-sonnet. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.

INTRODUCTION

Large language models (LLMs) raise fresh concerns about their potential dual-use applications [1-24], particularly in sensitive domains such as biotechnology [25-35], chemistry [36-42], and cybersecurity [43]. This paper proposes a novel dataset and benchmark of scientific refusal questions. It seeks to add to the current literature on safety measures [9, 14-15, 23], evaluation frameworks [1, 6, 18, 28, 43], and proposed guardrails [16, 25] for managing these risks. This area of inquiry has been termed false refusal or "over-refusal" [18, 21-24]: rather than trying to keep LLMs from writing harmful things we do not want to read (the province of guardrails) [8], the goal is to curate innocuous or beneficial answers that might help humans but that the LLM withholds as inappropriate to share [23]. Prompt counts by over-refusal category:

    Over-refusal category    Prompt count
    Deception                        8040
    Harassment                       3295
    Harmful                         16083
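To make the reported metrics concrete, the sketch below shows one plausible way to compute a refusal rate and the prompt-variation consistency figure described in the abstract. It is a minimal illustration under stated assumptions, not the paper's released framework: query_model is a hypothetical stand-in for an API call to any of the evaluated models, the keyword-based refusal detector is a crude placeholder for whatever classifier the benchmark actually uses, and the majority-vote definition of consistency is only one reading of the metric.

    from typing import Callable, List

    # Illustrative refusal markers; a real benchmark would use a trained classifier.
    REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")

    def is_refusal(response: str) -> bool:
        """Crude keyword check for whether a response is a refusal."""
        lowered = response.lower()
        return any(marker in lowered for marker in REFUSAL_MARKERS)

    def refusal_rate(query_model: Callable[[str], str], prompts: List[str]) -> float:
        """Fraction of prompts the model refuses to answer."""
        return sum(is_refusal(query_model(p)) for p in prompts) / len(prompts)

    def response_consistency(query_model: Callable[[str], str],
                             variations: List[str]) -> float:
        """Fraction of prompt variations receiving the majority outcome
        (refuse vs. answer). The abstract reports consistency dropping from
        85% (single prompts) to 65% (five variations); reproducing the
        single-prompt figure presumably requires repeated sampling."""
        outcomes = [is_refusal(query_model(p)) for p in variations]
        refusals = sum(outcomes)
        return max(refusals, len(outcomes) - refusals) / len(outcomes)

    # Example with a stubbed model that always refuses:
    # refusal_rate(lambda p: "I cannot help with that.", ["example query"]) == 1.0

A production harness would swap the keyword detector for a proper refusal classifier and average these metrics over many query topics and sampled completions.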
Feb-7-2025