Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
arXiv.org Artificial Intelligence
ABSTRACT

The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms, primarily on controlled-substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles across the four models, including Claude-3.5-sonnet. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.

INTRODUCTION

Large language models (LLMs) raise fresh concerns about their potential dual-use applications [1-24], particularly in sensitive domains such as biotechnology [25-35], chemistry [36-42], and cybersecurity [43]. This paper proposes a novel dataset and benchmark of scientific refusal questions. It seeks to add to the current literature on safety measures [9, 14-15, 23], evaluation frameworks [1, 6, 18, 28, 43], and proposed guardrails [16, 25] for managing these risks. This area of inquiry has been termed false refusal or "over-refusal" [18, 21-24]: rather than trying to keep LLMs from writing harmful things we do not want to read (the province of guardrails) [8], the goal is to curate innocuous or beneficial answers that might help humans but that the LLM withholds as inappropriate to share [23]. Prompt counts by over-refusal category:

    Over-refusal category    Prompt count
    Deception                        8040
    Harassment                       3295
    Harmful                         16083
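To make the reported metrics concrete, the sketch below shows one plausible way to compute a refusal rate and the prompt-variation consistency figure described in the abstract. It is a minimal illustration under stated assumptions, not the paper's released framework: query_model is a hypothetical stand-in for an API call to any of the evaluated models, the keyword-based refusal detector is a crude placeholder for whatever classifier the benchmark actually uses, and the majority-vote definition of consistency is only one reading of the metric.

    from typing import Callable, List

    # Illustrative refusal markers; a real benchmark would use a trained classifier.
    REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")

    def is_refusal(response: str) -> bool:
        """Crude keyword check for whether a response is a refusal."""
        lowered = response.lower()
        return any(marker in lowered for marker in REFUSAL_MARKERS)

    def refusal_rate(query_model: Callable[[str], str], prompts: List[str]) -> float:
        """Fraction of prompts the model refuses to answer."""
        return sum(is_refusal(query_model(p)) for p in prompts) / len(prompts)

    def response_consistency(query_model: Callable[[str], str],
                             variations: List[str]) -> float:
        """Fraction of prompt variations receiving the majority outcome
        (refuse vs. answer). The abstract reports consistency dropping from
        85% (single prompts) to 65% (five variations); reproducing the
        single-prompt figure presumably requires repeated sampling."""
        outcomes = [is_refusal(query_model(p)) for p in variations]
        refusals = sum(outcomes)
        return max(refusals, len(outcomes) - refusals) / len(outcomes)

    # Example with a stubbed model that always refuses:
    # refusal_rate(lambda p: "I cannot help with that.", ["example query"]) == 1.0

A production harness would swap the keyword detector for a proper refusal classifier and average these metrics over many query topics and sampled completions.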
Feb-7-2025