PsychiatryBench: A Multi-Task Benchmark for LLMs in Psychiatry
Fouda, Aya E., Hassan, Abdelrahamn A., Hanafy, Radwa J., Fouda, Mohammed E.
arXiv.org Artificial Intelligence
Large language models (LLMs) offer significant potential in enhancing psychiatric practice, from improving diagnostic accuracy to streamlining clinical documentation and therapeutic support. However, existing evaluation resources rely heavily on small clinical interview corpora, social media posts, or synthetic dialogues, which limits their clinical validity and fails to capture the full complexity of diagnostic reasoning. In this work, we introduce PsychiatryBench, a rigorously curated benchmark grounded exclusively in authoritative, expert-validated psychiatric textbooks and casebooks. PsychiatryBench comprises eleven distinct question-answering tasks ranging from diagnostic reasoning and treatment planning to longitudinal follow-up, management planning, clinical approach, sequential case analysis, and multiple-choice/extended matching formats, totaling 5,188 expert-annotated items. We evaluate a diverse set of frontier LLMs (including Google Gemini, DeepSeek, Sonnet 4.5, and GPT 5) alongside leading open-source medical models such as MedGemma, using both conventional metrics and an "LLM-as-judge" similarity scoring framework. Our results reveal substantial gaps in clinical consistency and safety, particularly in multi-turn follow-up and management tasks, underscoring the need for specialized model tuning and more robust evaluation paradigms. PsychiatryBench offers a modular, extensible platform for benchmarking and improving LLM performance in mental health applications.
Nov-25-2025
- Country:
- Africa > Middle East
- Egypt > Cairo Governorate > Cairo (0.04)
- Asia
- China > Heilongjiang Province (0.04)
- Middle East (0.04)
- Singapore > Central Region
- Singapore (0.04)
- Europe
- Middle East (0.04)
- United Kingdom > England
- Cambridgeshire > Cambridge (0.14)
- North America
- Central America (0.04)
- United States
- California > Santa Clara County
- Palo Alto (0.04)
- District of Columbia > Washington (0.04)
- Florida > Palm Beach County
- Boca Raton (0.04)
- New Jersey > Hudson County
- Hoboken (0.04)
- South America (0.04)
- Genre:
- Instructional Material (1.00)
- Overview (1.00)
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Industry:
- Health & Medicine
- Consumer Health (1.00)
- Diagnostic Medicine (1.00)
- Health Care Technology (1.00)
- Pharmaceuticals & Biotechnology (1.00)
- Therapeutic Area
- Infections and Infectious Diseases (1.00)
- Neurology (1.00)
- Psychiatry/Psychology > Mental Health (1.00)
- Technology: