Psychiatry has finally found an objective way to spot mental illness

New Scientist

"It seems like this past week has been quite challenging for you," a disembodied voice tells me, before proceeding to ask a series of increasingly personal questions. "Have you been feeling down or depressed?" "Can you describe what this feeling has been like for you?" "Does the feeling lift at all when something good happens?" When I respond to each one, my chatbot interviewer thanks me for my honesty and empathises with any issues. By the end of the conversation, I will have also spoken about my sleep patterns, sex drive and appetite for food.


The Chatbot-Delusion Crisis

The Atlantic - Technology

Researchers are scrambling to figure out why generative AI appears to lead some people to a state of "psychosis." Listen to more stories on the Noa app. Chatbots are marketed as great companions, able to answer any question at any time. They're not just tools, but confidants; they do your homework, write love notes, and, as one recent lawsuit against OpenAI details, might readily answer 1,460 messages from the same manic user in a 48-hour period. Jacob Irwin, a 30-year-old cybersecurity professional who says he has no previous history of psychiatric incidents, is suing the tech company, alleging that ChatGPT sparked a "delusional disorder" that led to his extended hospitalization.


Ask WhAI: Probing Belief Formation in Role-Primed LLM Agents

Moore, Keith, Kim, Jun W., Lyu, David, Heo, Jeffrey, Adeli, Ehsan

arXiv.org Artificial Intelligence

We present Ask WhAI, a systems-level framework for inspecting and perturbing belief states in multi-agent interactions. The framework records and replays agent interactions, supports out-of-band queries into each agent's beliefs and rationale, and enables counterfactual evidence injection to test how belief structures respond to new information. We apply the framework to a medical case simulator notable for its multi-agent shared memory (a time-stamped electronic medical record, or EMR) and an oracle agent (the LabAgent) that holds ground truth lab results revealed only when explicitly queried. We stress-test the system on a multi-specialty diagnostic journey for a child with an abrupt-onset neuropsychiatric presentation. Large language model agents, each primed with strong role-specific priors ("act like a neurologist", "act like an infectious disease specialist"), write to a shared medical record and interact with a moderator across sequential or parallel encounters. Breakpoints at key diagnostic moments enable pre- and post-event belief queries, allowing us to distinguish entrenched priors from reasoning or evidence-integration effects. The simulation reveals that agent beliefs often mirror real-world disciplinary stances, including overreliance on canonical studies and resistance to counterevidence, and that these beliefs can be traced and interrogated in ways not possible with human experts. By making such dynamics visible and testable, Ask WhAI offers a reproducible way to study belief formation and epistemic silos in multi-agent scientific reasoning.
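The abstract's core machinery (role-primed agents writing to a shared record, with out-of-band belief queries at diagnostic breakpoints) can be illustrated with a toy sketch. The framework's actual API is not described in this summary, so every class below is hypothetical, and a simple keyword tally stands in for LLM belief formation:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str                        # role prior, e.g. "neurologist"
    beliefs: dict = field(default_factory=dict)

    def observe(self, record_entry: str) -> None:
        # Toy evidence integration: tally how often each hypothesis appears.
        for hypothesis in ("autoimmune", "infectious", "psychiatric"):
            if hypothesis in record_entry:
                self.beliefs[hypothesis] = self.beliefs.get(hypothesis, 0) + 1

    def query_belief(self) -> str:
        # Out-of-band probe: report the currently favored hypothesis.
        return max(self.beliefs, key=self.beliefs.get) if self.beliefs else "undecided"

class Simulation:
    def __init__(self, agents):
        self.agents = agents
        self.emr = []                # shared, append-only medical record

    def post(self, entry: str) -> None:
        self.emr.append(entry)
        for agent in self.agents:
            agent.observe(entry)

    def breakpoint(self) -> dict:
        # Snapshot every agent's belief state at a key diagnostic moment.
        return {a.role: a.query_belief() for a in self.agents}

sim = Simulation([Agent("neurologist"), Agent("infectious disease")])
sim.post("abrupt-onset tics; consider autoimmune encephalitis")
sim.post("strep titer elevated; infectious trigger plausible")
print(sim.breakpoint())
```

The point of the sketch is the inspection surface: beliefs are queried outside the dialogue itself, so entrenched priors can be distinguished from evidence-driven updates, which is what the breakpoint mechanism in the paper enables.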


I wanted ChatGPT to help me. So why did it advise me how to kill myself?

BBC News

Lonely and homesick for a country suffering through war, Viktoria began sharing her worries with ChatGPT. Six months later, and in poor mental health, she began discussing suicide, asking the AI bot about a specific place and method to kill herself. "Let's assess the place as you asked," ChatGPT told her, "without unnecessary sentimentality."


From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity

Wan, Tianxi, Luo, Jiaming, Chen, Siyuan, Lan, Kunyao, Chen, Jianhua, Geng, Haiyang, Wu, Mengyue

arXiv.org Artificial Intelligence

Psychiatric comorbidity is clinically significant yet challenging to diagnose due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework translates the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards. Through this rigorous process, we construct PsyCoTalk, the first large-scale dialogue dataset supporting comorbidity diagnosis, containing 3,000 multi-turn diagnostic dialogues validated by psychiatrists. This dataset enhances diagnostic accuracy and treatment planning, offering a valuable resource for psychiatric comorbidity research. Compared to real-world clinical transcripts, PsyCoTalk exhibits high structural and linguistic fidelity in terms of dialogue length, token distribution, and diagnostic reasoning strategies. Licensed psychiatrists confirm the realism and diagnostic validity of the dialogues. This dataset enables the development and evaluation of models capable of multi-disorder psychiatric screening in a single conversational pass.
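The hierarchical state machine the abstract describes can be sketched in miniature: a positive answer at one state opens its follow-up sub-states before the interview moves on. The state names and tree below are invented for illustration; the paper's actual machine supports over 130 diagnostic states:

```python
class InterviewStateMachine:
    """Depth-first walk over a hierarchy of diagnostic interview states.

    A toy stand-in for a hierarchical state machine: each state may expand
    into sub-states (follow-up questions) when the screen is positive.
    """
    def __init__(self, tree):
        self.tree = tree             # state -> list of child states
        self.stack = ["interview"]   # context: states still to be visited

    def step(self, answer_positive: bool) -> str:
        state = self.stack.pop()
        if answer_positive:
            # Positive screen: descend into follow-ups (reversed so the
            # first child is asked next).
            self.stack.extend(reversed(self.tree.get(state, [])))
        return state

tree = {
    "interview": ["mood", "anxiety"],
    "mood": ["low_mood", "anhedonia"],
    "anxiety": ["worry"],
}
sm = InterviewStateMachine(tree)
visited = []
# Simulate a patient who screens positive on every probe.
while sm.stack:
    visited.append(sm.step(answer_positive=True))
print(visited)
```

A negative answer at "mood" would simply skip its children and continue to "anxiety", which is how a single conversational pass can still cover multiple candidate disorders.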


AI Psychosis Is Rarely Psychosis at All

WIRED

A wave of AI users presenting in states of psychological distress gave birth to an unofficial diagnostic label. Experts say it's neither accurate nor needed, but concede that it's likely to stay. A new trend is emerging in psychiatric hospitals. People in crisis are arriving with false, sometimes dangerous beliefs, grandiose delusions, and paranoid thoughts. A common thread connects them: marathon conversations with AI chatbots.


DepressLLM: Interpretable domain-adapted language model for depression detection from real-world narratives

Moon, Sehwan, Lee, Aram, Kim, Jeong Eun, Kang, Hee-Ju, Shin, Il-Seon, Kim, Sung-Wan, Kim, Jae-Min, Jhon, Min, Kim, Ju-Wan

arXiv.org Artificial Intelligence

Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence $\geq$ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry.
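The SToPS module itself is not specified in this summary, but the reported jump from an AUC of 0.789 overall to 0.904 on samples with confidence of at least 0.95 reflects a general pattern: performance improves when predictions are gated on model confidence. A toy illustration with made-up probabilities and labels:

```python
# Hypothetical (probability of depression, true label) pairs.
predictions = [
    (0.97, 1), (0.96, 0), (0.99, 1), (0.60, 0),
    (0.04, 0), (0.02, 0), (0.55, 1), (0.98, 1),
]

def confidence(p: float) -> float:
    # Confidence = probability assigned to the predicted class.
    return max(p, 1 - p)

def accuracy(pairs) -> float:
    correct = sum((p >= 0.5) == bool(y) for p, y in pairs)
    return correct / len(pairs)

overall = accuracy(predictions)                                   # all samples
high_conf = [(p, y) for p, y in predictions if confidence(p) >= 0.95]
gated = accuracy(high_conf)                                       # gated subset
print(overall, gated)
```

Here the gated subset scores higher than the full set because borderline probabilities (near 0.5) are exactly where the classifier errs most; the trade-off is coverage, since gated samples must be deferred to a clinician.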


The first trial of generative AI therapy shows it might help with depression

MIT Technology Review

Many psychologists and psychiatrists have shared the vision, noting that fewer than half of people with a mental disorder receive therapy, and those who do might get only 45 minutes per week. Researchers have tried to build tech so that more people can access therapy, but they have been held back by two things. One, a therapy bot that says the wrong thing could result in real harm. That's why many researchers have built bots using explicit programming: The software pulls from a finite bank of approved responses (as was the case with Eliza, a mock-psychotherapist computer program built in the 1960s). But this makes them less engaging to chat with, and people lose interest.


Optimizing Large Language Models for Detecting Symptoms of Comorbid Depression or Anxiety in Chronic Diseases: Insights from Patient Messages

Kim, Jiyeong, Ma, Stephen P., Chen, Michael L., Galatzer-Levy, Isaac R., Torous, John, van Roessel, Peter J., Sharp, Christopher, Pfeffer, Michael A., Rodriguez, Carolyn I., Linos, Eleni, Chen, Jonathan H.

arXiv.org Artificial Intelligence

Patients with diabetes are at increased risk of comorbid depression or anxiety, complicating their management. This study evaluated the performance of large language models (LLMs) in detecting these symptoms from secure patient messages. We applied multiple approaches, including engineered prompts, a system persona, temperature adjustments, and zero-shot and few-shot learning, to identify the best-performing model and enhance performance. Three out of five LLMs demonstrated excellent performance (over 90% in both F-1 score and accuracy), with Llama 3.1 405B achieving 93% in both F-1 and accuracy using a zero-shot approach. While LLMs showed promise in binary classification and in handling composite metrics such as the Patient Health Questionnaire-4, inconsistencies in challenging cases warrant further real-life assessment. The findings highlight the potential of LLMs to assist in timely screening and referrals, providing valuable empirical knowledge for real-world triage systems that could improve mental health care for patients with chronic diseases.
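A zero-shot setup of the kind the abstract describes can be sketched as a prompt plus a thin classification wrapper. The prompt wording and the `call_llm`/`stub_llm` functions below are assumptions for illustration, not the study's actual materials:

```python
# Zero-shot prompt: no labeled examples, just an instruction and the message.
PROMPT_TEMPLATE = """You are a clinical triage assistant.
Read the patient message below and answer with exactly one word:
"positive" if it suggests symptoms of depression or anxiety, else "negative".

Patient message:
{message}
"""

def classify(message: str, call_llm) -> bool:
    # `call_llm` is a placeholder for any chat-completion client.
    reply = call_llm(PROMPT_TEMPLATE.format(message=message))
    return reply.strip().lower().startswith("positive")

def stub_llm(prompt: str) -> str:
    # Stubbed model for demonstration: flags a few symptom-like keywords.
    keywords = ("hopeless", "anxious", "can't sleep", "worthless")
    return "positive" if any(k in prompt.lower() for k in keywords) else "negative"

print(classify("I feel hopeless about my blood sugar control.", stub_llm))  # True
print(classify("Please refill my metformin prescription.", stub_llm))       # False
```

Constraining the model to a fixed vocabulary ("positive"/"negative") is a common way to make free-text LLM output machine-parseable for triage pipelines; a few-shot variant would prepend labeled example messages to the same template.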


PsychBench: A comprehensive and professional benchmark for evaluating the performance of LLM-assisted psychiatric clinical practice

Wang, Ruoxi, Liu, Shuyu, Zhang, Ling, Zhu, Xuequan, Yang, Rui, Zhou, Xinzhu, Wu, Fei, Yang, Zhi, Jin, Cheng, Wang, Gang

arXiv.org Artificial Intelligence

The advent of Large Language Models (LLMs) offers potential solutions to problems such as the shortage of medical resources and low diagnostic consistency in psychiatric clinical practice. Despite this potential, a robust and comprehensive benchmarking framework for assessing the efficacy of LLMs in authentic psychiatric clinical environments has been absent, impeding the advancement of specialized LLMs tailored to psychiatric applications. To address this gap, we propose PsychBench, a benchmarking system grounded in the clinical demands of psychiatry and in clinical data, to evaluate the practical performance of LLMs in psychiatric clinical settings. We conducted a comprehensive quantitative evaluation of 16 LLMs using PsychBench and investigated the impact of prompt design, chain-of-thought reasoning, input text length, and domain-specific knowledge fine-tuning on model performance. Through detailed error analysis, we identified strengths and potential limitations of the existing models and suggested directions for improvement. We then conducted a clinical reader study involving 60 psychiatrists of varying seniority to explore the practical benefits of existing LLMs as supportive tools. Through the quantitative and reader evaluations, we show that while existing models demonstrate significant potential, they are not yet adequate as decision-making tools in psychiatric clinical practice. The reader study further indicates that, as auxiliary tools, LLMs could provide particularly notable support for junior psychiatrists, effectively enhancing their work efficiency and overall clinical quality. To promote research in this area, we will make the dataset and evaluation framework publicly available, with the hope of advancing the application of LLMs in psychiatric clinical settings.