Artificial Intelligence health advice accuracy varies across languages and contexts

Garg, Prashant, Fetzer, Thiemo

arXiv.org Artificial Intelligence 

Using basic health statements authorized by UK and EU registers and ~9,100 journalist-vetted public-health assertions on topics such as abortion, COVID-19 and politics, drawn from sources ranging from peer-reviewed journals and government advisories to social media and news across the political spectrum, we benchmark six leading large language models in 21 languages. We find that -- despite high accuracy on English-centric textbook claims -- performance falls in multiple non-European languages and fluctuates by topic and source, highlighting the urgency of comprehensive multilingual, domain-aware validation before deploying AI in global health communication.

Main Text:

Recent evidence suggests that 17% of U.S. adults -- and a striking 25% of those aged 18-29 -- now consult AI chatbots for health questions at least once a month (1), while in Australia nearly 10% of adults did so in just the first half of 2024 (2). Beyond mere curiosity, these tools can substantially improve comprehension: running standard discharge notes through GPT-4 reduced the average reading grade level from 11th to 6th and boosted patient-understandability scores from 13% to 81% (3). Yet as fluently as large language models (LLMs) can rephrase medical text, they lack formal clinical vetting and still rely on statistical patterns in their training data. When generative AI echoes unverified or dangerous claims, it risks amplifying harm.