Large Language Model



MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models

Neural Information Processing Systems

However, there is little to no understanding of the notion of medical safety in the context of LLMs, let alone how to evaluate and improve it. To address this gap, we first define the notion of medical safety in LLMs based on the Principles of Medical Ethics set forth by the American Medical Association.


ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

Neural Information Processing Systems

GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is (i.e., more deviated from



Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning

Neural Information Processing Systems

Recently, large language models (LLMs) (e.g., GPT-3 and ChatGPT) have shown remarkable zero-shot and few-shot performance on various tasks [