Large Language Model
Utilizing ChatGPT to Enhance Clinical Trial Enrollment
Peikos, Georgios, Symeonidis, Symeon, Kasela, Pranav, Pasi, Gabriella
Clinical trials are a critical component of evaluating the effectiveness of new medical interventions and driving advancements in medical research. Therefore, timely enrollment of patients is crucial to prevent delays or premature termination of trials. In this context, Electronic Health Records (EHRs) have emerged as a valuable tool for identifying and enrolling eligible participants. In this study, we propose an automated approach that leverages ChatGPT, a large language model, to extract patient-related information from unstructured clinical notes and generate search queries for retrieving potentially eligible clinical trials. Our empirical evaluation, conducted on two benchmark retrieval collections, shows improved retrieval performance compared to existing approaches when several general-purposed and task-specific prompts are used. Notably, ChatGPT-generated queries also outperform human-generated queries in terms of retrieval performance. These findings highlight the potential use of ChatGPT to enhance clinical trial enrollment while ensuring the quality of medical service and minimizing direct risks to patients.
Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions
Yang, Hui, Yue, Sifu, He, Yunzhong
Auto-GPT is an autonomous agent that leverages recent advancements in adapting Large Language Models (LLMs) for decision-making tasks. While there has been a growing interest in Auto-GPT stypled agents, questions remain regarding the effectiveness and flexibility of Auto-GPT in solving real-world decision-making tasks. Its limited capability for real-world engagement and the absence of benchmarks contribute to these uncertainties. In this paper, we present a comprehensive benchmark study of Auto-GPT styled agents in decision-making tasks that simulate real-world scenarios. Our aim is to gain deeper insights into this problem and understand the adaptability of GPT-based agents. We compare the performance of popular LLMs such as GPT-4, GPT-3.5, Claude, and Vicuna in Auto-GPT styled decision-making tasks. Furthermore, we introduce the Additional Opinions algorithm, an easy and effective method that incorporates supervised/imitation-based learners into the Auto-GPT scheme. This approach enables lightweight supervised learning without requiring fine-tuning of the foundational LLMs. We demonstrate through careful baseline comparisons and ablation studies that the Additional Opinions algorithm significantly enhances performance in online decision-making benchmarks, including WebShop and ALFWorld.
Towards Coding Social Science Datasets with Language Models
Rytting, Christopher Michael, Sorensen, Taylor, Argyle, Lisa, Busby, Ethan, Fulda, Nancy, Gubler, Joshua, Wingate, David
Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This kind of human coding forms an important part of social science research, yet the coding process is both resource intensive and highly variable from application to application. In some cases, efforts to automate this process have achieved human-level accuracies, but to achieve this, these attempts frequently rely on thousands of hand-labeled training examples, which makes them inapplicable to small-scale research studies and costly for large ones. Recent advances in a specific kind of artificial intelligence tool - language models (LMs) - provide a solution to this problem. Work in computer science makes it clear that LMs are able to classify text, without the cost (in financial terms and human effort) of alternative methods. To demonstrate the possibilities of LMs in this area of political science, we use GPT-3, one of the most advanced LMs, as a synthetic coder and compare it to human coders. We find that GPT-3 can match the performance of typical human coders and offers benefits over other machine learning methods of coding text. We find this across a variety of domains using very different coding procedures. This provides exciting evidence that language models can serve as a critical advance in the coding of open-ended texts in a variety of applications.
ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation
Yim, Wen-wai, Fu, Yujuan, Abacha, Asma Ben, Snider, Neal, Lin, Thomas, Yetisgen, Meliha
Recent immense breakthroughs in generative models such as in GPT4 have precipitated re-imagined ubiquitous usage of these models in all applications. One area that can benefit by improvements in artificial intelligence (AI) is healthcare. The note generation task from doctor-patient encounters, and its associated electronic medical record documentation, is one of the most arduous time-consuming tasks for physicians. It is also a natural prime potential beneficiary to advances in generative models. However with such advances, benchmarking is more critical than ever. Whether studying model weaknesses or developing new evaluation metrics, shared open datasets are an imperative part of understanding the current state-of-the-art. Unfortunately as clinic encounter conversations are not routinely recorded and are difficult to ethically share due to patient confidentiality, there are no sufficiently large clinic dialogue-note datasets to benchmark this task. Here we present the Ambient Clinical Intelligence Benchmark (ACI-BENCH) corpus, the largest dataset to date tackling the problem of AI-assisted note generation from visit dialogue. We also present the benchmark performances of several common state-of-the-art approaches.
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Hoelscher-Obermaier, Jason, Persson, Julia, Kran, Esben, Konstas, Ioannis, Barez, Fazl
Recent model editing techniques promise to mitigate the problem of memorizing false or outdated associations during LLM training. However, we show that these techniques can introduce large unwanted side effects which are not detected by existing specificity benchmarks. We extend the existing CounterFact benchmark to include a dynamic component and dub our benchmark CounterFact+. Additionally, we extend the metrics used for measuring specificity by a principled KL divergence-based metric. We use this improved benchmark to evaluate recent model editing techniques and find that they suffer from low specificity. Our findings highlight the need for improved specificity benchmarks that identify and prevent unwanted side effects.
WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models
Giorgi, John, Toma, Augustin, Xie, Ronald, Chen, Sondra S., An, Kevin R., Zheng, Grace X., Wang, Bo
This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.
Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations
Lyu, Xinxi, Min, Sewon, Beltagy, Iz, Zettlemoyer, Luke, Hajishirzi, Hannaneh
Although large language models can be prompted for both zero- and few-shot learning, performance drops significantly when no demonstrations are available. In this paper, we introduce Z-ICL, a new zero-shot method that closes the gap by constructing pseudo-demonstrations for a given test input using a raw text corpus. Concretely, pseudo-demonstrations are constructed by (1) finding the nearest neighbors to the test input from the corpus and pairing them with random task labels, and (2) applying a set of techniques to reduce the amount of direct copying the model does from the resulting demonstrations. Evaluation on nine classification datasets shows that Z-ICL outperforms previous zero-shot methods by a significant margin, and is on par with in-context learning with labeled training data in the few-shot setting. Overall, Z-ICL provides a significantly higher estimate of the zero-shot performance levels of a model, and supports future efforts to develop better pseudo-demonstrations that further improve zero-shot results.
Japan issues administrative guidance to ChatGPT operator
The government said Friday it has issued administrative guidance to ChatGPT operator OpenAI due to its insufficient consideration of protocols to protect personal information. The guidance, issued Thursday by the government's Personal Information Protection Commission and based on the personal information protection law, pointed to the possibility of ChatGPT infringing on privacy by obtaining sensitive personal information without prior consent. The commission said that it has not confirmed any specific violation of the law so far. This could be due to a conflict with your ad-blocking or security software. Please add japantimes.co.jp and piano.io to your list of allowed sites.
Should We, and Can We, Put the Brakes on Artificial Intelligence?
Sign up to receive our weekly newsletter of the best New Yorker podcasts. Sam Altman, the C.E.O. of OpenAI, which created ChatGPT, says that artificial intelligence is a powerful tool that will streamline human work and quicken the pace of scientific advancement. But ChatGPT has both enthralled and terrified us, and even some of A.I.'s pioneers are freaked out by the technology and how quickly it has advanced. David Remnick talks with Altman, and with the computer scientist Yoshua Bengio, who won the prestigious Turing Award for his work in 2018, but recently signed an open letter calling for a moratorium on some A.I. research until regulation can be implemented. The stakes, Bengio says, are high: "I believe there is a non-negligible risk that this kind of technology, in the short term, could disrupt democracies."
AI Doomerism Is a Decoy
On Tuesday morning, the merchants of artificial intelligence warned once again about the existential might of their products. Hundreds of AI executives, researchers, and other tech and business figures, including OpenAI CEO Sam Altman and Bill Gates, signed a one-sentence statement written by the Center for AI Safety declaring that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." Those 22 words were released following a multi-week tour in which executives from OpenAI, Microsoft, Google, and other tech companies called for limited regulation of AI. They spoke before Congress, in the European Union, and elsewhere about the need for industry and governments to collaborate to curb their product's harms--even as their companies continue to invest billions in the technology. Several prominent AI researchers and critics told me that they're skeptical of the rhetoric, and that Big Tech's proposed regulations appear defanged and self-serving.