Gupta, Ashok
Domain-specific Question Answering with Hybrid Search
Sultania, Dewang, Lu, Zhaoyu, Naik, Twisha, Dernoncourt, Franck, Yoon, David Seunghyun, Sharma, Sanat, Bui, Trung, Gupta, Ashok, Vatsa, Tushar, Suresha, Suhas, Verma, Ishita, Belavadi, Vibha, Chen, Cheng, Friedrich, Michael
With the increasing adoption of Large Language Models (LLMs) in enterprise settings, ensuring accurate and reliable question-answering systems remains a critical challenge. Building upon our previous work on domain-specific question answering about Adobe products (Sharma et al. 2024), which established a retrieval-aware framework with self-supervised training, we now present a production-ready, generalizable architecture alongside a comprehensive evaluation methodology. Our core contribution is a flexible, scalable framework built on Elasticsearch that can be adapted for any LLM-based question-answering system. This framework seamlessly integrates hybrid retrieval mechanisms, combining dense and sparse search with boost matching. Our contributions include: a production-ready, generalizable framework for LLM-based QA systems built on Elasticsearch; a flexible hybrid retrieval mechanism combining dense and sparse search methods; a comprehensive evaluation framework for assessing QA system performance; and empirical analysis demonstrating the effectiveness of our approach across various metrics. Through this work, we provide not only theoretical insights but also a practical, deployable solution for building reliable domain-specific question-answering systems that can be adapted to various enterprise needs.
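The hybrid retrieval the abstract describes can be sketched as a single Elasticsearch request that pairs a boosted BM25 (sparse) match with an approximate kNN (dense) search. This is a minimal illustration, not the paper's implementation: the index fields (`title`, `body`, `embedding`), the boost value, and the embedding dimensionality are all assumptions.

```python
# Sketch of a hybrid Elasticsearch query body combining sparse (BM25) and
# dense (kNN) retrieval with a field boost. In ES 8.x, a top-level "knn"
# clause alongside "query" blends the two score contributions.
# Field names and parameters are illustrative assumptions.

def build_hybrid_query(question: str, query_vector: list[float],
                       title_boost: float = 2.0, k: int = 10) -> dict:
    """Return a request body that blends a boosted BM25 match
    with an approximate kNN search over dense embeddings."""
    return {
        "query": {  # sparse side: BM25 with a boost on the title field
            "multi_match": {
                "query": question,
                "fields": [f"title^{title_boost}", "body"],
            }
        },
        "knn": {    # dense side: approximate nearest-neighbor search
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 5 * k,
        },
        "size": k,
    }

# Hypothetical usage; the 384-dim zero vector stands in for a real embedding.
body = build_hybrid_query("How do I export a PDF?", [0.0] * 384)
```

In a deployment this body would be sent via `Elasticsearch.search(index=..., body=body)`; the relative weight of the sparse and dense sides can be tuned with per-clause boosts.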
The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical Assistant
Zhang, Jingqing, Sun, Kai, Jagadeesh, Akshay, Ghahfarokhi, Mahta, Gupta, Deepa, Gupta, Ashok, Gupta, Vibhor, Guo, Yike
Recent studies have demonstrated promising performance of ChatGPT and GPT-4 on several medical-domain tasks. However, none has assessed their performance using a large-scale, real-world electronic health record database, nor evaluated their utility in providing clinical diagnostic assistance for patients across a full range of disease presentations. We performed two analyses using ChatGPT and GPT-4: one to identify patients with specific medical diagnoses using a large real-world electronic health record database, and the other to provide diagnostic assistance to healthcare workers in the prospective evaluation of hypothetical patients. Our results show that, with chain-of-thought and few-shot prompting, GPT-4 can achieve F1 scores as high as 96% across disease classification tasks. For patient assessment, GPT-4 produced an accurate diagnosis three out of four times. However, the outputs included factually incorrect statements, overlooked crucial medical findings, and recommended unnecessary investigations and overtreatment. These issues, coupled with privacy concerns, make these models currently inadequate for real-world clinical use. Nevertheless, the limited data and time needed for prompt engineering, compared with configuring conventional machine learning workflows, highlight their potential for scalability across healthcare applications.
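The chain-of-thought, few-shot prompting strategy mentioned above can be sketched as simple prompt assembly: a task instruction, a handful of worked examples with explicit reasoning, then the target note. The example record, its reasoning, and the label below are invented for illustration and are not drawn from the paper's EHR data.

```python
# Minimal sketch of few-shot, chain-of-thought prompt construction for a
# disease classification task. The example note and label are hypothetical.

FEW_SHOT_EXAMPLES = [
    {
        "note": "Patient reports polyuria and polydipsia; HbA1c 9.1%.",
        "reasoning": "Elevated HbA1c with classic osmotic symptoms "
                     "indicates poorly controlled hyperglycemia.",
        "label": "diabetes mellitus",
    },
]

def build_cot_prompt(note: str) -> str:
    """Assemble a prompt that shows worked examples with explicit
    reasoning, then asks the model to reason before labeling."""
    parts = ["Classify the clinical note. Think step by step, "
             "then give a final diagnosis label."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Note: {ex['note']}\n"
                     f"Reasoning: {ex['reasoning']}\n"
                     f"Diagnosis: {ex['label']}")
    # End with an open "Reasoning:" cue so the model reasons first.
    parts.append(f"Note: {note}\nReasoning:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    "Productive cough, fever, right lower lobe consolidation on imaging.")
```

The resulting string would be sent to the model as the user message; the trailing "Reasoning:" cue elicits the step-by-step rationale before the diagnosis.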