AITopics | Chen, Qingyu

Collaborating Authors

Chen, Qingyu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Xuan, Weihao, Yang, Rui, Qi, Heli, Zeng, Qingcheng, Xiao, Yunze, Xing, Yun, Wang, Junjue, Li, Huitao, Li, Xin, Yu, Kunyu, Liu, Nan, Chen, Qingyu, Teodoro, Douglas, Marrese-Taylor, Edison, Lu, Shijian, Iwasawa, Yusuke, Matsuo, Yutaka, Li, Irene

arXiv.org Artificial IntelligenceMar-13-2025

Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.10497

Country:

Asia > Japan (0.14)
Asia > Thailand (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking

Chen, Yingjian, Liu, Haoran, Liu, Yinhong, Yang, Rui, Yuan, Han, Fu, Yanran, Zhou, Pengyuan, Chen, Qingyu, Caverlee, James, Li, Irene

arXiv.org Artificial IntelligenceFeb-23-2025

Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. These errors are fatal in some specialized domains such as medicine. Existing fact-checking with grounding documents methods face two main challenges: (1) they struggle to understand complex multihop relations in long documents, often overlooking subtle factual errors; (2) most specialized methods rely on pairwise comparisons, requiring multiple model calls, leading to high resource and computational costs. To address these challenges, we propose \textbf{\textit{GraphCheck}}, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs as a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains which are often overlooked by existing methods, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate a 6.1\% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves comparable performance with state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2502.16514

Country:

Asia (0.93)
North America > Mexico > Mexico City (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Add feedback

Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?

Hou, Qingshan, Zhou, Yukun, Goh, Jocelyn Hui Lin, Zou, Ke, Yew, Samantha Min Er, Srinivasan, Sahana, Wang, Meng, Lo, Thaddaeus, Lei, Xiaofeng, Wagner, Siegfried K., Chia, Mark A., Yang, Dawei, Jiang, Hongyang, Ran, AnRan, Santos, Rui, Somfai, Gabor Mark, Zhou, Juan Helen, Chen, Haoyu, Chen, Qingyu, Cheung, Carol Yim-Lui, Keane, Pearse A., Tham, Yih Chung

arXiv.org Artificial IntelligenceFeb-10-2025

The advent of foundation models (FMs) is transforming medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2502.06289

Country:

Asia > China > Guangdong Province (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study

Srinivasan, Sahana, Ai, Xuguang, Zou, Minjie, Zou, Ke, Kim, Hyunjae, Lo, Thaddaeus Wai Soon, Pushpanathan, Krithi, Kong, Yiming, Li, Anran, Singer, Maxwell, Jin, Kai, Antaki, Fares, Chen, David Ziyou, Liu, Dianbo, Adelman, Ron A., Chen, Qingyu, Tham, Yih Chung

arXiv.org Artificial IntelligenceJan-19-2025

Question: What is the performance and reasoning ability of OpenAI o1 compared to other large language models in addressing ophthalmology-specific questions? Findings: This study evaluated OpenAI o1 and five LLMs using 6,990 ophthalmological questions from MedMCQA. O1 achieved the highest accuracy (0.88) and macro-F1 score but ranked third in reasoning capabilities based on text-generation metrics. Across subtopics, o1 ranked first in ``Lens'' and ``Glaucoma'' but second to GPT-4o in ``Corneal and External Diseases'', ``Vitreous and Retina'' and ``Oculoplastic and Orbital Diseases''. Subgroup analyses showed o1 performed better on queries with longer ground truth explanations. Meaning: O1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields like ophthalmology.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.13949

Country:

North America > United States (0.28)
Asia > China > Zhejiang Province (0.14)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.96)

Industry: Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.85)

Add feedback

Enhancing Patient-Centric Communication: Leveraging LLMs to Simulate Patient Perspectives

Ma, Xinyao, Zhu, Rui, Wang, Zihao, Xiong, Jingwei, Chen, Qingyu, Tang, Haixu, Camp, L. Jean, Ohno-Machado, Lucila

arXiv.org Artificial IntelligenceJan-12-2025

Large Language Models (LLMs) have demonstrated impressive capabilities in role-playing scenarios, particularly in simulating domain-specific experts using tailored prompts. This ability enables LLMs to adopt the persona of individuals with specific backgrounds, offering a cost-effective and efficient alternative to traditional, resource-intensive user studies. By mimicking human behavior, LLMs can anticipate responses based on concrete demographic or professional profiles. In this paper, we evaluate the effectiveness of LLMs in simulating individuals with diverse backgrounds and analyze the consistency of these simulated behaviors compared to real-world outcomes. In particular, we explore the potential of LLMs to interpret and respond to discharge summaries provided to patients leaving the Intensive Care Unit (ICU). We evaluate and compare with human responses the comprehensibility of discharge summaries among individuals with varying educational backgrounds, using this analysis to assess the strengths and limitations of LLM-driven simulations. Notably, when LLMs are primed with educational background information, they deliver accurate and actionable medical guidance 88% of the time. However, when other information is provided, performance significantly drops, falling below random chance levels. This preliminary study shows the potential benefits and pitfalls of automatically generating patient-specific health information from diverse populations. While LLMs show promise in simulating health personas, our results highlight critical gaps that must be addressed before they can be reliably used in clinical settings. Our findings suggest that a straightforward query-response model could outperform a more tailored approach in delivering health information. This is a crucial first step in understanding how LLMs can be optimized for personalized health communication while maintaining accuracy.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2501.06964

Country: North America > United States > California > Yolo County > Davis (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Consumer Health (1.00)
Education > Educational Setting > Higher Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Information Extraction from Clinical Notes: Are We Ready to Switch to Large Language Models?

Hu, Yan, Zuo, Xu, Zhou, Yujia, Peng, Xueqing, Huang, Jimin, Keloth, Vipina K., Zhang, Vincent J., Weng, Ruey-Ling, Chen, Qingyu, Jiang, Xiaoqian, Roberts, Kirk E., Xu, Hua

arXiv.org Artificial IntelligenceJan-7-2025

Backgrounds: Information extraction (IE) is critical in clinical natural language processing (NLP). While large language models (LLMs) excel on generative tasks, their performance on extractive tasks remains debated. Methods: We investigated Named Entity Recognition (NER) and Relation Extraction (RE) using 1,588 clinical notes from four sources (UT Physicians, MTSamples, MIMIC-III, and i2b2). We developed an annotated corpus covering 4 clinical entities and 16 modifiers, and compared instruction-tuned LLaMA-2 and LLaMA-3 against BERT in terms of performance, generalizability, computational resources, and throughput to BERT. Results: LLaMA models outperformed BERT across datasets. With sufficient training data, LLaMA showed modest improvements (1% on NER, 1.5-3.7% on RE); improvements were larger with limited training data. On unseen i2b2 data, LLaMA-3-70B outperformed BERT by 7% (F1) on NER and 4% on RE. However, LLaMA models required more computing resources and ran up to 28 times slower. We implemented "Kiwi," a clinical IE package featuring both models, available at https://kiwi.clinicalnlp.org/. Conclusion: This study is among the first to develop and evaluate a comprehensive clinical IE system using open-source LLMs. Results indicate that LLaMA models outperform BERT for clinical NER and RE but with higher computational costs and lower throughputs. These findings highlight that choosing between LLMs and traditional deep learning methods for clinical IE applications should remain task-specific, taking into account both performance metrics and practical considerations such as available computing resources and the intended use case scenarios.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.1002

Country: North America > United States > Texas (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry: Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Demystifying Large Language Models for Medicine: A Primer

Jin, Qiao, Wan, Nicholas, Leaman, Robert, Tian, Shubo, Wang, Zhizheng, Yang, Yifan, Wang, Zifeng, Xiong, Guangzhi, Lai, Po-Ting, Zhu, Qingqing, Hou, Benjamin, Sarfo-Gyamfi, Maame, Zhang, Gongbo, Gilson, Aidan, Bhasuran, Balu, He, Zhe, Zhang, Aidong, Sun, Jimeng, Weng, Chunhua, Summers, Ronald M., Chen, Qingyu, Peng, Yifan, Lu, Zhiyong

arXiv.org Artificial IntelligenceNov-19-2024

Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare by generating human-like responses across diverse contexts and adapting to novel tasks following human instructions. Their potential application spans a broad range of medical tasks, such as clinical documentation, matching patients to clinical trials, and answering medical questions. In this primer paper, we propose an actionable guideline to help healthcare professionals more efficiently utilize LLMs in their work, along with a set of best practices. This approach consists of several main phases, including formulating the task, choosing LLMs, prompt engineering, fine-tuning, and deployment. We start with the discussion of critical considerations in identifying healthcare tasks that align with the core capabilities of LLMs and selecting models based on the selected task and data, performance requirements, and model interface. We then review the strategies, such as prompt engineering and fine-tuning, to adapt standard LLMs to specialized medical tasks. Deployment considerations, including regulatory compliance, ethical guidelines, and continuous monitoring for fairness and bias, are also discussed. By providing a structured step-by-step methodology, this tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice, ensuring that these powerful technologies are applied in a safe, reliable, and impactful manner.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2410.18856

Country:

North America > United States > Virginia > Albemarle County > Charlottesville (0.14)
North America > United States > Illinois > Champaign County (0.14)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.89)
Health & Medicine > Health Care Technology > Medical Record (0.68)
Health & Medicine > Diagnostic Medicine > Imaging (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators

Wan, Nicholas, Jin, Qiao, Chan, Joey, Xiong, Guangzhi, Applebaum, Serina, Gilson, Aidan, McMurry, Reid, Taylor, R. Andrew, Zhang, Aidong, Chen, Qingyu, Lu, Zhiyong

arXiv.org Artificial IntelligenceNov-8-2024

Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to effectively support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to various multiple-choice clinical scenarios such as risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, with 1,009 question-answer pairs across 35 clinical calculators and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, provided an answer accuracy of 74.3% (CI: 71.5-76.9%), With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (56.6%) and calculator knowledge (8.1%), our findings emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2411.05897

Country: North America > United States > Virginia (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.94)

Industry:

Health & Medicine > Health Care Providers & Services (0.94)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

Add feedback

MedINST: Meta Dataset of Biomedical Instructions

Han, Wenhan, Fang, Meng, Zhang, Zihan, Yin, Yu, Song, Zirui, Chen, Ling, Pechenizkiy, Mykola, Chen, Qingyu

arXiv.org Artificial IntelligenceOct-17-2024

The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises 133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Using MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs' generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2410.13458

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Language Enhanced Model for Eye (LEME): An Open-Source Ophthalmology-Specific Large Language Model

Gilson, Aidan, Ai, Xuguang, Xie, Qianqian, Srinivasan, Sahana, Pushpanathan, Krithi, Singer, Maxwell B., Huang, Jimin, Kim, Hyunjae, Long, Erping, Wan, Peixing, Del Priore, Luciano V., Ohno-Machado, Lucila, Xu, Hua, Liu, Dianbo, Adelman, Ron A., Tham, Yih-Chung, Chen, Qingyu

arXiv.org Artificial IntelligenceSep-30-2024

Large Language Models (LLMs) are poised to revolutionize healthcare. Ophthalmology-specific LLMs remain scarce and underexplored. We introduced an open-source, specialized LLM for ophthalmology, termed Language Enhanced Model for Eye (LEME). LEME was initially pre-trained on the Llama2 70B framework and further fine-tuned with a corpus of ~127,000 non-copyrighted training instances curated from ophthalmology-specific case reports, abstracts, and open-source study materials. We benchmarked LEME against eight other LLMs, namely, GPT-3.5, GPT-4, three Llama2 models (7B, 13B, 70B), PMC-LLAMA 13B, Meditron 70B, and EYE-Llama (another ophthalmology-specific LLM). Evaluations included four internal validation tasks: abstract completion, fill-in-the-blank, multiple-choice questions (MCQ), and short-answer QA. External validation tasks encompassed long-form QA, MCQ, patient EHR summarization, and clinical QA. Evaluation metrics included Rouge-L scores, accuracy, and expert evaluation of correctness, completeness, and readability. In internal validations, LEME consistently outperformed its counterparts, achieving Rouge-L scores of 0.20 in abstract completion (all p<0.05), 0.82 in fill-in-the-blank (all p<0.0001), and 0.22 in short-answer QA (all p<0.0001, except versus GPT-4). In external validations, LEME excelled in long-form QA with a Rouge-L of 0.19 (all p<0.0001), ranked second in MCQ accuracy (0.68; all p<0.0001), and scored highest in EHR summarization and clinical QA (ranging from 4.24 to 4.83 out of 5 for correctness, completeness, and readability). LEME's emphasis on robust fine-tuning and the use of non-copyrighted data represents a breakthrough in open-source ophthalmology-specific LLMs, offering the potential to revolutionize execution of clinical tasks while democratizing research collaboration.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2410.0374

Country:

North America > United States > Massachusetts (0.14)
North America > United States > Connecticut (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback