AITopics | nlp metric

Collaborating Authors

nlp metric

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis

Song, Xinyi, Lee, Lina, Xie, Kexin, Liu, Xueying, Deng, Xinwei, Hong, Yili

arXiv.org Artificial IntelligenceFeb-24-2025

The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the accuracy of code generated by LLMs. A major challenge in this evaluation lies in the absence of a benchmark dataset for statistical code (e.g., SAS and R). To fill in this gap, this paper introduces StatLLM, an open-source dataset for evaluating the performance of LLMs in statistical analysis. The StatLLM dataset comprises three key components: statistical analysis tasks, LLM-generated SAS code, and human evaluation scores. The first component includes statistical analysis tasks spanning a variety of analyses and datasets, providing problem descriptions, dataset details, and human-verified SAS code. The second component features SAS code generated by ChatGPT 3.5, ChatGPT 4.0, and Llama 3.1 for those tasks. The third component contains evaluation scores from human experts in assessing the correctness, effectiveness, readability, executability, and output accuracy of the LLM-generated code. We also illustrate the unique potential of the established benchmark dataset for (1) evaluating and enhancing natural language processing metrics, (2) assessing and improving LLM performance in statistical coding, and (3) developing and testing of next-generation statistical software - advancements that are crucial for data science and machine learning research.

dataset, evaluation, sas code, (14 more...)

arXiv.org Artificial Intelligence

2502.17657

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Virginia (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

APT-Pipe: An Automatic Prompt-Tuning Tool for Social Computing Data Annotation

Zhu, Yiming, Yin, Zhizhuo, Tyson, Gareth, Haq, Ehsan-Ul, Lee, Lik-Hang, Hui, Pan

arXiv.org Artificial IntelligenceFeb-8-2024

Recent research has highlighted the potential of LLM applications, like ChatGPT, for performing label annotation on social computing text. However, it is already well known that performance hinges on the quality of the input prompts. To address this, there has been a flurry of research into prompt tuning -- techniques and guidelines that attempt to improve the quality of prompts. Yet these largely rely on manual effort and prior knowledge of the dataset being annotated. To address this limitation, we propose APT-Pipe, an automated prompt-tuning pipeline. APT-Pipe aims to automatically tune prompts to enhance ChatGPT's text classification performance on any given dataset. We implement APT-Pipe and test it across twelve distinct text classification datasets. We find that prompts tuned by APT-Pipe help ChatGPT achieve higher weighted F1-score on nine out of twelve experimented datasets, with an improvement of 7.01% on average. We further highlight APT-Pipe's flexibility as a framework by showing how it can be extended to support additional tuning mechanisms.

apt-pipe, chatgpt, dataset, (13 more...)

arXiv.org Artificial Intelligence

2402.01697

Country:

Asia > Singapore > Central Region > Singapore (0.05)
Asia > China > Hong Kong (0.05)
Asia > China > Guangdong Province > Guangzhou (0.05)
(5 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Immunology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Inspecting state of the art performance and NLP metrics in image-based medical report generation

Pino, Pablo, Parra, Denis, Messina, Pablo, Besa, Cecilia, Uribe, Sergio

arXiv.org Artificial IntelligenceNov-21-2020

Several deep learning architectures have been proposed over the last years to deal with the problem of generating a written report given an imaging exam as input. Most works evaluate the generated reports using standard Natural Language Processing (NLP) metrics (e.g. BLEU, ROUGE), reporting significant progress. In this article, we contrast this progress by comparing state of the art (SOTA) models against weak baselines. We show that simple and even naive approaches yield near SOTA performance on most traditional NLP metrics. We conclude that evaluation methods in this task should be further studied towards correctly measuring clinical accuracy, ideally involving physicians to contribute to this end.

baseline, medical report generation, nlp metric, (10 more...)

arXiv.org Artificial Intelligence

2011.09257

Country:

South America > Chile (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.73)
Health & Medicine > Nuclear Medicine (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.76)

Add feedback