AITopics | evaluation study

Collaborating Authors

evaluation study

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning

Neural Information Processing SystemsMar-17-2026, 20:30:07 GMT

Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLM) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. For that reason, few works have recently started to frame the problem of deepfake detection as a Visual Question Answering (VQA) task, nevertheless omitting the realistic and informative open-ended multi-label setting. With the rapid advances in the field of VLLM, an exponential rise of investigations in that direction is expected. As such, there is a need for a clear experimental methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate different VLLM architectures. Previous evaluation studies in deepfake detection have mostly focused on the simpler binary task, overlooking evaluation protocols for multi-label fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary evaluation protocol and conducts a comprehensive evaluation study to compare the capabilities of several VLLMs in this context. In the first stage, we assess the models' performance on the binary task and their sensitivity to given instructions using several prompts.

artificial intelligence, detection, natural language, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.62)

Add feedback

Closing the Performance Gap Between AI and Radiologists in Chest X-Ray Reporting

Sharma, Harshita, Reynolds, Maxwell C., Salvatelli, Valentina, Sykes, Anne-Marie G., Horst, Kelly K., Schwaighofer, Anton, Ilse, Maximilian, Melnichenko, Olesya, Bond-Taylor, Sam, Pérez-García, Fernando, Mugu, Vamshi K., Chan, Alex, Colak, Ceylan, Swartz, Shelby A., Nashawaty, Motassem B., Gonzalez, Austin J., Ouellette, Heather A., Erdal, Selnur B., Schueler, Beth A., Wetscherek, Maria T., Codella, Noel, Jain, Mohit, Bannur, Shruthi, Bouzid, Kenza, Castro, Daniel C., Hyland, Stephanie, Korfiatis, Panos, Khandelwal, Ashish, Alvarez-Valle, Javier

arXiv.org Artificial IntelligenceDec-1-2025

AI-assisted report generation offers the opportunity to reduce radiologists' workload stemming from expanded screening guidelines, complex cases and workforce shortages, while maintaining diagnostic accuracy. In addition to describing pathological findings in chest X-ray reports, interpreting lines and tubes (L&T) is demanding and repetitive for radiologists, especially with high patient volumes. We introduce MAIRA-X, a clinically evaluated multimodal AI model for longitudinal chest X-ray (CXR) report generation, that encompasses both clinical findings and L&T reporting. Developed using a large-scale, multi-site, longitudinal dataset of 3.1 million studies (comprising 6 million images from 806k patients) from Mayo Clinic, MAIRA-X was evaluated on three holdout datasets and the public MIMIC-CXR dataset, where it significantly improved AI-generated reports over the state of the art on lexical quality, clinical correctness, and L&T-related elements. A novel L&T-specific metrics framework was developed to assess accuracy in reporting attributes such as type, longitudinal change and placement. A first-of-its-kind retrospective user evaluation study was conducted with nine radiologists of varying experience, who blindly reviewed 600 studies from distinct subjects. The user study found comparable rates of critical errors (3.0% for original vs. 4.6% for AI-generated reports) and a similar rate of acceptable sentences (97.8% for original vs. 97.4% for AI-generated reports), marking a significant improvement over prior user studies with larger gaps and higher error rates. Our results suggest that MAIRA-X can effectively assist radiologists, particularly in high-volume clinical settings.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2511.21735

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Nuclear Medicine (1.00)
Health & Medicine > Diagnostic Medicine > Imaging (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Sensing and Signal Processing > Image Processing (0.92)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)

Add feedback

A Hitchhiker's Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning

Neural Information Processing SystemsMay-26-2025, 15:16:25 GMT

artificial intelligence, detection, natural language, (11 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Vision language models have difficulty recognizing virtual objects

Tran, Tyler, Khemlani, Sangeet, Trafton, J. G.

arXiv.org Artificial IntelligenceMay-16-2025

Vision language models (VLMs) are AI systems paired with both language and vision encoders to process multimodal input. They are capable of performing complex semantic tasks such as automatic captioning, but it remains an open question about how well they comprehend the visuospatial properties of scenes depicted in the images they process. We argue that descriptions of virtual objects -- objects that are not visually represented in an image -- can help test scene comprehension in these AI systems. For example, an image that depicts a person standing under a tree can be paired with the following prompt: imagine that a kite is stuck in the tree. VLMs that comprehend the scene should update their representations and reason sensibly about the spatial relations between all three objects. We describe systematic evaluations of state-of-the-art VLMs and show that their ability to process virtual objects is inadequate.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.10453

Country: North America > United States (0.94)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.47)
Government > Regional Government > North America Government > United States Government (0.46)
Government > Military > Navy (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Rethinking Emotion Annotations in the Era of Large Language Models

Niu, Minxue, El-Tawil, Yara, Romana, Amrit, Provost, Emily Mower

arXiv.org Artificial IntelligenceDec-10-2024

Modern affective computing systems rely heavily on datasets with human-annotated emotion labels, for training and evaluation. However, human annotations are expensive to obtain, sensitive to study design, and difficult to quality control, because of the subjective nature of emotions. Meanwhile, Large Language Models (LLMs) have shown remarkable performance on many Natural Language Understanding tasks, emerging as a promising tool for text annotation. In this work, we analyze the complexities of emotion annotation in the context of LLMs, focusing on GPT-4 as a leading model. In our experiments, GPT-4 achieves high ratings in a human evaluation study, painting a more positive picture than previous work, in which human labels served as the only ground truth. On the other hand, we observe differences between human and GPT-4 emotion perception, underscoring the importance of human input in annotation studies. To harness GPT-4's strength while preserving human perspective, we explore two ways of integrating GPT-4 into emotion annotation pipelines, showing its potential to flag low-quality labels, reduce the workload of human annotators, and improve downstream model learning performance and efficiency. Together, our findings highlight opportunities for new emotion labeling practices and suggest the use of LLMs as a promising tool to aid human annotation.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2412.07906

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
Europe > Spain > Valencian Community > Alicante Province > Alicante (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Designing an Evaluation Framework for Large Language Models in Astronomy Research

Wu, John F., Hyk, Alina, McCormick, Kiera, Ye, Christine, Astarita, Simone, Baral, Elina, Ciuca, Jo, Cranney, Jesse, Field, Anjalie, Iyer, Kartheik, Koehn, Philipp, Kotler, Jenn, Kruk, Sandor, Ntampaka, Michelle, O'Neill, Charles, Peek, Joshua E. G., Sharma, Sanjib, Yunus, Mikaeel

arXiv.org Artificial IntelligenceMay-30-2024

Large Language Models (LLMs) are shifting how scientific research is done. It is imperative to understand how researchers interact with these models and how scientific sub-communities like astronomy might benefit from them. However, there is currently no standard for evaluating the use of LLMs in astronomy. Therefore, we present the experimental design for an evaluation study on how astronomy researchers interact with LLMs. We deploy a Slack chatbot that can answer queries from users via Retrieval-Augmented Generation (RAG); these responses are grounded in astronomy papers from arXiv. We record and anonymize user questions and chatbot answers, user upvotes and downvotes to LLM responses, user feedback to the LLM, and retrieved documents and similarity scores with the query. Our data collection method will enable future dynamic evaluations of LLM tools for astronomy.

chatbot, llm, query, (15 more...)

arXiv.org Artificial Intelligence

2405.20389

Country:

North America > United States > Maryland > Baltimore (0.05)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > Oregon > Benton County > Corvallis (0.04)
(3 more...)

Genre: Research Report > New Finding (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

Detecting Compromised IoT Devices Using Autoencoders with Sequential Hypothesis Testing

Mainuddin, Md, Duan, Zhenhai, Dong, Yingfei

arXiv.org Artificial IntelligenceApr-21-2024

IoT devices fundamentally lack built-in security mechanisms to protect themselves from security attacks. Existing works on improving IoT security mostly focus on detecting anomalous behaviors of IoT devices. However, these existing anomaly detection schemes may trigger an overwhelmingly large number of false alerts, rendering them unusable in detecting compromised IoT devices. In this paper we develop an effective and efficient framework, named CUMAD, to detect compromised IoT devices. Instead of directly relying on individual anomalous events, CUMAD aims to accumulate sufficient evidence in detecting compromised IoT devices, by integrating an autoencoder-based anomaly detection subsystem with a sequential probability ratio test (SPRT)-based sequential hypothesis testing subsystem. CUMAD can effectively reduce the number of false alerts in detecting compromised IoT devices, and moreover, it can detect compromised IoT devices quickly. Our evaluation studies based on the public-domain N-BaIoT dataset show that CUMAD can on average reduce the false positive rate from about 3.57% using only the autoencoder-based anomaly detection scheme to about 0.5%; in addition, CUMAD can detect compromised IoT devices quickly, with less than 5 observations on average.

cumad, detection, iot device, (13 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/BigData59044.2023.10386399

2404.1369

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > Florida > Leon County > Tallahassee (0.04)

Genre: Research Report (0.50)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.93)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.68)

Technology:

Information Technology > Internet of Things (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Physics-Informed Solution of The Stationary Fokker-Plank Equation for a Class of Nonlinear Dynamical Systems: An Evaluation Study

Alhussein, Hussam, Khasawneh, Mohammed, Daqaq, Mohammed F.

arXiv.org Artificial IntelligenceSep-25-2023

The Fokker-Planck (FP) equation is a linear partial differential equation which governs the temporal and spatial evolution of the probability density function (PDF) associated with the response of stochastic dynamical systems. An exact analytical solution of the FP equation is only available for a limited subset of dynamical systems. Semi-analytical methods are available for larger, yet still a small subset of systems, while traditional computational methods; e.g. Finite Elements and Finite Difference require dividing the computational domain into a grid of discrete points, which incurs significant computational costs for high-dimensional systems. Physics-informed learning offers a potentially powerful alternative to traditional computational schemes. To evaluate its potential, we present a data-free, physics-informed neural network (PINN) framework to solve the FP equation for a class of nonlinear stochastic dynamical systems. In particular, through several examples concerning the stochastic response of the Duffing, Van der Pol, and the Duffing-Van der Pol oscillators, we assess the ability and accuracy of the PINN framework in $i)$ predicting the PDF under the combined effect of additive and multiplicative noise, $ii)$ capturing P-bifurcations of the PDF, and $iii)$ effectively treating high-dimensional systems. Through comparisons with Monte-Carlo simulations and the available literature, we show that PINN can effectively address all of the afore-described points. We also demonstrate that the computational time associated with the PINN solution can be substantially reduced by using transfer learning.

nonlinear dynamical system, physics-informed solution, stationary fokker-plank equation, (1 more...)

arXiv.org Artificial Intelligence

2309.16725

Genre: Research Report (0.40)

Technology:

Information Technology > Scientific Computing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

Add feedback

GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization

Liu, Yang Janet, Zeldes, Amir

arXiv.org Artificial IntelligenceJun-19-2023

Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to 'hallucinations', low performance on non-news genres, and outputs which are not exactly summaries. Targeting ACL 2023's 'Reality Check' theme, we present GUMSum, a small but carefully crafted dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization. Summaries are highly constrained, focusing on substitutive potential, factuality, and faithfulness. We present guidelines and evaluate human agreement as well as subjective judgments on recent system outputs, comparing general-domain untuned approaches, a fine-tuned one, and a prompt-based approach, to human performance. Results show that while GPT3 achieves impressive scores, it still underperforms humans, with varying quality across genres. Human judgments reveal different types of errors in supervised, prompted, and human-generated summaries, shedding light on the challenges of producing a good summary.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2306.11256

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(10 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Media > News (0.47)
Education (0.46)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Responsibility Perspective Transfer for Italian Femicide News

Minnema, Gosse, Lai, Huiyuan, Muscato, Benedetta, Nissim, Malvina

arXiv.org Artificial IntelligenceJun-1-2023

Different ways of linguistically expressing the same real-world event can lead to different perceptions of what happened. Previous work has shown that different descriptions of gender-based violence (GBV) influence the reader's perception of who is to blame for the violence, possibly reinforcing stereotypes which see the victim as partly responsible, too. As a contribution to raise awareness on perspective-based writing, and to facilitate access to alternative perspectives, we introduce the novel task of automatically rewriting GBV descriptions as a means to alter the perceived level of responsibility on the perpetrator. We present a quasi-parallel dataset of sentences with low and high perceived responsibility levels for the perpetrator, and experiment with unsupervised (mBART-based), zero-shot and few-shot (GPT3-based) methods for rewriting sentences. We evaluate our models using a questionnaire study and a suite of automatic metrics.

computational linguistic, perpetrator, responsibility, (16 more...)

arXiv.org Artificial Intelligence

2306.00437

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy (0.05)
North America > Dominican Republic (0.04)
(7 more...)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.66)
Law > Criminal Law (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)

Add feedback