A Hierarchical Error Framework for Reliable Automated Coding in Communication Research: Applications to Health and Political Communication

Zhao, Zhilong, Liu, Yindi

arXiv.org Artificial Intelligence

Automated content analysis increasingly supports communication research, yet scaling manual coding into computational pipelines raises concerns about measurement reliability and validity. We introduce a Hierarchical Error Correction (HEC) framework that treats model failures as layered measurement errors (knowledge gaps, reasoning limitations, and complexity constraints) and targets the layers that most affect inference. The framework implements a three-phase methodology: systematic error profiling across hierarchical layers, targeted intervention design matched to dominant error sources, and rigorous validation with statistical testing. Evaluating HEC on health communication (medical specialty classification), political communication (bias detection), and legal tasks, we validate the approach with five diverse large language models. Results show average accuracy gains of 11.2 percentage points (p < .001, McNemar's test) and more stable conclusions through reduced systematic misclassification. Cross-model validation demonstrates consistent improvements (range: +6.8 to +14.6 percentage points), with effectiveness concentrated in tasks with moderate-to-high baseline accuracy (50-85%). A boundary study reveals diminished returns in very high-baseline (>85%) or precision-matching tasks, establishing applicability limits. We map layered errors to threats to construct and criterion validity and provide a transparent, measurement-first blueprint for diagnosing error profiles, selecting targeted interventions, and reporting reliability/validity evidence alongside accuracy. The framework applies to automated coding across communication research and the broader social sciences.
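
To make the validation step concrete, the following is a minimal, illustrative sketch (not the authors' released code) of how paired baseline versus error-corrected coding decisions can be compared with an exact McNemar's test. The item counts and the baseline/corrected arrays are hypothetical.

```python
from math import comb

def exact_mcnemar(baseline_correct, corrected_correct):
    """Exact two-sided McNemar's test on paired per-item outcomes.

    baseline_correct, corrected_correct: lists of bools, one entry per coded
    item, indicating whether each pipeline matched the gold label. Only
    discordant pairs (one pipeline right, the other wrong) inform the test.
    """
    b = sum(1 for x, y in zip(baseline_correct, corrected_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, corrected_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Under H0 each discordant pair is a fair coin flip; exact binomial tail.
    k = min(b, c)
    p_two_sided = 2 * sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, p_two_sided)

# Hypothetical usage: 200 items coded by a baseline pipeline and an
# HEC-corrected pipeline, scored against human gold labels.
baseline = [True] * 120 + [False] * 80
corrected = [True] * 150 + [False] * 50
print(exact_mcnemar(baseline, corrected))
```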




Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

Gokdemir, Ozan, Getty, Neil, Underwood, Robert, Madireddy, Sandeep, Cappello, Franck, Ramanathan, Arvind, Foster, Ian T., Stevens, Rick L.

arXiv.org Artificial Intelligence

As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for generating multiple-choice question-answering (MCQA) benchmarks directly from large corpora of scientific papers. Our pipeline automates every stage of MCQA creation, including PDF parsing, semantic chunking, question generation, and model evaluation. As a case study, we generate more than 16,000 MCQs from 22,000 open-access articles in radiation and cancer biology. We then evaluate a suite of small language models (1.1B-14B parameters) on these questions, comparing baseline accuracy with retrieval-augmented generation (RAG) from paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1. We find that reasoning-trace retrieval consistently improves performance on both synthetic and expert-annotated benchmarks, enabling several small models to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.
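
As an illustration of the retrieval step, here is a minimal sketch of retrieval-augmented MCQA prompting, using TF-IDF similarity as a simple stand-in for the paper's semantic retrieval over paper-derived chunks or distilled reasoning traces. The trace texts, the sample question, and `top_k` are hypothetical, and the actual pipeline's embedding model and prompt format may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical retrieval corpus: reasoning traces distilled from a larger model.
reasoning_traces = [
    "Ionizing radiation causes double-strand DNA breaks; in G1 they are "
    "repaired mainly via non-homologous end joining.",
    "The oxygen enhancement ratio reflects the increased radiosensitivity of "
    "well-oxygenated cells.",
    "Fractionation spares late-responding normal tissue by allowing sublethal "
    "damage repair between doses.",
]

question = (
    "Which repair pathway dominates for double-strand breaks in the G1 phase?\n"
    "A) Homologous recombination\nB) Non-homologous end joining\n"
    "C) Base excision repair\nD) Mismatch repair"
)

def build_rag_prompt(question: str, corpus: list[str], top_k: int = 2) -> str:
    """Retrieve the top_k most similar traces and prepend them to the MCQ prompt."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(corpus)
    query_vec = vectorizer.transform([question])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    best = scores.argsort()[::-1][:top_k]
    context = "\n".join(corpus[i] for i in best)
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer with A, B, C, or D."

print(build_rag_prompt(question, reasoning_traces))
```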


Do machine learning climate models work in changing climate dynamics?

Navarro, Maria Conchita Agana, Li, Geng, Wolf, Theo, Pérez-Ortiz, María

arXiv.org Artificial Intelligence

Our baseline runs followed the ClimateSet single emulator specifications (Kaltenborn et al., 2023):
Training Process: Each emulator is trained on data from a single climate model, predicting outputs for an entire sequence of monthly data for each year.
Pre-Processing: The data has been pre-processed by ClimateSet to have a spatial resolution of approximately 250 km (144 x 96 longitude-latitude cells) and a temporal resolution of monthly data. The time series is divided into 1-year chunks, resulting in data with a shape of (scenarios, years * months, variables, longitude, latitude).
Input and Output Shapes: The input data has the shape (batch, sequence length, num vars, lon, lat), where the sequence length is 12 (monthly data). The output has the shape (batch, sequence length, 2, lon, lat), where the '2' corresponds to temperature (TAS) and precipitation (PR).
Training Parameters: The models are trained for 50 epochs with an initial learning rate of 2e-4, using an exponential decay scheduler. For the non-frozen ClimaX models, training begins with a 5-epoch warm-up phase at 1e-8, followed by training at 5e-4.
Loss: The latitude-longitude weighted mean squared error (LLMSE), as implemented in (Nguyen et al., 2023), is used.
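
For readers unfamiliar with the loss, the following is a minimal PyTorch sketch of a latitude-weighted mean squared error in the spirit of the LLMSE described above, with weights proportional to cos(latitude) normalized to mean 1. The exact normalization and grid handling in ClimateSet/ClimaX may differ; the tensor shapes mirror the (batch, 12, 2, lon, lat) convention quoted above.

```python
import torch

def latitude_weighted_mse(pred: torch.Tensor, target: torch.Tensor,
                          lat_deg: torch.Tensor) -> torch.Tensor:
    """Latitude-weighted MSE over (batch, time, vars, lon, lat) tensors.

    lat_deg holds grid-cell latitudes in degrees (length = lat dimension).
    Cells near the poles cover less area, so their squared errors are
    down-weighted by cos(latitude), normalized to average 1.
    """
    weights = torch.cos(torch.deg2rad(lat_deg))
    weights = weights / weights.mean()            # shape: (lat,)
    sq_err = (pred - target) ** 2                 # (batch, time, vars, lon, lat)
    return (sq_err * weights).mean()              # broadcast over the lat axis

# Hypothetical shapes matching the description: batch of 4, 12 months,
# 2 output variables (TAS, PR), 144 x 96 longitude-latitude grid.
lat = torch.linspace(-89.0, 89.0, 96)
pred = torch.randn(4, 12, 2, 144, 96)
target = torch.randn(4, 12, 2, 144, 96)
print(latitude_weighted_mse(pred, target, lat))
```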


Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance

Gabriel, Roy M., Zandehshahvar, Mohammadreza, van Assen, Marly, Kittisut, Nattakorn, Peters, Kyle, De Cecco, Carlo N., Adibi, Ali

arXiv.org Artificial Intelligence

To reduce the amount of required labeled data for lung disease severity classification from chest X-rays (CXRs) under class imbalance, this study applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function. This retrospective study collected 2,319 CXRs from 963 patients (mean age, 59.2 ± 16.6 years; 481 female) at Emory Healthcare affiliated hospitals between January and November 2020. All patients had clinically confirmed COVID-19. Each CXR was independently labeled by 3 to 6 board-certified radiologists as normal, moderate, or severe. A deep neural network with Monte Carlo Dropout was trained using active learning to classify disease severity. Various acquisition functions were used to iteratively select the most informative samples from an unlabeled pool. Performance was evaluated using accuracy, area under the receiver operating characteristic curve (AUROC), and area under the precision-recall curve (AUPRC). Training time and acquisition time were recorded. Statistical analysis included descriptive metrics and performance comparisons across acquisition strategies. Entropy Sampling achieved 93.7% accuracy (AUROC, 0.91) in binary classification (normal vs. diseased) using 15.4% of the training data. In the multi-class setting, Mean STD sampling achieved 70.3% accuracy (AUROC, 0.86) using 23.1% of the labeled data. These methods outperformed more complex and computationally expensive acquisition functions and significantly reduced labeling needs. Deep active learning with BNN approximation and weighted loss effectively reduces labeled data requirements while addressing class imbalance, maintaining or exceeding diagnostic performance.
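
To illustrate the acquisition step, here is a minimal PyTorch sketch of entropy-based sample selection with Monte Carlo Dropout, not the study's code: dropout is kept active at inference, softmax outputs are averaged over T stochastic passes, and the highest-entropy unlabeled samples are queried. The toy model, the feature dimensionality, and T = 20 are assumptions.

```python
import torch
import torch.nn as nn

# Toy classifier with dropout; the study used a CXR model, this is only a stand-in.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5), nn.Linear(32, 3))

def entropy_acquisition(model: nn.Module, unlabeled: torch.Tensor,
                        n_query: int, T: int = 20) -> torch.Tensor:
    """Return indices of the n_query most uncertain unlabeled samples.

    Dropout stays active (model.train()), so each forward pass is a sample
    from the approximate posterior; predictive entropy is computed on the
    mean softmax over T passes.
    """
    model.train()  # keep dropout stochastic (MC Dropout)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(unlabeled), dim=-1) for _ in range(T)])
    mean_probs = probs.mean(dim=0)                                  # (N, classes)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=-1)
    return torch.topk(entropy, k=n_query).indices

# Hypothetical pool of 500 feature vectors; query the 32 most informative.
pool = torch.randn(500, 64)
print(entropy_acquisition(model, pool, n_query=32))
```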


Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study

Gonzalez-Machorro, Monica, Reichel, Uwe, Hecker, Pascal, Hammer, Helly, Sagha, Hesam, Eyben, Florian, Hoepner, Robert, Schuller, Björn W.

arXiv.org Artificial Intelligence

Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role of emotional changes as an indicator of depressive mood in both the general population and pwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions, such as neurodegenerative diseases.
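
As a small illustration of the cross-corpus setup and the evaluation metric, the sketch below trains on one corpus and scores Unweighted Average Recall (macro-averaged recall) on another with scikit-learn. The random feature matrices are placeholders for the acoustic/linguistic and SER-derived features, which are not reproduced here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Hypothetical feature matrices: rows = speakers, columns = functionals of
# speech/language features plus SER-derived emotion dimensions.
X_train, y_train = rng.normal(size=(200, 88)), rng.integers(0, 2, 200)  # English general-population corpus
X_test, y_test = rng.normal(size=(60, 88)), rng.integers(0, 2, 60)      # German pwMS corpus

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", class_weight="balanced"))
clf.fit(X_train, y_train)

# UAR = unweighted (macro) average of per-class recall; robust to class imbalance.
uar = recall_score(y_test, clf.predict(X_test), average="macro")
print(f"Cross-corpus UAR: {uar:.2f}")
```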


Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Vishwanath, Krithik, Alyakin, Anton, Ghosh, Mrigayu, Lee, Jin Vivian, Alber, Daniel Alexander, Sangwon, Karl L., Kondziolka, Douglas, Oermann, Eric Karl

arXiv.org Artificial Intelligence

The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring more than 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, and one previously passing model fell below the passing threshold. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
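
To make the distraction protocol concrete, here is a minimal sketch of how one might inject an irrelevant distractor sentence into each question stem and measure the paired accuracy drop. The distractor text, the `answer_question` stub, and the sample item are hypothetical stand-ins for the CNS-SANS questions and the evaluated LLMs.

```python
DISTRACTOR = ("The hospital cafeteria staged a theater production last night; "
              "the lead actor delivered his lines without a single lapse.")

def add_distractor(stem: str) -> str:
    """Prepend an irrelevant sentence whose words ('staged', 'delivered',
    'lapse') are polysemous with clinical meanings, per the framework above."""
    return f"{DISTRACTOR} {stem}"

def answer_question(prompt: str) -> str:
    """Hypothetical stub standing in for an LLM call that returns 'A'-'E'."""
    return "A"

def paired_accuracy(items: list[tuple[str, str]]) -> tuple[float, float]:
    """Return (clean accuracy, distracted accuracy) over (stem, gold) pairs."""
    clean = sum(answer_question(stem) == gold for stem, gold in items)
    distracted = sum(answer_question(add_distractor(stem)) == gold for stem, gold in items)
    n = len(items)
    return clean / n, distracted / n

items = [("Which cranial nerve innervates the lateral rectus? A) CN VI B) CN IV", "A")]
print(paired_accuracy(items))
```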