AITopics | benchmark framework

Collaborating Authors

benchmark framework

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing SystemsDec-23-2025, 20:07:31 GMT

Accessible high-quality data is the bread and butter of machine learning research, and the demand for data has exploded as larger and more advanced ML models are built across different domains. Yet, real data often contain sensitive information, are subject to various biases, and are costly to acquire, which compromise their quality and accessibility. Synthetic data have thus emerged as a complement to, sometimes even a replacement for, real data for ML training. However, the landscape of synthetic data research has been fragmented due to the diverse range of data modalities, such as tabular, time series, and images, and the wide array of use cases, including privacy preservation, fairness considerations, and data augmentation. This fragmentation poses practical challenges when comparing and selecting synthetic data generators in for different problem settings. To this end, we develop Synthcity, an open-source Python library that allows researchers and practitioners to perform one-click benchmarking of synthetic data generators across data modalities and use cases. Beyond benchmarking, Synthcity serves as a centralized toolkit for accessing cutting-edge data generators. In addition, Synthcity's flexible plug-in style API makes it easy to incorporate additional data generators into the framework. Using examples of tabular data generation and data augmentation, we illustrate the general applicability of Synthcity, and the insight one can obtain.

benchmark framework, diverse use case, synthcity, (8 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.76)
Information Technology > Software (0.59)

Add feedback

MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU

Huang, Yong, Yang, Zhongqi, Rahmani, Amir

arXiv.org Artificial IntelligenceOct-29-2025

Abstract--Sepsis is a leading cause of mortality in intensive care units (ICUs), yet existing research often relies on outdated datasets, non-reproducible preprocessing pipelines, and limited coverage of clinical interventions. We introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from the MIMIC-IV database, designed to support reproducible modeling of sepsis trajectories. Our cohort includes 35,239 ICU patients with time-aligned clinical variables and standardized treatment data, including vasopressors, fluids, mechanical ventilation and antibiotics. We describe a transparent preprocess-ing pipeline--based on Sepsis-3 criteria, structured imputation strategies, and treatment inclusion--and release it alongside benchmark tasks focused on early mortality prediction, length-of-stay estimation, and shock onset classification. Empirical results demonstrate that incorporating treatment variables substantially improves model performance, particularly for Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research. Sepsis is a life-threatening condition caused by the body's extreme response to an infection that can lead to organ failure and even death.

artificial intelligence, intervention, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2510.245

Country: North America > United States > California > Orange County > Irvine (0.15)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks

Imbriani, Manuela, Belmonte, Gina, Massink, Mieke, Tofani, Alessandro, Ciancia, Vincenzo

arXiv.org Artificial IntelligenceAug-19-2025

This paper presents preliminary results in the definition of a comprehensive benchmark framework designed to systematically evaluate spatial reasoning capabilities in neural networks, with a particular focus on morphological properties such as connectivity and distance relationships. The framework is currently being used to study the capabilities of nnU-Net, exploiting the spatial model checker V oxLogicA to generate two distinct categories of synthetic datasets: maze connectivity problems for topological analysis and spatial distance computation tasks for geometric understanding. Each category is evaluated across multiple resolutions to assess scalability and generalization properties. The automated pipeline encompasses a complete machine learning workflow including: synthetic dataset generation, standardized training with cross-validation, inference execution, and comprehensive evaluation using Dice coefficient and IoU (Intersection over Union) metrics. Preliminary experimental results demonstrate significant challenges in neural network spatial reasoning capabilities, revealing systematic failures in basic geometric and topological understanding tasks. The framework provides a reproducible experimental protocol, enabling researchers to identify specific limitations. Such limitations could be addressed through hybrid approaches combining neural networks with symbolic reasoning methods for improved spatial understanding in clinical applications, establishing a foundation for ongoing research into neural network spatial reasoning limitations and potential solutions.

artificial intelligence, machine learning, spatial reasoning, (17 more...)

arXiv.org Artificial Intelligence

2508.12741

Country:

North America > United States (0.28)
Europe > Italy > Tuscany (0.14)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Synthcity: a benchmark framework for diverse use cases of tabular synthetic data

Neural Information Processing SystemsOct-9-2024, 13:34:22 GMT

data generator, synthcity, use case, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.80)

Add feedback

Advancing Machine Learning in Industry 4.0: Benchmark Framework for Rare-event Prediction in Chemical Processes

Sudarshan, Vikram, Seider, Warren D.

arXiv.org Artificial IntelligenceAug-31-2024

Previously, using forward-flux sampling (FFS) and machine learning (ML), we developed multivariate alarm systems to counter rare un-postulated abnormal events. Our alarm systems utilized ML-based predictive models to quantify committer probabilities as functions of key process variables (e.g., temperature, concentrations, and the like), with these data obtained in FFS simulations. Herein, we introduce a novel and comprehensive benchmark framework for rare-event prediction, comparing ML algorithms of varying complexity, including Linear Support-Vector Regressor and k-Nearest Neighbors, to more sophisticated algorithms, such as Random Forests, XGBoost, LightGBM, CatBoost, Dense Neural Networks, and TabNet. This evaluation uses comprehensive performance metrics, such as: $\textit{RMSE}$, model training, testing, hyperparameter tuning and deployment times, and number and efficiency of alarms. These balance model accuracy, computational efficiency, and alarm-system efficiency, identifying optimal ML strategies for predicting abnormal rare events, enabling operators to obtain safer and more reliable plant operations.

artificial intelligence, machine learning, survey article, (17 more...)

arXiv.org Artificial Intelligence

2409.00485

Country: North America > United States > Pennsylvania (0.28)

Genre:

Research Report (0.63)
Overview (0.46)

Industry:

Information Technology > Security & Privacy (1.00)
Materials > Chemicals (0.94)
Health & Medicine > Therapeutic Area (0.93)
Energy > Oil & Gas > Upstream (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)

Add feedback

Causality for Tabular Data Synthesis: A High-Order Structure Causal Benchmark Framework

Tu, Ruibo, Senane, Zineb, Cao, Lele, Zhang, Cheng, Kjellström, Hedvig, Henter, Gustav Eje

arXiv.org Artificial IntelligenceJul-5-2024

Tabular synthesis models remain ineffective at capturing complex dependencies, and the quality of synthetic data is still insufficient for comprehensive downstream tasks, such as prediction under distribution shifts, automated decision-making, and cross-table understanding. A major challenge is the lack of prior knowledge about underlying structures and high-order relationships in tabular data. We argue that a systematic evaluation on high-order structural information for tabular data synthesis is the first step towards solving the problem. In this paper, we introduce high-order structural causal information as natural prior knowledge and provide a benchmark framework for the evaluation of tabular synthesis models. The framework allows us to generate benchmark datasets with a flexible range of data generation processes and to train tabular synthesis models using these datasets for further evaluation. We propose multiple benchmark tasks, high-order metrics, and causal inference tasks as downstream tasks for evaluating the quality of synthetic data generated by the trained models. Our experiments demonstrate to leverage the benchmark framework for evaluating the model capability of capturing high-order structural causal information. Furthermore, our benchmarking results provide an initial assessment of state-of-the-art tabular synthesis models. They have clearly revealed significant gaps between ideal and actual performance and how baseline methods differ. Our benchmark framework is available at URL https://github.com/TURuibo/CauTabBench.

causal information, dataset, information, (14 more...)

arXiv.org Artificial Intelligence

2406.08311

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Add feedback

Evaluating and Enhancing Large Language Models Performance in Domain-specific Medicine: Osteoarthritis Management with DocOA

Chen, Xi, You, MingKe, Wang, Li, Liu, WeiZhi, Fu, Yu, Xu, Jie, Zhang, Shaoting, Chen, Gang, Li, Kang, Li, Jian

arXiv.org Artificial IntelligenceJan-19-2024

The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. This study focused on evaluating and enhancing the clinical capabilities of LLMs in specific domains, using osteoarthritis (OA) management as a case study. A domain specific benchmark framework was developed, which evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM tailored for OA management that integrates retrieval-augmented generation (RAG) and instruction prompts, was developed. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results showed that general LLMs like GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. This study introduces a novel benchmark framework which assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.

evaluation, knowledge, llm, (17 more...)

arXiv.org Artificial Intelligence

2401.12998

Country:

Asia > China > Sichuan Province > Chengdu (0.05)
Asia > China > Shanghai > Shanghai (0.04)
North America > United States > New Jersey > Hudson County > Hoboken (0.04)
(2 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Rheumatology (1.00)
Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Empirical Study of Explainable AI Techniques on Deep Learning Models For Time Series Tasks

Schlegel, Udo, Oelke, Daniela, Keim, Daniel A., El-Assady, Mennatallah

arXiv.org Artificial IntelligenceDec-8-2020

Decision explanations of machine learning black-box models are often generated by applying Explainable AI (XAI) techniques. However, many proposed XAI methods produce unverified outputs. Evaluation and verification are usually achieved with a visual interpretation by humans on individual images or text. In this preregistration, we propose an empirical study and benchmark framework to apply attribution methods for neural networks developed for images and text data on time series. We present a methodology to automatically evaluate and rank attribution techniques on time series using perturbation methods to identify reliable approaches.

attribution, time point, xai technique, (12 more...)

arXiv.org Artificial Intelligence

2012.04344

Country:

North America > United States (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Industry:

Transportation (0.49)
Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

An Open Source AutoML Benchmark

Gijsbers, Pieter, LeDell, Erin, Thomas, Janek, Poirier, Sébastien, Bischl, Bernd, Vanschoren, Joaquin

arXiv.org Machine LearningJul-1-2019

In recent years, an active field of research has developed around automated machine learning (AutoML). Unfortunately, comparing different AutoML systems is hard and often done incorrectly. We introduce an open, ongoing, and extensible benchmark framework which follows best practices and avoids common mistakes. The framework is open-source, uses public datasets and has a website with up-to-date results. We use the framework to conduct a thorough comparison of 4 AutoML systems across 39 datasets and analyze the results.

automl tool, benchmark, dataset, (14 more...)

arXiv.org Machine Learning

1907.00909

Country:

North America > United States > New York > New York County > New York City (0.05)
Europe > Portugal > Porto > Porto (0.04)
Europe > Netherlands > North Brabant > Eindhoven (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.41)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback