AITopics | unified benchmark

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

Neural Information Processing Systems, Dec-26-2025, 03:13:07 GMT

Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios.

lightzero, monte carlo tree search, unified benchmark, (5 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)

OpenDataVal: a Unified Benchmark for Data Valuation

Neural Information Processing Systems, Dec-25-2025, 11:37:50 GMT

Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, there is no systematic and standardized benchmarking system for data valuation. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms.

data valuation algorithm, opendataval, unified benchmark, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark

Peng, Yuezhang, Cai, Chonghao, Liu, Ziang, Fan, Shuai, Jiang, Sheng, Xu, Hua, Liu, Yuxin, Chen, Qiguang, Xu, Kele, Li, Yao, Wang, Sheng, Qin, Libo, Chen, Xie

arXiv.org Artificial Intelligence, Dec-2-2025

Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2512.01603

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

UNO-Bench: A Unified Benchmark for Exploring the Compositional Law Between Uni-modal and Omni-modal in Omni Models

Chen, Chen, Hu, ZeYang, Chen, Fengjiao, Ma, Liya, Liu, Jiaxing, Li, Xiaoyu, Wang, Ziwen, Cao, Xuezhi, Cai, Xunliang

arXiv.org Artificial Intelligence, Oct-31-2025

Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, and comprehensive evaluation is required to drive the evolution of omni-model intelligence. In this work, we introduce a novel, high-quality, and UNified Omni model benchmark, UNO-Bench. This benchmark is designed to effectively evaluate both UNi-modal and Omni-modal capabilities under a unified ability taxonomy, spanning 44 task types and 5 modality combinations. It includes 1250 human-curated samples for omni-modal with 98% cross-modality solvability, and 2480 enhanced uni-modal samples. The human-generated dataset is well-suited to real-world scenarios, particularly within the Chinese context, whereas the automatically compressed dataset offers a 90% increase in speed and maintains 98% consistency across 18 public benchmarks. In addition to traditional multi-choice questions, we propose an innovative multi-step open-ended question format to assess complex reasoning. A general scoring model is incorporated, supporting 6 question types for automated evaluation with 95% accuracy. Experimental results show the Compositional Law between omni-modal and uni-modal performance, with omni-modal capability manifesting as a bottleneck effect on weak models while exhibiting synergistic promotion on strong models.

benchmark, machine learning, natural language, (17 more...)

2510.18915

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(2 more...)

DECO-Bench: Unified Benchmark for Decoupled Task-Agnostic Synthetic Data Release

Neural Information Processing Systems, May-27-2025, 16:41:14 GMT

In this work, we tackle the question of how to systematically benchmark task-agnostic decoupling methods for privacy-preserving machine learning (ML). Sharing datasets that include sensitive information often triggers privacy concerns, necessitating robust decoupling methods to separate sensitive and non-sensitive attributes. Despite the development of numerous decoupling techniques, a standard benchmark for systematically comparing these methods remains absent. Using our framework, we benchmark various decoupling techniques and evaluate their privacy-utility trade-offs. Finally, we release our source code, pre-trained models, and datasets of decoupled representations to foster research in this area.

deco-bench, decoupled task-agnostic synthetic data release, unified benchmark, (1 more...)

Technology:

Information Technology > Security & Privacy (0.68)
Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Machine Learning (0.48)

OpenDataVal: a Unified Benchmark for Data Valuation

Neural Information Processing Systems, May-26-2025, 23:20:36 GMT

Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, there is no systematic and standardized benchmarking system for data valuation. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model in scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.

data valuation algorithm, opendataval, unified benchmark, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

Neural Information Processing Systems, Jan-19-2025, 08:10:14 GMT

Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios.

general sequential decision scenario, lightzero, monte carlo tree search, (3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.65)

OpenDataVal: a Unified Benchmark for Data Valuation

Neural Information Processing Systems, Jan-18-2025, 14:40:29 GMT

Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, there is no systematic and standardized benchmarking system for data valuation. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model in scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.

data valuation algorithm, opendataval, unified benchmark, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

Niu, Yazhe, Pu, Yuan, Yang, Zhenjie, Li, Xueyan, Zhou, Tong, Ren, Jiyuan, Hu, Shuai, Li, Hongsheng, Liu, Yu

arXiv.org Artificial Intelligence, Oct-12-2023

Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. However, it has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications, especially when these environments involve complex action spaces and significant simulation costs, or inherent stochasticity. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios. Specifically, we summarize the most critical challenges in designing a general MCTS-style decision-making solver, then decompose the tightly-coupled algorithm and system design of tree-search RL methods into distinct sub-modules. By incorporating more appropriate exploration and optimization strategies, we can significantly enhance these sub-modules and construct powerful LightZero agents to tackle tasks across a wide range of domains, such as board games, Atari, MuJoCo, MiniGrid and GoBigger. Detailed benchmark results reveal the significant potential of such methods in building scalable and efficient decision intelligence. The code is available as part of OpenDILab at https://github.com/opendilab/LightZero.

general sequential decision scenario, monte carlo tree search, unified benchmark, (1 more...)

2310.08348

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (0.60)

Generative AI Models In Small Molecule Drug Discovery: The Open Challenge To Create A Unified Benchmark

#artificialintelligence, Feb-12-2018, 17:02:14 GMT

Generative AI models in chemistry are increasingly popular in the research community, mainly because of their relevance to drug discovery applications. They generate virtual molecules with desired chemical and biological properties (more details in this blog post). However, this flourishing literature still lacks a unified benchmark. Such a benchmark would provide a common framework to evaluate and compare different generative models. Moreover, it would help to formulate best practices for this emerging industry of 'AI molecule generators': how much training data is needed, how long the model should be trained, and so on.

benchmark, machine learning, natural language, (18 more...)

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Generation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.72)
