AITopics | evaluation method

Entropy-Calibrated Label Distribution Learning

Neural Information Processing Systems | Jun-20-2026, 13:02:48 GMT

Label Distribution Learning (LDL) has emerged as a powerful framework for estimating complete conditional label distributions, providing crucial reliability for risk-sensitive decision-making tasks. While existing LDL algorithms exhibit competent performance under the conventional LDL performance evaluation methods, two key limitations remain: (1) current algorithms systematically underperform on samples with low-entropy label distributions, which can be particularly valuable for decision making, and (2) the conventional performance evaluation methods are inherently biased due to the numerical imbalance of samples. In this paper, through empirical and theoretical analyses, we find that excessive cohesion between anchor vectors contributes significantly to the observed entropy bias phenomenon in LDL algorithms. Accordingly, we propose an inter-anchor angular regularization term that mitigates cohesion among anchor vectors by penalizing over-small angles. In addition, to alleviate the numerical imbalance of high-entropy samples in the test set, we propose an entropy-calibrated aggregation strategy that obtains the overall model performance by evaluating performance on the low-entropy and high-entropy subsets of the overall test set separately. Finally, we conduct extensive experiments on various real-world datasets to demonstrate the effectiveness of our proposal.
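
As a rough illustration of the entropy-calibrated aggregation idea described in the abstract above, the following Python sketch splits test samples by the Shannon entropy of their ground-truth label distributions, scores each subset separately, and then averages the two subset means. The KL-divergence metric, the `threshold` parameter, the equal-weight average, and all function names are assumptions made for illustration; the abstract does not specify them, so this should not be read as the paper's actual procedure.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for a true distribution p and a predicted distribution q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)


def entropy_calibrated_score(true_dists, pred_dists, threshold):
    """Average a per-sample metric separately on low- and high-entropy samples.

    Assumption: samples whose true-distribution entropy is <= threshold count
    as "low entropy", and the two subset means are combined with equal weight.
    Returns (overall, low_mean, high_mean).
    """
    low, high = [], []
    for p, q in zip(true_dists, pred_dists):
        score = kl_divergence(p, q)
        (low if entropy(p) <= threshold else high).append(score)
    if not low or not high:
        raise ValueError("threshold leaves one of the two subsets empty")
    low_mean = sum(low) / len(low)
    high_mean = sum(high) / len(high)
    return 0.5 * (low_mean + high_mean), low_mean, high_mean
```

When a few poorly predicted low-entropy samples sit among many well-predicted high-entropy ones, a plain mean over all samples dilutes the low-entropy error, whereas the two-subset average keeps it visible.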

artificial intelligence, machine learning, natural language, (15 more...)

Country: Asia > China (0.47)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(4 more...)

On Evaluating Policies for Robust POMDPs

Neural Information Processing Systems | Jun-17-2026, 07:36:55 GMT

Robust partially observable Markov decision processes (RPOMDPs) model sequential decision-making problems under partial observability, where an agent must be robust against a range of dynamics. RPOMDPs can be viewed as a two-player game between an agent, who selects actions, and nature, who adversarially selects the dynamics. Evaluating an agent policy requires finding an adversarial nature policy, which is computationally challenging. In this paper, we advance the evaluation of agent policies for RPOMDPs in three ways. First, we discuss suitable benchmarks.

artificial intelligence, machine learning, nature policy, (16 more...)

Country: Europe > Netherlands (0.28)

Genre:

Research Report > Experimental Study (0.67)
Research Report > New Finding (0.67)

Industry: Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

AVERIMATEC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web

Neural Information Processing Systems | Jun-14-2026, 11:22:04 GMT

Textual claims are often accompanied by images to enhance their credibility and spread on social media, but this also raises concerns about the spread of misinformation. Existing datasets for automated verification of image-text claims remain limited, as they often consist of synthetic claims and lack evidence annotations to capture the reasoning behind the verdict. In this work, we introduce AVERIMATEC, a dataset consisting of 1,297 real-world image-text claims. Each claim is annotated with question-answer (QA) pairs containing evidence from the web, reflecting a decomposed reasoning regarding the verdict. We mitigate common challenges in fact-checking datasets such as contextual dependence, temporal leakage, and evidence insufficiency, via claim normalization, temporally constrained evidence annotation, and a two-stage sufficiency check. We assess the consistency of the annotation in AVERIMATEC via inter-annotator studies, achieving a κ = 0.742 on verdicts and 74.7% consistency on QA pairs. We also propose a novel evaluation method for evidence retrieval and conduct extensive experiments to establish baselines for verifying image-text claims using open-web evidence.

large language model, machine learning, question answering, (25 more...)

Country: North America > United States (0.67)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Media > News (1.00)
Information Technology > Security & Privacy (1.00)
Government (0.93)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Information Management > Search (1.00)
Information Technology > Communications > Social Media (1.00)
(6 more...)

Robustness of Refugee-Matching Gains to Off-Policy Evaluation Choices

Bansak, Kirk; Paulson, Elisabeth; Rothenhäusler, Dominik; Ferwerda, Jeremy; Hainmueller, Jens; Hotard, Michael

arXiv.org Machine Learning | May-11-2026

Previous research has investigated the potential of refugee matching for boosting refugee outcomes, first considered by Bansak et al. (2018). This paper demonstrates the stability of counterfactual impact evaluation results in the context of refugee matching in the United States using a range of off-policy evaluation methods. In order to estimate counterfactual impact and test the robustness of our results, we employ several evaluation methods, including inverse probability weighting (IPW) and multiple variants of augmented inverse probability weighting (AIPW). We also consider various modifications, including alternative modeling architectures and different assignment procedures. The impact estimates remain consistent in magnitude in all scenarios as well as statistically significant in most cases. Furthermore, the estimates are also consistent with the results originally presented in Bansak et al. (2018).
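
For readers unfamiliar with the two estimator families named in the abstract above, the following Python sketch shows the textbook forms of inverse probability weighting (IPW) and augmented inverse probability weighting (AIPW) for estimating the average outcome of a deterministic target assignment policy from logged data. The function names, argument layout, and the use of a precomputed outcome-model prediction per unit are assumptions for illustration; the paper's own variants and modeling choices are not given in the abstract.

```python
def ipw_value(actions, outcomes, propensities, target_actions):
    """Textbook IPW estimate of the target policy's mean outcome.

    actions[i]        : action actually taken for unit i (logged)
    outcomes[i]       : observed outcome for unit i
    propensities[i]   : probability that the logging policy chose actions[i]
    target_actions[i] : action the target policy would choose for unit i
    """
    total = 0.0
    for a, y, e, t in zip(actions, outcomes, propensities, target_actions):
        if a == t:  # only units where logged and target actions agree contribute
            total += y / e
    return total / len(outcomes)


def aipw_value(actions, outcomes, propensities, target_actions, model_preds):
    """Textbook AIPW (doubly robust) estimate.

    model_preds[i] is an outcome-model prediction for unit i under
    target_actions[i]; how that model is fit is left open here (assumption).
    """
    total = 0.0
    for a, y, e, t, m in zip(
        actions, outcomes, propensities, target_actions, model_preds
    ):
        total += m  # model-based term
        if a == t:  # weighted residual correction on matching units
            total += (y - m) / e
    return total / len(outcomes)
```

AIPW adds a model-based prediction and corrects it with a weighted residual, so the estimate stays consistent if either the propensities or the outcome model is correctly specified. This is one reason to compare several such estimators when checking robustness.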

artificial intelligence, assignment, machine learning, (14 more...)

arXiv: 2605.06686

Country: North America > United States (0.49)

Genre: Research Report > New Finding (0.66)

Industry:

Government > Immigration & Customs (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.90)
Government > Regional Government (0.90)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis

Neural Information Processing Systems | Apr-30-2026, 02:54:31 GMT

This work studies how to evaluate explanations of graph neural networks (GNNs), which is crucial to the credibility of post-hoc explainability in practical usage. Conventional evaluation metrics, and even explanation methods -- which mainly follow the paradigm of feeding the explanatory subgraph to the model and measuring output difference -- mostly suffer from the notorious out-of-distribution (OOD) issue. Hence, in this work, we endeavor to confront this issue by introducing a novel evaluation metric, termed OOD-resistant Adversarial Robustness (OAR). Specifically, we draw inspiration from adversarial robustness and evaluate post-hoc explanation subgraphs by calculating their robustness under attack. On top of that, an elaborate OOD reweighting block is inserted into the pipeline to confine the evaluation process to the original data distribution. For applications involving large datasets, we further devise a Simplified version of OAR (SimOAR), which achieves a significant improvement in computational efficiency at the cost of a small amount of performance.

artificial intelligence, explanation, machine learning, (16 more...)

Country:

Europe (0.93)
North America > United States (0.46)
North America > Canada (0.28)
Asia > China (0.28)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Neural Information Processing Systems | Mar-22-2026, 08:24:59 GMT

In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through A vs B paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct an extensive evaluation of Elo behavior across simulated and real-world scenarios, demonstrating that individual Elo computations can exhibit significant volatility. We show that both axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs. If the current use of Elo scores is intended to substitute the costly head-to-head comparison of LLMs, it is crucial to ensure the ranking is as robust as possible. Guided by the axioms, our findings offer concrete guidelines for enhancing the reliability of LLM evaluation methods, suggesting a need for reassessment of existing comparative approaches.
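
To make the volatility point in the abstract above concrete, the following Python sketch implements the standard Elo expected-score and update formulas and lets the same set of pairwise results be replayed in different orders. The K-factor of 16, the initial rating of 1000, and the function names are assumptions chosen for illustration, not settings taken from the paper.

```python
def expected_score(r_a, r_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=16):
    """Return updated (r_a, r_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    k is the K-factor (assumed value; it controls the update step size).
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def run_matches(matches, initial=1000.0, k=16):
    """Replay (model_a, model_b, score_a) tuples in order; return final ratings."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        ratings[a], ratings[b] = elo_update(r_a, r_b, score_a, k)
    return ratings
```

Replaying two wins and one loss for model A in the order win, win, loss versus loss, win, win gives different final ratings, even though the underlying results are identical. Averaging over many shuffled orderings is one common way to reduce this order sensitivity.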

artificial intelligence, large language model, natural language, (7 more...)

Industry: Leisure & Entertainment > Games > Chess (0.60)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
