AITopics | evaluation criterion

evaluation criterion

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Li, Kai, Shen, Can, Liu, Yile, Han, Jirui, Zheng, Kelong, Zou, Xuechao, Wang, Zhe, Zhang, Shun, Du, Xingjian, Luo, Hanjun, Jin, Yingbin, Xing, Xinxin, Ma, Ziyang, Liu, Yue, Zhang, Yifan, Fang, Junfeng, Wang, Kun, Yan, Yibo, Deng, Gelei, Li, Haoyang, Li, Yiming, Zhuang, Xiaobin, Chen, Tianlong, Wen, Qingsong, Zhang, Tianwei, Liu, Yang, Hu, Haibo, Wu, Zhizheng, Hu, Xiaolin, Chng, Eng-Siong, Xu, Wenyuan, Wang, XiaoFeng, Dong, Wei, Li, Xinfeng

arXiv.org Artificial Intelligence | Oct-1-2025

Audio Large Language Models (ALLMs) have gained widespread adoption, yet their trustworthiness remains underexplored. Existing evaluation frameworks, designed primarily for text, fail to address unique vulnerabilities introduced by audio's acoustic properties. We identify significant trustworthiness risks in ALLMs arising from non-semantic acoustic cues, including timbre, accent, and background noise, which can manipulate model behavior. We propose AudioTrust, a comprehensive framework for systematic evaluation of ALLM trustworthiness across audio-specific risks. AudioTrust encompasses six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. The framework implements 26 distinct sub-tasks using a curated dataset of over 4,420 audio samples from real-world scenarios, including daily conversations, emergency calls, and voice assistant interactions. We conduct comprehensive evaluations across 18 experimental configurations using human-validated automated pipelines. Our evaluation of 14 state-of-the-art open-source and closed-source ALLMs reveals significant limitations when confronted with diverse high-risk audio scenarios, providing insights for secure deployment of audio models. Code and data are available at https://github.com/JusperLee/AudioTrust.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2505.16211

Country: Asia > China (0.45)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Banking & Finance (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability

Hu, Xinyu, Gao, Mingqi, Lin, Li, Yu, Zhenghan, Wan, Xiaojun

arXiv.org Artificial Intelligence | Feb-17-2025

In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
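The idea of scoring a metric by its consistency with human ratings can be illustrated with a small sketch. It is not the framework proposed in the paper; the ratings and the metric names `metric_a` and `metric_b` are invented for illustration, and Kendall's tau-a and Pearson's r are just two of the correlation measures one might choose.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: human quality ratings for five generated texts,
# and the scores two automatic metrics assigned to the same texts.
human = [1, 2, 3, 4, 5]
metric_a = [0.1, 0.3, 0.2, 0.8, 0.9]
metric_b = [0.9, 0.7, 0.8, 0.2, 0.1]

tau_a = kendall_tau_a(human, metric_a)
tau_b = kendall_tau_a(human, metric_b)
r_a = pearson_r(human, metric_a)

print(f"metric_a: tau={tau_a:.2f}, r={r_a:.2f}")
print(f"metric_b: tau={tau_b:.2f}")
```

Rank-based and linear measures can give different values for the same metric, which is one reason the choice of correlation measure matters when comparing evaluators.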

benchmark, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2502.12052

Country:

Asia (0.93)
Europe > Austria (0.28)
North America > Mexico (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Can LLMs be Good Graph Judger for Knowledge Graph Construction?

Huang, Haoyu, Chen, Chong, He, Conghui, Li, Yang, Jiang, Jiawei, Zhang, Wentao

arXiv.org Artificial Intelligence | Nov-26-2024

In real-world scenarios, most of the data obtained from information retrieval (IR) systems is unstructured. Converting natural language sentences into structured Knowledge Graphs (KGs) remains a critical challenge. The quality of constructed KGs may also impact the performance of some KG-dependent domains like GraphRAG systems and recommendation systems. Recently, Large Language Models (LLMs) have demonstrated impressive capabilities in addressing a wide range of natural language processing tasks. However, there are still challenges when utilizing LLMs to address the task of generating structured KGs. We have identified three limitations of existing KG construction methods. (1) Real-world documents contain a large amount of information and excessive noise, which could result in extracting messy information. (2) Native LLMs struggle to effectively extract accurate knowledge from some domain-specific documents. (3) The hallucination phenomenon cannot be overlooked when utilizing LLMs directly as an unsupervised method for constructing KGs. In this paper, we propose GraphJudger, a knowledge graph construction framework to address the aforementioned challenges. We introduce three innovative modules in our method: entity-centric iterative text denoising, knowledge-aware instruction tuning, and graph judgement. We seek to utilize the capacity of LLMs to function as a graph judger, a capability superior to their role only as a predictor for KG construction problems. Experiments conducted on two general text-graph pair datasets and one domain-specific text-graph pair dataset show superior performance compared to baseline methods. The code of our proposed method is available at https://github.com/hhy-huang/GraphJudger.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2411.17388

Country:

Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hubei Province > Wuhan (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Developing Guidelines for Functionally-Grounded Evaluation of Explainable Artificial Intelligence using Tabular Data

Velmurugan, Mythreyi, Ouyang, Chun, Xu, Yue, Sindhgatta, Renuka, Wickramanayake, Bemali, Moreira, Catarina

arXiv.org Artificial Intelligence | Sep-30-2024

Explainable Artificial Intelligence (XAI) techniques are used to provide transparency to complex, opaque predictive models. However, these techniques are often designed for image and text data, and it is unclear how fit-for-purpose they are when applied to tabular data. As XAI techniques are rarely evaluated in settings with tabular data, the applicability of existing evaluation criteria and methods is also unclear and needs (re-)examination. For example, some works suggest that evaluation methods may unduly influence the evaluation results when using tabular data. This lack of clarity on evaluation procedures can lead to reduced transparency and ineffective use of XAI techniques in real-world settings. In this study, we examine literature on XAI evaluation to derive guidelines on functionally-grounded assessment of local, post hoc XAI techniques. We identify 20 evaluation criteria and associated evaluation methods, and derive guidelines on when and how each criterion should be evaluated. We also identify key research gaps to be addressed by future work. Our study contributes to the body of knowledge on XAI evaluation through in-depth examination of functionally-grounded XAI evaluation protocols, and has laid the groundwork for future research on XAI evaluation.

data mining, explanation, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2410.12803

Country:

Europe > Switzerland (0.04)
Oceania > Australia > South Australia (0.04)
Oceania > Australia > Queensland > Brisbane (0.04)
(7 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.48)

Industry:

Health & Medicine (0.67)
Education > Curriculum > Subject-Specific Education (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (1.00)
(2 more...)

The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems

Sieker, Judith, Junker, Simeon, Utescher, Ronja, Attari, Nazia, Wersing, Heiko, Buschmeier, Hendrik, Zarrieß, Sina

arXiv.org Artificial Intelligence | Jun-27-2024

We examine how users perceive the limitations of an AI system when it encounters a task that it cannot perform perfectly and whether providing explanations alongside its answers aids users in constructing an appropriate mental model of the system's capabilities and limitations. We employ a visual question answering and explanation task where we control the AI system's limitations by manipulating the visual inputs: during inference, the system processes either full-color or grayscale images. Our goal is to determine whether participants can perceive the limitations of the system. We hypothesize that explanations will make limited AI capabilities more transparent to users. However, our results show that explanations do not have this effect. Instead of allowing users to more accurately assess the limitations of the AI system, explanations generally increase users' perceptions of the system's competence, regardless of its actual performance.

exp, explanation, participant, (15 more...)

arXiv.org Artificial Intelligence

2406.1917

Country:

North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Oceania > Australia (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.74)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Feature Selection using Wrapper Method with Python Implementation

#artificialintelligence | Oct-31-2020, 11:50:08 GMT

In today's era of big data and IoT, we are readily supplied with rich datasets of extremely high dimension. To perform any machine learning task on such high-dimensional data, or to draw insights from it, feature selection becomes very important. High dimensionality increases the complexity of a model and makes it harder to interpret. It also increases the time needed to train the model. Hence, feature selection becomes indispensable.

artificial intelligence, backward elimination, machine learning, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Feature selection using Wrapper methods in Python

#artificialintelligence | Jan-17-2020, 20:43:09 GMT

In today's era of big data and IoT, we are readily supplied with rich datasets of extremely high dimension. To perform any machine learning task on such high-dimensional data, or to draw insights from it, feature selection becomes very important. Hence, feature selection becomes indispensable. It is a crucial, must-have component of machine learning and data science workflows, especially when dealing with high-dimensional datasets. As the name suggests, it is the process of selecting the most significant and relevant features from a vast set of features in the given dataset.
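A wrapper method scores candidate feature subsets by training and evaluating a model on each one. The following is a minimal sketch of backward elimination, not the article's own implementation: the wrapped model (leave-one-out 1-nearest-neighbour accuracy), the function names, and the six-row toy dataset are all assumptions made for illustration.

```python
def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier using only `features`."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            # Squared Euclidean distance restricted to the selected features.
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)


def backward_elimination(X, y, score, min_features=1):
    """Greedily drop one feature at a time while the score does not get worse."""
    selected = list(range(len(X[0])))
    best = score(X, y, selected)
    while len(selected) > min_features:
        # Score every subset obtained by removing a single remaining feature.
        candidates = [
            (score(X, y, [f for f in selected if f != drop]), drop)
            for drop in selected
        ]
        cand_score, drop = max(candidates, key=lambda c: c[0])
        if cand_score < best:
            break  # every possible removal hurts the score
        best = cand_score
        selected.remove(drop)
    return selected, best


# Toy data (hypothetical): feature 0 separates the classes,
# features 1 and 2 are noise that misleads the nearest-neighbour model.
X = [
    [0.0, 0.0, 0.0],
    [0.2, 3.0, 0.1],
    [0.4, 6.0, 0.0],
    [1.0, 0.1, 0.1],
    [1.2, 3.1, 0.0],
    [1.4, 6.1, 0.1],
]
y = [0, 0, 0, 1, 1, 1]

selected, score = backward_elimination(X, y, loo_1nn_accuracy)
print(selected, score)
```

On this toy data the search keeps only feature 0. Ties are resolved in favour of removal, so the sketch prefers the smaller subset when two subsets score equally; a greedy search like this can still miss the best subset on harder data.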

backward elimination, elimination, selection, (12 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Similarity-based Multi-label Learning

Rossi, Ryan A., Ahmed, Nesreen K., Eldardiry, Hoda, Zhou, Rong

arXiv.org Machine Learning | Oct-27-2017

Multi-label classification is an important learning problem with many applications. In this work, we propose a principled similarity-based approach for multi-label learning called SML. We also introduce a similarity-based approach for predicting the label set size. The experimental results demonstrate the effectiveness of SML for multi-label classification, where it is shown to compare favorably with a wide variety of existing algorithms across a range of evaluation criteria.

algorithm, classification, learning, (10 more...)

arXiv.org Machine Learning

1710.10335

Country: North America > United States > California > Santa Clara County > Palo Alto (0.05)

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Mining Brain Networks using Multiple Side Views for Neurological Disorder Identification

Cao, Bokai, Kong, Xiangnan, Zhang, Jingyuan, Yu, Philip S., Ragin, Ann B.

arXiv.org Machine Learning | Aug-19-2015

Mining discriminative subgraph patterns from graph data has attracted great interest in recent years. It has a wide variety of applications in disease diagnosis, neuroimaging, etc. Most research on subgraph mining focuses on the graph representation alone. However, in many real-world applications, side information is available along with the graph data. For example, for neurological disorder identification, in addition to the brain networks derived from neuroimaging data, hundreds of clinical, immunologic, serologic and cognitive measures may also be documented for each subject. These measures compose multiple side views encoding a tremendous amount of supplemental information for diagnostic purposes, yet are often ignored. In this paper, we study the problem of discriminative subgraph selection using multiple side views and propose a novel solution to find an optimal set of subgraph features for graph classification by exploring a plurality of side views. We derive a feature evaluation criterion, named gSide, to estimate the usefulness of subgraph patterns based upon side views. Then we develop a branch-and-bound algorithm, called gMSV, to efficiently search for optimal subgraph features by integrating the subgraph mining process and the procedure of discriminative feature selection. Empirical studies on graph classification tasks for neurological disorders using brain networks demonstrate that subgraph patterns selected by the multi-side-view guided subgraph selection approach can effectively boost graph classification performance and are relevant to disease diagnosis.

artificial intelligence, data mining, machine learning, (16 more...)

arXiv.org Machine Learning

doi: 10.1109/ICDM.2015.50

1508.04554

Country:

North America > United States (0.28)
Asia > China (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)
Health & Medicine > Therapeutic Area > Immunology > HIV (0.47)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

The BOSARIS Toolkit: Theory, Algorithms and Code for Surviving the New DCF

Brümmer, Niko, de Villiers, Edward

arXiv.org Machine Learning | Apr-10-2013

The change of two orders of magnitude in the 'new DCF' of NIST's SRE'10, relative to the 'old DCF' evaluation criterion, posed a difficult challenge for participants and evaluator alike. Initially, participants were at a loss as to how to calibrate their systems, while the evaluator underestimated the required number of evaluation trials. After the fact, it is now obvious that both calibration and evaluation require very large sets of trials. This poses the challenges of (i) how to decide what number of trials is enough, and (ii) how to process such large data sets with reasonable memory and CPU requirements. After SRE'10, at the BOSARIS Workshop, we built solutions to these problems into the freely available BOSARIS Toolkit. This paper explains the principles and algorithms behind this toolkit. The main contributions of the toolkit are: 1. The Normalized Bayes Error-Rate Plot, which analyses likelihood-ratio calibration over a wide range of DCF operating points. These plots also help in judging the adequacy of the sizes of calibration and evaluation databases. 2. Efficient algorithms to compute DCF and minDCF for large score files, over the range of operating points required by these plots. 3. A new score file format, which facilitates working with very large trial lists. 4. A faster logistic regression optimizer for fusion and calibration. 5. A principled way to define EER (equal error rate), which is of practical interest when the absolute error count is small.
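As a rough illustration of what a DCF operating point is, the sketch below computes the detection cost function and its minimum over thresholds by brute force. It is not the toolkit's efficient algorithm. The scores are invented, and the operating-point parameters (old DCF: target prior 0.01, miss cost 10, false-alarm cost 1; new DCF: target prior 0.001, both costs 1) are the values commonly cited for the NIST evaluations and are an assumption here, since the abstract does not state them.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Miss and false-alarm rates when a trial is accepted if score >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def dcf(p_miss, p_fa, p_target, c_miss=1.0, c_fa=1.0):
    """Detection cost function: expected cost of misses plus false alarms."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


def min_dcf(target_scores, nontarget_scores, p_target, c_miss=1.0, c_fa=1.0):
    """Smallest DCF over all distinct thresholds (brute force, for small score lists)."""
    # Every observed score is a candidate threshold; +inf rejects all trials.
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [float("inf")]
    return min(
        dcf(*error_rates(target_scores, nontarget_scores, t), p_target, c_miss, c_fa)
        for t in thresholds
    )


def effective_prior(p_target, c_miss=1.0, c_fa=1.0):
    """Fold the two costs into a single equivalent target prior."""
    return c_miss * p_target / (c_miss * p_target + c_fa * (1.0 - p_target))


# Hypothetical detection scores (higher means more target-like).
tar = [2.0, 1.5, 0.5, -0.2]
non = [-1.0, -0.5, 0.0, 0.7, -2.0]

# Assumed operating points, see the note above.
old_point = dict(p_target=0.01, c_miss=10.0, c_fa=1.0)
new_point = dict(p_target=0.001, c_miss=1.0, c_fa=1.0)

min_old = min_dcf(tar, non, **old_point)
min_new = min_dcf(tar, non, **new_point)
print(min_old, min_new)
print(effective_prior(**old_point), effective_prior(**new_point))
```

Under these assumed parameters the effective target prior drops from roughly 0.09 to 0.001, which is one way to read the "two orders of magnitude" change. At such a low prior, false alarms dominate the cost, so very few errors determine the result and many more trials are needed for a stable estimate.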

artificial intelligence, machine learning, operating point, (16 more...)

arXiv.org Machine Learning

1304.2865

Country:

Europe > Spain (0.28)
North America > Canada (0.28)

Genre:

Research Report > New Finding (0.35)
Research Report > Experimental Study (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.31)
