AITopics

doi: 10.1145/3715275.3732071

2506.07281

Country:

Europe > United Kingdom > England (0.46)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine (0.94)
Social Sector (0.93)
Law (0.93)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.67)

Vickers, Peter, Barrault, Loïc, Monti, Emilio, Aletras, Nikolaos

We Need to Talk About Classification Evaluation Metrics in NLP

arXiv.org Artificial Intelligence, Jan-8-2024

In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use. This lack of agreement suggests that the underlying heuristics each metric encodes have not been sufficiently examined. To address this, we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks, we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the scikit-learn classifier format.
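As a rough illustration of why a chance-normalised metric can rank models differently from Accuracy, the sketch below computes binary Informedness in its standard dichotomous form (true positive rate plus true negative rate minus one). It is not the implementation released by the authors; the function names and the toy data are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def informedness(y_true, y_pred):
    """Binary Informedness: recall + inverse recall - 1 (range -1 to 1)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in y_true")
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return tpr + tnr - 1.0


# Hypothetical imbalanced test set: 90 positives, 10 negatives.
y_true = [1] * 90 + [0] * 10
# A degenerate model that always predicts the majority class.
always_positive = [1] * 100

print(accuracy(y_true, always_positive))      # 0.9
print(informedness(y_true, always_positive))  # 0.0
```

The majority-class guesser looks strong under Accuracy but scores zero under Informedness, which is the kind of disagreement between metrics that the abstract describes.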

accuracy, classifier, informedness, (14 more...)

2401.03831

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Ireland (0.04)
Europe > France (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.66)

Udandarao, Vishaal, Burg, Max F., Albanie, Samuel, Bethge, Matthias

Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models

arXiv.org Artificial Intelligence, Dec-6-2023

Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.

arxiv preprint arxiv, dataset, learning, (12 more...)

2310.08577

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.88)

Industry:

Transportation > Ground > Road (0.67)
Automobiles & Trucks (0.67)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation

arXiv.org Machine Learning, Oct-10-2020

Commonly used evaluation measures including Recall, Precision, F-Measure and Rand Accuracy are biased and should not be used without a clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Under any of these commonly used measures, a system that performs worse in the objective sense of Informedness can appear to perform better. We discuss several concepts and measures that reflect the probability that prediction is informed versus chance, namely Informedness, and introduce Markedness as a dual measure for the probability that prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
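One of the connections mentioned in this abstract can be checked numerically in the dichotomous case: Informedness is recall plus inverse recall minus one, Markedness is precision plus inverse precision minus one, and the Matthews correlation is the signed geometric mean of the two. The sketch below verifies this on a made-up confusion matrix; the counts and function names are assumptions chosen for illustration, not values from the paper.

```python
import math


def informedness(tp, fp, fn, tn):
    """Recall + inverse recall - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def markedness(tp, fp, fn, tn):
    """Precision + inverse precision - 1."""
    return tp / (tp + fp) + tn / (tn + fn) - 1.0


def matthews_correlation(tp, fp, fn, tn):
    """Pearson correlation between binary predictions and binary labels."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator


# Hypothetical confusion matrix (assumed counts, for illustration only).
tp, fp, fn, tn = 40, 10, 20, 30

inf = informedness(tp, fp, fn, tn)
mark = markedness(tp, fp, fn, tn)
mcc = matthews_correlation(tp, fp, fn, tn)

# Informedness and Markedness share the sign of (tp*tn - fp*fn),
# so their product is non-negative and equals the squared correlation.
geometric_mean = math.copysign(math.sqrt(inf * mark), inf)

print(f"informedness={inf:.4f} markedness={mark:.4f}")
print(f"mcc={mcc:.4f} signed geometric mean={geometric_mean:.4f}")
```

The two printed correlation values agree up to floating-point error, since both Informedness and Markedness share the numerator tp·tn − fp·fn and differ only in which pair of marginals they normalise by.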

artificial intelligence, informedness, machine learning, (18 more...)

2010.16061

Country:

North America > United States > New York (0.04)
Oceania > Australia > South Australia (0.04)
North America > United States > District of Columbia > Washington (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

ADABOOK & MULTIBOOK: Adaptive Boosting with Chance Correction

arXiv.org Machine Learning, Oct-10-2020

There has been considerable interest in boosting and bagging, including the combination of the adaptive techniques of AdaBoost with the random selection with replacement techniques of Bagging. At the same time there has been a revisiting of the way we evaluate, with chance-corrected measures like Kappa, Informedness, Correlation or ROC AUC being advocated. This leads to the question of whether learning algorithms can do better by optimizing an appropriate chance-corrected measure. Indeed, it is possible for a weak learner to optimize Accuracy to the detriment of the more realistic chance-corrected measures, and when this happens the booster can give up too early. This phenomenon is known to occur with conventional Accuracy-based AdaBoost, and the MultiBoost algorithm has been developed to overcome such problems using restart techniques based on bagging. This paper thus complements the theoretical work showing the necessity of using chance-corrected measures for evaluation, with empirical work showing how use of a chance-corrected measure can improve boosting. We show that the early surrender problem occurs in MultiBoost too, in multiclass situations, so that chance-corrected AdaBook and MultiBook can beat standard MultiBoost or AdaBoost, and we further identify which chance-corrected measures to use when.

artificial intelligence, learner, machine learning, (17 more...)

2010.1555

Country:

Asia > China > Beijing > Beijing (0.05)
Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)

arXiv.org Artificial Intelligence, Oct-10-2020

A computationally and cognitively plausible model of supervised and unsupervised learning

Powers, David M W

Both empirical and mathematical demonstrations of the importance of chance-corrected measures are discussed, and a new model of learning is proposed based on empirical psychological results on association learning. Two forms of this model are developed, the Informatron as a chance-corrected Perceptron, and AdaBook as a chance-corrected AdaBoost procedure. Computational results presented show chance correction facilitates learning.

artificial intelligence, machine learning, probability, (15 more...)

2010.14618

Country:

Asia > China > Beijing > Beijing (0.05)
Oceania > Australia > South Australia > Adelaide (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Silvestro, Daniele, Andermann, Tobias

Prior choice affects ability of Bayesian neural networks to identify unknowns

arXiv.org Artificial Intelligence, May-11-2020

Deep Bayesian neural networks (BNNs) are a powerful, though computationally demanding, tool for performing parameter estimation while jointly estimating uncertainty around predictions. BNNs are typically implemented using arbitrary normal-distributed prior distributions on the model parameters. Here, we explore the effects of different prior distributions on classification tasks in BNNs and evaluate the evidence supporting the predictions based on posterior probabilities approximated by Markov Chain Monte Carlo sampling and by computing Bayes factors. We show that the choice of priors has a substantial impact on the ability of the model to confidently assign data to the correct class (true positive rates). Prior choice also significantly affects the ability of a BNN to identify out-of-distribution instances as unknown (false positive rates). When comparing our results against neural networks (NN) with Monte Carlo dropout, we found that BNNs generally outperform NNs. Finally, in our tests we did not find a single best choice of prior distribution. Instead, each dataset yielded the best results under a different prior, indicating that testing alternative options can improve the performance of BNNs.

artificial intelligence, dataset, machine learning, (19 more...)

2005.04987

Country:

Europe > Sweden > Vaestra Goetaland > Gothenburg (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Europe > Switzerland > Fribourg > Fribourg (0.04)

Genre: Research Report > New Finding (0.34)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Epidemiology (1.00)

Visualization of Tradeoff in Evaluation: from Precision-Recall & PN to LIFT, ROC & BIRD

arXiv.org Machine Learning, May-3-2015

Evaluation often aims to reduce the correctness or error characteristics of a system down to a single number, but that always involves trade-offs. Another way of dealing with this is to quote two numbers, such as Recall and Precision, or Sensitivity and Specificity. But it can also be useful to see more than this, and a graphical approach can explore sensitivity to cost, prevalence, bias, noise, parameters and hyper-parameters. Moreover, most techniques are implicitly based on two balanced classes, and our ability to visualize graphically is intrinsically two dimensional, but we often want to visualize in a multiclass context. We review the dichotomous approaches relating to Precision, Recall, and ROC as well as the related LIFT chart, exploring how they handle unbalanced and multiclass data, and deriving new probabilistic and information theoretic variants of LIFT that help deal with the issues associated with the handling of multiple and unbalanced classes.

artificial intelligence, machine learning, prevalence, (18 more...)

1505.00401

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Evaluation Evaluation a Monte Carlo study

arXiv.org Machine Learning, Apr-3-2015

Over the last decade there has been increasing concern about the biases embodied in traditional evaluation methods for Natural Language Processing/Learning, particularly methods borrowed from Information Retrieval. Without knowledge of the Bias and Prevalence of the contingency being tested, or equivalently the expectation due to chance, the simple conditional probabilities Recall, Precision and Accuracy are not meaningful as evaluation measures, either individually or in combinations such as F-factor. The existence of bias in NLP measures leads to the 'improvement' of systems by increasing their bias, such as the practice of improving tagging and parsing scores by using the most common value (e.g. water is always a Noun) rather than attempting to discover the correct one. The measures Cohen Kappa and Powers Informedness are discussed as unbiased alternatives to Recall and related to the psychologically significant measure DeltaP. In this paper we will analyze both biased and unbiased measures theoretically, characterizing the precise relationship between all these measures as well as evaluating the evaluation measures themselves empirically using a Monte Carlo simulation.
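The claim that scores can be 'improved' simply by increasing bias can be illustrated with a small Monte Carlo sketch. It simulates guessers whose predictions are independent of the true label and varies only how often they predict the positive class. This is not the simulation design used in the paper; the prevalence, bias values, sample size, and function name are all assumptions for illustration.

```python
import random


def simulate_guesser(prevalence, bias, n, rng):
    """Score a guesser that predicts positive with probability `bias`,
    independently of the true label, on `n` simulated items."""
    tp = fp = fn = tn = 0
    for _ in range(n):
        actual = rng.random() < prevalence
        predicted = rng.random() < bias
        if actual and predicted:
            tp += 1
        elif actual:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    informedness = recall + tn / (tn + fp) - 1.0
    return accuracy, recall, informedness


rng = random.Random(0)  # fixed seed so the run is repeatable
prevalence = 0.9        # assumed share of positive items
results = []
for bias in (0.5, 0.7, 0.9):
    acc, rec, inf = simulate_guesser(prevalence, bias, 20000, rng)
    results.append((bias, acc, rec, inf))
    print(f"bias={bias:.1f} accuracy={acc:.3f} recall={rec:.3f} "
          f"informedness={inf:+.3f}")
```

Because the guesser carries no information about the label, its expected Informedness is zero at every bias level, while Accuracy and Recall climb as the bias moves toward the majority class.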

artificial intelligence, machine learning, natural language, (21 more...)

1504.00854

Country: Oceania > Australia (0.28)

Genre:

Research Report (0.50)
Overview (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Staab, Eugen, Caminada, Martin

Assessing the Impact of Informedness on a Consultant's Profit

arXiv.org Artificial Intelligence, Sep-4-2009

We study the notion of informedness in a client-consultant setting. Using a software simulator, we examine the extent to which it pays off for consultants to provide their clients with advice that is well-informed, or with advice that is merely meant to appear to be well-informed. The latter strategy is beneficial in that it costs fewer resources to keep up-to-date, but carries the risk of a decreased reputation if the clients discover the low level of informedness of the consultant. Our experimental results indicate that under different circumstances, different strategies yield the optimal results (net profit) for the consultants.

argument, artificial intelligence, natural language, (14 more...)

0909.0901

Country: Europe (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.33)