 multiclass quantification


Kernel Density Estimation for Multiclass Quantification

arXiv.org Machine Learning

Quantification (variously called learning to quantify or class prevalence estimation) is the area of supervised machine learning concerned with estimating the percentage of instances in a population (hereafter, a bag of examples) that belong to each of the classes of interest [González et al., 2017, Esuli et al., 2023]. Quantification finds applications in many disciplines, such as the social sciences, epidemiology, and market research, in which the interest lies at the aggregate level: inferring characteristics of a single individual (e.g., via classification or regression) is of little concern, since group-level information is all that is needed. Although binary quantification (i.e., the setting in which the classes of interest are positive vs. negative) has been, by far, the most studied scenario in the quantification literature [Card and Smith, 2018, Forman, 2008, Bella et al., 2010, Esuli and Sebastiani, 2015, Hassan et al., 2020, Moreo and Sebastiani, 2021], many applications of quantification naturally arise in the multiclass regime, i.e., in cases in which there are more than two mutually exclusive classes. Examples of multiclass settings are ubiquitous and include the allocation of human resources to different departments in a company [Forman, 2005], the analysis of the different phytoplankton species that may be present in a water sample [González et al., 2019], and the analysis of the various causes of death studied in verbal autopsies [King and Lu, 2008], to name a few. A more concrete example is answering questions like: "What is the percentage of tweets conveying positive, neutral, and negative opinions concerning a specific hashtag?"
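To make the output of a multiclass quantifier concrete, the following minimal sketch computes a prevalence vector for the tweet example by simply counting predicted labels. This naive "classify and count" baseline is an assumption chosen for illustration and is not the method proposed in the paper; the function name `prevalence` and the label data are hypothetical.

```python
from collections import Counter


def prevalence(labels, classes):
    """Return the fraction of items in `labels` assigned to each class."""
    if not labels:
        raise ValueError("cannot estimate prevalence of an empty bag")
    counts = Counter(labels)
    n = len(labels)
    # Classes that never occur get a prevalence of 0.0.
    return {c: counts[c] / n for c in classes}


# Hypothetical predicted labels for a bag of 10 tweets about one hashtag.
predicted = [
    "positive", "neutral", "negative", "positive", "positive",
    "neutral", "negative", "positive", "neutral", "positive",
]
estimate = prevalence(predicted, ["positive", "neutral", "negative"])
print(estimate)
```

The result is one number per class, and the numbers sum to 1. Counting raw classifier predictions in this way is known to be biased when the class distribution of the bag differs from that of the training data, which is one reason dedicated quantification methods exist.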


A Comparative Evaluation of Quantification Methods

arXiv.org Artificial Intelligence

Quantification represents the problem of predicting class distributions in a given target set. It also represents a growing research field in supervised machine learning, for which a wide variety of algorithms has been proposed in recent years. However, a comprehensive empirical comparison of quantification methods that supports algorithm selection is not yet available. In this work, we close this research gap by conducting a thorough empirical performance comparison of 24 different quantification methods. To consider a broad range of different scenarios for binary as well as multiclass quantification settings, we carried out almost 3 million experimental runs on 40 data sets. We observe that no single algorithm generally outperforms all competitors, but identify a group of methods, including the Median Sweep and the DyS framework, that perform significantly better in binary settings. For the multiclass setting, we observe that a different, broad group of algorithms yields good performance, including the Generalized Probabilistic Adjusted Count, the readme method, the energy distance minimization method, the EM algorithm for quantification, and Friedman's method. More generally, we find that the performance on multiclass quantification is inferior to the results obtained in the binary setting. Our results can guide practitioners who intend to apply quantification algorithms and help researchers identify opportunities for future research.
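As an illustration of one of the multiclass methods named above, the sketch below implements an EM-based prevalence estimator. It assumes that "the EM algorithm for quantification" refers to the iterative prior re-estimation procedure commonly attributed to Saerens et al. (2002), and that a probabilistic classifier has already produced posterior probabilities for every item in the bag. The function name `em_quantify`, its parameters, and the example numbers are hypothetical.

```python
def em_quantify(posteriors, train_prev, max_iter=1000, tol=1e-8):
    """Estimate class prevalences of an unlabelled bag by EM prior re-estimation.

    posteriors: one row per item, each row the classifier's P(class | item)
    train_prev: class prevalences in the training set (all strictly positive)
    """
    n, k = len(posteriors), len(train_prev)
    prev = list(train_prev)  # start from the training prevalences
    for _ in range(max_iter):
        totals = [0.0] * k
        for row in posteriors:
            # E-step: reweight each posterior by the ratio of the current
            # prevalence estimate to the training prevalence, then renormalise.
            w = [p * prev[j] / train_prev[j] for j, p in enumerate(row)]
            z = sum(w)
            for j in range(k):
                totals[j] += w[j] / z
        # M-step: the new estimate is the mean of the adjusted posteriors.
        new_prev = [t / n for t in totals]
        if sum(abs(a - b) for a, b in zip(new_prev, prev)) < tol:
            return new_prev
        prev = new_prev
    return prev


# Hypothetical posteriors for a bag of four items and three classes.
bag = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
estimated = em_quantify(bag, train_prev=[1 / 3, 1 / 3, 1 / 3])
print(estimated)
```

The procedure only rescales existing posteriors, so its quality depends on how well calibrated the underlying classifier is; the sketch does not attempt any calibration.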