prevalence value
Analysis of Diagnostics (Part II): Prevalence, Linear Independence, and Unsupervised Learning
Patrone, Paul N., Binder, Raquel A., Forconi, Catherine S., Moormann, Ann M., Kearsley, Anthony J.
This is the second manuscript in a two-part series that uses diagnostic testing to understand the connection between prevalence (i.e. number of elements in a class), uncertainty quantification (UQ), and classification theory. Part I considered the context of supervised machine learning (ML) and established a duality between prevalence and the concept of relative conditional probability. The key idea of that analysis was to train a family of discriminative classifiers by minimizing a sum of prevalence-weighted empirical risk functions. The resulting outputs can be interpreted as relative probability level-sets, which thereby yield uncertainty estimates in the class labels. This procedure also demonstrated that certain discriminative and generative ML models are equivalent. Part II considers the extent to which these results can be extended to tasks in unsupervised learning through recourse to ideas in linear algebra. We first observe that the distribution of an impure population, for which the class of a corresponding sample is unknown, can be parameterized in terms of a prevalence. This motivates us to introduce the concept of linearly independent populations, which have different but unknown prevalence values. Using this, we identify an isomorphism between classifiers defined in terms of impure and pure populations. In certain cases, this also leads to a nonlinear system of equations whose solution yields the prevalence values of the linearly independent populations, fully realizing unsupervised learning as a generalization of supervised learning. We illustrate our methods in the context of synthetic data and a research-use-only SARS-CoV-2 enzyme-linked immunosorbent assay (ELISA).
Regularization-Based Methods for Ordinal Quantification
Bunse, Mirko, Moreo, Alejandro, Sebastiani, Fabrizio, Senz, Martin
Quantification, i.e., the task of training predictors of the class prevalence values in sets of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multiclass problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n>2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others' developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
Multi-Label Quantification
Moreo, Alejandro, Francisco, Manuel, Sebastiani, Fabrizio
Quantification, variously called "supervised prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies (a.k.a. "prevalence values") of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but na\"ive, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the na\"ive approaches by a large margin. The code to reproduce all our experiments is available online.
QuaPy: A Python-Based Framework for Quantification
Moreo, Alejandro, Esuli, Andrea, Sebastiani, Fabrizio
QuaPy is an open-source framework for performing quantification (a.k.a. supervised prevalence estimation), written in Python. Quantification is the task of training quantifiers via supervised learning, where a quantifier is a predictor that estimates the relative frequencies (a.k.a. prevalence values) of the classes of interest in a sample of unlabelled data. While quantification can be trivially performed by applying a standard classifier to each unlabelled data item and counting how many data items have been assigned to each class, it has been shown that this "classify and count" method is outperformed by methods specifically designed for quantification. QuaPy provides implementations of a number of baseline methods and advanced quantification methods, of routines for quantification-oriented model selection, of several broadly accepted evaluation measures, and of robust evaluation protocols routinely used in the field. QuaPy also makes available datasets commonly used for testing quantifiers, and offers visualization tools for facilitating the analysis and interpretation of the results. The software is open-source and publicly available under a BSD-3 licence via https://github.com/HLT-ISTI/QuaPy, and can be installed via pip (https://pypi.org/project/QuaPy/)
Re-Assessing the "Classify and Count" Quantification Method
Moreo, Alejandro, Sebastiani, Fabrizio
Learning to quantify (a.k.a.\ quantification) is a task concerned with training unbiased estimators of class prevalence via supervised learning. This task originated with the observation that "Classify and Count" (CC), the trivial method of obtaining class prevalence estimates, is often a biased estimator, and thus delivers suboptimal quantification accuracy; following this observation, several methods for learning to quantify have been proposed that have been shown to outperform CC. In this work we contend that previous works have failed to use properly optimised versions of CC. We thus reassess the real merits of CC (and its variants), and argue that, while still inferior to some cutting-edge methods, they deliver near-state-of-the-art accuracy once (a) hyperparameter optimisation is performed, and (b) this optimisation is performed by using a true quantification loss instead of a standard classification-based loss. Experiments on three publicly available binary sentiment classification datasets support these conclusions.
Data-Driven Malaria Prevalence Prediction in Large Densely-Populated Urban Holoendemic sub-Saharan West Africa: Harnessing Machine Learning Approaches and 22-years of Prospectively Collected Data
Brown, Biobele J., Przybylski, Alexander A., Manescu, Petru, Caccioli, Fabio, Oyinloye, Gbeminiyi, Elmi, Muna, Shaw, Michael J., Pawar, Vijay, Claveau, Remy, Shawe-Taylor, John, Srinivasan, Mandayam A., Afolabi, Nathaniel K., Orimadegun, Adebola E., Ajetunmobi, Wasiu A., Akinkunmi, Francis, Kowobari, Olayinka, Osinusi, Kikelomo, Akinbami, Felix O., Omokhodion, Samuel, Shokunbi, Wuraola A., Lagunju, Ikeoluwa, Sodeinde, Olugbemiro, Fernandez-Reyes, Delmiro
Plasmodium falciparum malaria still poses one of the greatest threats to human life with over 200 million cases globally leading to half-million deaths annually. Of these, 90% of cases and of the mortality occurs in sub-Saharan Africa, mostly among children. Although malaria prediction systems are central to the 2016-2030 malaria Global Technical Strategy, currently these are inadequate at capturing and estimating the burden of disease in highly endemic countries. We developed and validated a computational system that exploits the predictive power of current Machine Learning approaches on 22-years of prospective data from the high-transmission holoendemic malaria urban-densely-populated sub-Saharan West-Africa metropolis of Ibadan. Our dataset of >9x104 screened study participants attending our clinical and community services from 1996 to 2017 contains monthly prevalence, temporal, environmental and host features. Our Locality-specific Elastic-Net based Malaria Prediction System (LEMPS) achieves good generalization performance, both in magnitude and direction of the prediction, when tasked to predict monthly prevalence on previously unseen validation data (MAE<=6x10-2, MSE<=7x10-3) within a range of (+0.1 to -0.05) error-tolerance which is relevant and usable for aiding decision-support in a holoendemic setting. LEMPS is well-suited for malaria prediction, where there are multiple features which are correlated with one another, and trading-off between regularization-strength L1-norm and L2-norm allows the system to retain stability. Data-driven systems are critical for regionally-adaptable surveillance, management of control strategies and resource allocation across stretched healthcare systems.