
Copying Machine Learning Classifiers Machine Learning

We study model-agnostic copies of machine learning classifiers. We develop the theory behind the problem of copying, highlighting its differences from that of learning, and propose a framework to copy the functionality of any classifier using no prior knowledge of its parameters or training data distribution. We identify the different sources of loss and provide guidelines on how best to generate synthetic sets for the copying process. We further introduce a set of metrics to evaluate copies in practice. We validate our framework through extensive experiments using data from a series of well-known problems. We demonstrate the value of copies in use cases where desiderata such as interpretability, fairness or productivization constraints need to be addressed. Results show that copies can be exploited to enhance existing solutions, improving them by adding new features and characteristics.
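
Below is a minimal sketch of the copying process described above, assuming the original classifier is reachable only as a hard-label oracle through its predict() function. The two-moons dataset, the Gaussian proposal used to draw the synthetic set, and the decision-tree copy are illustrative choices, not the paper's exact protocol.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 1. Train an "original" black-box model we want to copy.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
original = SVC(kernel="rbf").fit(X, y)

# 2. Draw a synthetic set from a broad proposal distribution covering the
#    input space (here a Gaussian roughly matched to the input ranges),
#    and label it with the black box. No original training data is reused.
rng = np.random.default_rng(0)
X_syn = rng.normal(loc=X.mean(axis=0), scale=2 * X.std(axis=0),
                   size=(5000, X.shape[1]))
y_syn = original.predict(X_syn)

# 3. Fit the copy on the synthetic, oracle-labeled set. An interpretable
#    copy (a shallow tree) can then stand in for the opaque original.
copy = DecisionTreeClassifier(max_depth=5).fit(X_syn, y_syn)

# 4. Measure copy fidelity: agreement with the original on fresh points.
X_probe = rng.normal(loc=X.mean(axis=0), scale=2 * X.std(axis=0),
                     size=(1000, X.shape[1]))
fidelity = (copy.predict(X_probe) == original.predict(X_probe)).mean()
print(f"copy fidelity: {fidelity:.3f}")

Any interpretable model could stand in for the decision tree; the point is that the copy is fit exclusively on synthetic, oracle-labeled points, so no access to the original's parameters or training data is needed.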

Multi-fidelity classification using Gaussian processes: accelerating the prediction of large-scale computational models Machine Learning

Machine learning techniques typically rely on large datasets to create accurate classifiers. However, there are situations where data is scarce and expensive to acquire. This is the case for studies that rely on state-of-the-art computational models which typically take days to run, thus hindering the potential of machine learning tools. In this work, we present a novel classifier that takes advantage of lower-fidelity models and inexpensive approximations to predict the binary output of expensive computer simulations. We postulate an autoregressive model between the different levels of fidelity with Gaussian process priors. We adopt a fully Bayesian treatment of the hyper-parameters and use Markov chain Monte Carlo samplers. We take advantage of the probabilistic nature of the classifier to implement active learning strategies. We also introduce a sparse approximation to enhance the ability of the multi-fidelity classifier to handle large datasets. We test these multi-fidelity classifiers against their single-fidelity counterparts with synthetic data, showing a median computational cost reduction of 23% for a target accuracy of 90%. In an application to cardiac electrophysiology, the multi-fidelity classifier achieves an F1 score, the harmonic mean of precision and recall, of 99.6%, compared to 74.1% for a single-fidelity classifier when both are trained with 50 samples. In general, our results show that the multi-fidelity classifiers outperform their single-fidelity counterparts in terms of accuracy in all cases. We envision that this new tool will enable researchers to study classification problems that would otherwise be prohibitively expensive. Source code is available at
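
The following sketch illustrates the multi-fidelity idea in its simplest form, not the paper's fully Bayesian MCMC treatment: a Gaussian process classifier trained on many cheap low-fidelity labels feeds its predictive probability as an extra input to a second Gaussian process trained on few expensive high-fidelity labels, mimicking the autoregressive link between fidelity levels. Both simulators and all sample sizes are made up for illustration.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)

def expensive_sim(x):   # stand-in for a costly computer model
    return (np.sin(3 * x[:, 0]) + 0.3 * x[:, 1] > 0).astype(int)

def cheap_sim(x):       # biased low-fidelity approximation of the above
    return (np.sin(3 * x[:, 0]) > 0.1).astype(int)

X_lo = rng.uniform(-2, 2, size=(300, 2))   # many cheap runs
X_hi = rng.uniform(-2, 2, size=(50, 2))    # few expensive runs

gp_lo = GaussianProcessClassifier(kernel=1.0 * RBF(1.0))
gp_lo.fit(X_lo, cheap_sim(X_lo))

# Augment high-fidelity inputs with the low-fidelity predictive probability.
def augment(x):
    return np.hstack([x, gp_lo.predict_proba(x)[:, [1]]])

gp_hi = GaussianProcessClassifier(kernel=1.0 * RBF(1.0))
gp_hi.fit(augment(X_hi), expensive_sim(X_hi))

X_test = rng.uniform(-2, 2, size=(1000, 2))
acc = (gp_hi.predict(augment(X_test)) == expensive_sim(X_test)).mean()
print(f"multi-fidelity test accuracy: {acc:.3f}")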

Active Expansion Sampling for Learning Feasible Domains in an Unbounded Input Space Machine Learning

Many engineering problems require identifying feasible domains under implicit constraints. One example is finding acceptable car body styling designs based on constraints like aesthetics and functionality. Current active-learning-based methods learn feasible domains for bounded input spaces. However, we usually lack prior knowledge about how to set those input variable bounds. Bounds that are too small will fail to cover all feasible domains, while bounds that are too large will waste the query budget. To avoid this problem, we introduce Active Expansion Sampling (AES), a method that identifies (possibly disconnected) feasible domains over an unbounded input space. AES progressively expands our knowledge of the input space, and uses successive exploitation and exploration stages to switch between learning the decision boundary and searching for new feasible domains. We show that AES has a misclassification loss guarantee within the explored region, independent of the number of iterations or labeled samples. Thus it can be used for real-time prediction of samples' feasibility within the explored region. We evaluate AES on three test examples and compare it with two adaptive sampling methods -- the Neighborhood-Voronoi algorithm and the straddle heuristic -- that operate over fixed input variable bounds.
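
A heavily simplified sketch of the expansion idea (not the published AES algorithm, which has specific acquisition and stopping rules): starting from seed queries near the origin, each iteration either explores for a first feasible point or queries the most uncertain candidate on a ring just outside the explored region, so knowledge of the feasibility boundary grows outward without pre-set input bounds. The constraint function and all constants are hypothetical.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

def feasible(x):
    # Hypothetical implicit constraint: a disk of feasible designs at (2, 0).
    return (np.linalg.norm(x - np.array([2.0, 0.0]), axis=1) < 1.5).astype(int)

rng = np.random.default_rng(2)
X = rng.normal(scale=0.5, size=(10, 2))     # initial queries near the origin
y = feasible(X)
radius = 1.0                                 # radius of the explored region

for it in range(40):
    if len(np.unique(y)) < 2:
        # Exploration stage: no boundary found yet, query a random point
        # inside a slightly enlarged explored disk.
        r = rng.uniform(0, radius + 0.3)
        a = rng.uniform(0, 2 * np.pi)
        x_next = np.array([[r * np.cos(a), r * np.sin(a)]])
    else:
        # Exploitation stage: on a ring just outside the explored region,
        # query the candidate the GP is least sure about (p closest to 0.5).
        gp = GaussianProcessClassifier(kernel=1.0 * RBF(1.0)).fit(X, y)
        angles = rng.uniform(0, 2 * np.pi, size=200)
        ring = (radius + 0.3) * np.c_[np.cos(angles), np.sin(angles)]
        p = gp.predict_proba(ring)[:, 1]
        x_next = ring[[np.argmin(np.abs(p - 0.5))]]
    X = np.vstack([X, x_next])
    y = np.append(y, feasible(x_next))
    radius = max(radius, float(np.linalg.norm(x_next)))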

Limitations of Assessing Active Learning Performance at Runtime Machine Learning

Classification algorithms aim to predict an unknown label (e.g., a quality class) for a new instance (e.g., a product). To this end, training samples (instances and labels) are used to deduce classification hypotheses. Often, it is relatively easy to capture instances, but the acquisition of the corresponding labels remains difficult or expensive. Active learning algorithms select the most beneficial instances to be labeled in order to reduce cost. In research, this labeling procedure is simulated and a ground truth is therefore available. But during deployment, active learning is a one-shot problem and an evaluation set is not available. Hence, it is not possible to reliably estimate the performance of the classification system during learning, and it is difficult to decide when the system fulfills the quality requirements (stopping criteria). In this article, we formalize the task and review existing strategies to assess the performance of an actively trained classifier during training. Furthermore, we identify three major challenges: 1) to derive a performance distribution, 2) to preserve representativeness of the labeled subset, and 3) to correct against the sampling bias induced by an intelligent selection strategy. In a qualitative analysis, we evaluate different existing approaches and show that none of them reliably estimates active learning performance, which remains a major challenge for future research on such systems. All plots and experiments are provided in a Jupyter notebook that is available for download.
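
The sketch below reproduces the core difficulty in miniature: after an uncertainty-sampling run, a cross-validation estimate computed on the actively selected labels is compared with the test accuracy that would be unavailable during deployment. The dataset, learner, and labeling budget are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

# Seed labels from both classes, then run uncertainty sampling.
labeled = list(np.where(y_pool == 0)[0][:5]) + list(np.where(y_pool == 1)[0][:5])
for step in range(90):
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    p = clf.predict_proba(X_pool)[:, 1]
    p[labeled] = np.inf                  # mask already-labeled instances
    labeled.append(int(np.argmin(np.abs(p - 0.5))))

clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
cv_est = cross_val_score(clf, X_pool[labeled], y_pool[labeled], cv=5).mean()
true_acc = clf.score(X_test, y_test)     # unavailable in deployment
print(f"runtime CV estimate: {cv_est:.3f}  vs  true accuracy: {true_acc:.3f}")

Because uncertainty sampling concentrates labels near the decision boundary, the labeled subset is not representative of the pool, which is exactly the sampling bias the third challenge above refers to.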

Discriminative Active Learning Machine Learning

We propose a new batch-mode active learning algorithm designed for neural networks and large query batch sizes. The method, Discriminative Active Learning (DAL), poses active learning as a binary classification task, attempting to choose examples to label in such a way as to make the labeled set and the unlabeled pool indistinguishable. Experimenting on image classification tasks, we empirically show our method to be on par with state-of-the-art methods at medium and large query batch sizes, while being simple to implement and easy to extend to domains other than classification. Our experiments also show that none of today's state-of-the-art methods are clearly better than uncertainty sampling when the batch size is relatively large, calling into question some of the results reported in the recent literature.
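
A minimal sketch of the selection rule implied by this framing: a binary discriminator is trained to tell labeled from unlabeled instances, and the batch consists of the unlabeled points it most confidently flags as "unlabeled", i.e., those least represented by the current labeled set. The MLP discriminator, the raw feature space, and the single-pass selection are simplifications made here for illustration.

import numpy as np
from sklearn.neural_network import MLPClassifier

def discriminative_batch(X_labeled, X_unlabeled, batch_size):
    # Class 0 = labeled, class 1 = unlabeled.
    X_disc = np.vstack([X_labeled, X_unlabeled])
    y_disc = np.r_[np.zeros(len(X_labeled)), np.ones(len(X_unlabeled))]
    disc = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                         random_state=0).fit(X_disc, y_disc)
    # Select the unlabeled points with the highest P(unlabeled).
    p_unl = disc.predict_proba(X_unlabeled)[:, 1]
    return np.argsort(-p_unl)[:batch_size]

# Toy usage: the labeled set is deliberately biased towards one region,
# so the discriminator flags the under-represented region for labeling.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
lab_mask = X[:, 0] > 1.0                 # biased initial labeling
idx = discriminative_batch(X[lab_mask], X[~lab_mask], batch_size=20)
print("indices (into the unlabeled pool) to label next:", idx)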