Inductive Learning
Supervised Learning: No Loss No Cry
Nock, Richard, Menon, Aditya Krishna
Supervised learning requires the specification of a loss function to minimise. While the theory of admissible losses from both a computational and statistical perspective is well-developed, these offer a panoply of different choices. In practice, this choice is typically made in an \emph{ad hoc} manner. In hopes of making this procedure more principled, the problem of \emph{learning the loss function} for a downstream task (e.g., classification) has garnered recent interest. However, works in this area have been generally empirical in nature. In this paper, we revisit the {\sc SLIsotron} algorithm of Kakade et al. (2011) through a novel lens, derive a generalisation based on Bregman divergences, and show how it provides a principled procedure for learning the loss. In detail, we cast {\sc SLIsotron} as learning a loss from a family of composite square losses. By interpreting this through the lens of \emph{proper losses}, we derive a generalisation of {\sc SLIsotron} based on Bregman divergences. The resulting {\sc BregmanTron} algorithm jointly learns the loss along with the classifier. It comes equipped with a simple guarantee of convergence for the loss it learns, and its set of possible outputs comes with a guarantee of agnostic approximability of Bayes rule. Experiments indicate that the {\sc BregmanTron} substantially outperforms the {\sc SLIsotron}, and that the loss it learns can be minimized by other algorithms for different tasks, thereby opening the interesting problem of \textit{loss transfer} between domains.
Parallel Performance-Energy Predictive Modeling of Browsers: Case Study of Servo
Zambre, Rohit, Bergstrom, Lars, Beni, Laleh Aghababaie, Chandramowliswharan, Aparna
Mozilla Research is developing Servo, a parallel web browser engine, to exploit the benefits of parallelism and concurrency in the web rendering pipeline. Parallelization results in improved performance for pinterest.com but not for google.com. This is because the workload of a browser is dependent on the web page it is rendering. In many cases, the overhead of creating, deleting, and coordinating parallel work outweighs any of its benefits. In this paper, we model the relationship between web page primitives and a web browser's parallel performance using supervised learning. We discover a feature space that is representative of the parallelism available in a web page and characterize it using seven key features. Additionally, we consider energy usage trade-offs for different levels of performance improvements using automated labeling algorithms. Such a model allows us to predict the degree of parallelism available in a web page and decide whether or not to render a web page in parallel. This modeling is critical for improving the browser's performance and minimizing its energy usage. We evaluate our model by using Servo's layout stage as a case study. Experiments on a quad-core Intel Ivy Bridge (i7-3615QM) laptop show that we can improve performance and energy usage by up to 94.52% and 46.32% respectively on the 535 web pages considered in this study. Looking forward, we identify opportunities to apply this model to other stages of a browser's architecture as well as other performance- and energy-critical devices.
A Survey on Causal Inference
Yao, Liuyi, Chu, Zhixuan, Li, Sheng, Li, Yaliang, Gao, Jing, Zhang, Aidong
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics, for decades. Nowadays, estimating causal effect from observational data has become an appealing research direction owing to the large amount of available data and low budget requirement, compared with randomized controlled trials. Embraced with the rapidly developed machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well known causal inference framework. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including the applications in advertising, recommendation, medicine and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summarized, which facilitate researchers and practitioners to explore, evaluate and apply the causal inference methods.
Exploratory Machine Learning with Unknown Unknowns
Zhang, Yu-Jie, Zhao, Peng, Zhou, Zhi-Hua
In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model will classify unseen instances to the known labels. In this paper, we study a new problem setting in which there are unknown classes in the training dataset misperceived as other labels, and thus their existence appears unknown from the given supervision. We attribute the unknown unknowns to the fact that the training dataset is badly advised by the incompletely perceived label space due to the insufficient feature information. To this end, we propose the exploratory machine learning, which examines and investigates the training dataset by actively augmenting the feature space to discover potentially unknown labels. Our approach consists of three ingredients including rejection model, feature acquisition, and model cascade. The effectiveness is validated on both synthetic and real datasets.
Introduction to quasi-open set semi-supervised learning for big data analytics
Engelbrecht, Emile R., Preez, Johan A. du
State-of-the-art performance and low system complexity has made deep-learning an increasingly attractive solution for big data analytics. However, limiting assumptions of end-to-end learning regimes hinder the use of neural networks on large application-grade datasets. This work addresses the assumption that output class-labels are defined for all classes in the domain. The amount of data collected by modern-day sensors span over an incomprehensible range of potential classes. Therefore, we propose a new learning regime where only some, but not all, classes of the training data are of interest to the classification system. The semi-supervised learning scenario in big data requires the assumption of a partial class mismatch between labelled and unlabelled training data. With classification systems required to classify source classes indicated by labelled samples while separating novel classes indicated by unlabelled samples, we find ourselves in an open-set case (vs closed set with only source classes). However, introducing samples from novel classes into the training set indicates a more relaxed open-set case. As such, our proposed regime of \textit{quasi-open set semi-supervised learning} is introduced. We propose a suitable method to train under quasi-open set semi-supervised learning that makes use of Wasserstein generative adversarial networks (WGANs). A trained classification certainty estimation within the discriminator (or critic) network is used to enable a reject option for the classifier. By placing a threshold on this certainty estimation, the reject option accepts classifications of source classes and rejects novel classes. Big data end-to-end training is promoted by developing models that recognize input samples do not necessarily belong to output labels. We believe this essential for big data analytics, and urge more work under quasi-open set semi-supervised learning.
Revisiting Meta-Learning as Supervised Learning
Chao, Wei-Lun, Ye, Han-Jia, Zhan, De-Chuan, Campbell, Mark, Weinberger, Kilian Q.
Recent years have witnessed an abundance of new publications and approaches on meta-learning. This community-wide enthusiasm has sparked great insights but has also created a plethora of seemingly different frameworks, which can be hard to compare and evaluate. In this paper, we aim to provide a principled, unifying framework by revisiting and strengthening the connection between meta-learning and traditional supervised learning. By treating pairs of task-specific data sets and target models as (feature, label) samples, we can reduce many meta-learning algorithms to instances of supervised learning. This view not only unifies meta-learning into an intuitive and practical framework but also allows us to transfer insights from supervised learning directly to improve meta-learning. For example, we obtain a better understanding of generalization properties, and we can readily transfer well-understood techniques, such as model ensemble, pre-training, joint training, data augmentation, and even nearest neighbor based methods. We provide an intuitive analogy of these methods in the context of meta-learning and show that they give rise to significant improvements in model performance on few-shot learning.
Fase-AL -- Adaptation of Fast Adaptive Stacking of Ensembles for Supporting Active Learning
Ortiz-Dรญaz, Agustรญn Alejandro, Baldo, Fabiano, Mariรฑo, Laura Marรญa Palomino, Cabrera, Alberto Verdecia
Classification algorithms to mine data stream have been extensively studied in recent years. However, a lot of these algorithms are designed for supervised learning which requires labeled instances. Nevertheless, the labeling of the data is costly and time-consuming. Because of this, alternative learning paradigms have been proposed to reduce the cost of the labeling process without significant loss of model performance. Active learning is one of these paradigms, whose main objective is to build classification models that request the lowest possible number of labeled examples achieving adequate levels of accuracy. Therefore, this work presents the FASE-AL algorithm which induces classification models with non-labeled instances using Active Learning. FASE-AL is based on the algorithm Fast Adaptive Stacking of Ensembles (FASE). FASE is an ensemble algorithm that detects and adapts the model when the input data stream has concept drift. FASE-AL was compared with four different strategies of active learning found in the literature. Real and synthetic databases were used in the experiments. The algorithm achieves promising results in terms of the percentage of correctly classified instances.
Machine Learning Necessary for Deep Learning
In another article, we touched a bit on generalization. What is the relationship between the generalization error and the training error? Generalization is the concept of the machine learning algorithm being able to produce good predictions on previously unseen inputs. The red line represents the training error. If the horizontal axis is the quantity of training examples or time, depending on how you like to think about it, then with time this training error gets smaller and smaller.
Margin Maximization as Lossless Maximal Compression
Nikolaou, Nikolaos, Reeve, Henry, Brown, Gavin
The ultimate goal of a supervised learning algorithm is to produce models constructed on the training data that can generalize well to new examples. In classification, functional margin maximization -- correctly classifying as many training examples as possible with maximal confidence --has been known to construct models with good generalization guarantees. This work gives an information-theoretic interpretation of a margin maximizing model on a noiseless training dataset as one that achieves lossless maximal compression of said dataset -- i.e. extracts from the features all the useful information for predicting the label and no more. The connection offers new insights on generalization in supervised machine learning, showing margin maximization as a special case (that of classification) of a more general principle and explains the success and potential limitations of popular learning algorithms like gradient boosting. We support our observations with theoretical arguments and empirical evidence and identify interesting directions for future work.
A Primer on Domain Adaptation
Lemberger, Pirmin, Panico, Ivan
Standard supervised machine learning assumes that the distribution of the source samples used to train an algorithm is the same as the one of the target samples on which it is supposed to make predictions. However, as any data scientist will confirm, this is hardly ever the case in practice. The set of statistical and numerical methods that deal with such situations is known as domain adaptation, a field with a long and rich history. The myriad of methods available and the unfortunate lack of a clear and universally accepted terminology can however make the topic rather daunting for the newcomer. Therefore, rather than aiming at completeness, which leads to exhibiting a tedious catalog of methods, this pedagogical review aims at a coherent presentation of four important special cases: (1) \emph{prior shift}, a situation in which training samples were selected according to their labels without any knowledge of their actual distribution in the target, (2) \emph{covariate shift} which deals with a situation where training examples were picked according to their features but with some selection bias, (3) \emph{concept shift} where the dependence of the labels on the features defers between the source and the target, and last but not least (4) \emph{subspace mapping} which deals with a situation where features in the target have been subjected to an unknown distortion with respect to the source features. In each case we first build an intuition, next we provide the appropriate mathematical framework and eventually we describe a practical application.