Inductive Learning
Cross-Modal Data Programming Enables Rapid Medical Machine Learning
Dunnmon, Jared, Ratner, Alexander, Khandwala, Nishith, Saab, Khaled, Markert, Matthew, Sagreiya, Hersh, Goldman, Roger, Lee-Messer, Christopher, Lungren, Matthew, Rubin, Daniel, Ré, Christopher
Department of Biomedical Data Science, Stanford University, Stanford, California, USA Labeling training datasets has become a key barrier to building medical machine learning models. One strategy is to generate training labels programmatically, for example by applying natural language processing pipelines to text reports associated with imaging studies. We propose cross-modal data programming, which generalizes this intuitive strategy in a theoretically-grounded way that enables simpler, clinician-driven input, reduces required labeling time, and improves with additional unlabeled data. In this approach, clinicians generate training labels for models defined over a target modality (e.g. The resulting technical challenge consists of estimating the accuracies and correlations of these rules; we extend a recent unsupervised generative modeling technique to handle this cross-modal setting in a provably consistent way. Across four applications in radiography, computed tomography, and electroencephalography, and using only several hours of clinician time, our approach matches or exceeds the efficacy of physician-months of hand-labeling with statistical significance, demonstrating a fundamentally faster and more flexible way of building machine learning models in medicine. In addition to being extremely costly, these training sets are inflexible: given a new classification schema, imaging system, patient population, or other change in the data distribution or modeling task, the training set generally needs to be relabeled from scratch. One manifestation of this shift in the broader machine learning community is the increasing use of weak supervision approaches, where training data is labeled in noisier, higher-level, often programmatic ways, rather than manually by experts. We broadly characterize these methods as cross-modal weak supervision approaches, in which the strategy is to programmatically extract labels from an auxiliary modality--e.g. the unstructured text reports accompanying an imaging study--which are then used as training labels for a model defined over the target modality, e.g. These methods follow the intuition that programmatically extracting labels from the auxiliary modality can be far faster and easier than hand-labeling or deriving labels from the target modality directly.
Sparse Learning for Variable Selection with Structures and Nonlinearities
In this thesis we discuss machine learning methods performing automated variable selection for learning sparse predictive models. There are multiple reasons for promoting sparsity in the predictive models. By relying on a limited set of input variables the models naturally counteract the overfitting problem ubiquitous in learning from finite sets of training points. Sparse models are cheaper to use for predictions, they usually require lower computational resources and by relying on smaller sets of inputs can possibly reduce costs for data collection and storage. Sparse models can also contribute to better understanding of the investigated phenomenons as they are easier to interpret than full models.
Machine Learning Methods Economists Should Know About
We discuss the relevance of the recent Machine Learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the machine learning literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, as well as matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics, methods that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, problems that include causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.
Comparison of Hand-held WEMI Target Detection Algorithms
McCurley, Connor H., Bocinsky, James, Zare, Alina
Wide-band Electromagnetic Induction Sensors (WEMI) have been used for a number of years in subsurface detection of explosive hazards. While WEMI sensors have proven effective at localizing objects exhibiting large magnetic responses, detecting objects lacking or containing very low amounts of conductive materials can be challenging. In this paper, we compare a number of target detection algorithms in the literature in terms of detection performance. In the comparison, methods are tested on two real-world data sets: one containing relatively low amounts of ground noise pollution, and the other demonstrating highly-magnetic soil interference. Results are quantitatively evaluated through receiver-operator characteristic (ROC) curves and are used to highlight the strengths and weaknesses of the compared approaches in hand-held explosive hazard detection.
CS 229 - Supervised Learning Cheatsheet
Given a set of data points $\{x {(1)}, ..., x {(m)}\}$ associated to a set of outcomes $\{y {(1)}, ..., y {(m)}\}$, we want to build a classifier that learns how to predict $y$ from $x$. Hypothesis ― The hypothesis is noted $h_\theta$ and is the model that we choose. For a given input data $x {(i)}$ the model prediction output is $h_\theta(x {(i)})$. Loss function ― A loss function is a function $L:(z,y)\in\mathbb{R}\times Y\longmapsto L(z,y)\in\mathbb{R}$ that takes as inputs the predicted value $z$ corresponding to the real data value $y$ and outputs how different they are. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.
Teaching with IMPACT
Trimbach, Carl, Littman, Michael
Like many problems in AI in their general form, supervised learning is computationally intractable. We hypothesize that an important reason humans can learn highly complex and varied concepts, in spite of the computational difficulty, is that they benefit tremendously from experienced and insightful teachers. This paper proposes a new learning framework that provides a role for a knowledgeable, benevolent teacher to guide the process of learning a target concept in a series of "curricular" phases or rounds. In each round, the teacher's role is to act as a moderator, exposing the learner to a subset of the available training data to move it closer to mastering the target concept. Via both theoretical and empirical evidence, we argue that this framework enables simple, efficient learners to acquire very complex concepts from examples. In particular, we provide multiple examples of concept classes that are known to be unlearnable in the standard PAC setting along with provably efficient algorithms for learning them in our extended setting. A key focus of our work is the ability to learn complex concepts on top of simpler, previously learned, concepts---a direction with the potential of creating more competent artificial agents.
Comparing Different Classification Machine Learning Models for an imbalanced dataset
A data set is called imbalanced if it contains many more samples from one class than from the rest of the classes. Data sets are unbalanced when at least one class is represented by only a small number of training examples (called the minority class) while other classes make up the majority. In this scenario, classifiers can have good accuracy on the majority class but very poor accuracy on the minority class(es) due to the influence that the larger majority class. The common example of such dataset is credit card fraud detection, where data points for fraud 1, are usually very less in comparison to fraud 0. There are many reasons why a dataset might be imbalanced: the category one is targeting might be very rare in the population, or the data might simply be difficult to collect. Let's solve the problem of an imbalanced dataset by working on one such dataset.
$L^\gamma$-PageRank for Semi-Supervised Learning
Bautista, Esteban, Abry, Patrice, Gonçalves, Paulo
PageRank for Semi-Supervised Learning has shown to leverage data structures and limited tagged examples to yield meaningful classification. Despite successes, classification performance can still be improved, particularly in cases of fuzzy graphs or unbalanced labeled data. To address such limitations, a novel approach based on powers of the Laplacian matrix $L^\gamma$ ($\gamma > 0$), referred to as $L^\gamma$-PageRank, is proposed. Its theoretical study shows that it operates on signed graphs, where nodes belonging to one same class are more likely to share positive edges while nodes from different classes are more likely to be connected with negative edges. It is shown that by selecting an optimal $\gamma$, classification performance can be significantly enhanced. A procedure for the automated estimation of the optimal $\gamma$, from a unique observation of data, is devised and assessed. Experiments on several datasets demonstrate the effectiveness of both $L^\gamma$-PageRank classification and the optimal $\gamma$ estimation.
Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification
Borkan, Daniel, Dixon, Lucas, Sorensen, Jeffrey, Thain, Nithum, Vasserman, Lucy
Machine learning systems, if not constrained, will compounding existing challenges to fairness in society often learn the simplest associations that can predict the labels, so at large. In this paper, we introduce a suite of threshold-agnostic any incorrect associations present in the training data can produce metrics that provide a nuanced view of this unintended bias, by unintended associations in the final model. Toxicity models specifically considering the various ways that a classifier's score distribution have been shown to capture and reproduce biases common can vary across designated groups. We also introduce a large new in society, for example mis-associating the names of frequently test set of online comments with crowd-sourced annotations for attacked identity groups (such as "gay", and "muslim" etc.) with identity references. We use this to show how our metrics can be toxicity [5, 17]. This unintended model bias could be due to the used to find new and potentially subtle unintended bias in existing demographic composition of the online user pool, the latent or public models.
AI did my homework
On Valentine's Day, a non-profit research company called OpenAI, gifted us a paper with a blog post that rocked my world as an educator. We've trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization -- all without task-specific training. In other words, they had demonstrated how a language processing AI could learn, from millions of webpages, how to undertake written tasks (some of which are of reasonably high quality in terms of sense and coherence) without specifically being trained to do this via a supervised learning process. To understand the implications of the research, it is worth trying to get to grips with what supervised and unsupervised machine learning is (for someone like me this is a steep learning curve!). In supervised learning problems, we start with a data set containing training examples with associated correct labels.