 Inductive Learning


Recovering from Biased Data: Can Fairness Constraints Improve Accuracy?

arXiv.org Artificial Intelligence

Multiple fairness constraints have been proposed in the literature, motivated by a range of concerns about how demographic groups might be treated unfairly by machine learning classifiers. In this work we consider a different motivation: learning from biased training data. We posit several ways in which training data may be biased, including a noisier or negatively biased labeling process for members of a disadvantaged group, a decreased prevalence of positive or negative examples from the disadvantaged group, or both. Given such biased training data, Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. In particular, we find that the Equal Opportunity fairness constraint (Hardt, Price, and Srebro 2016) combined with ERM will provably recover the Bayes Optimal Classifier under a range of bias models. We also consider other recovery methods, including reweighting the training data, Equalized Odds, and Demographic Parity. These theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.
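As a rough illustration of what the Equal Opportunity constraint compares, the sketch below measures the gap in true positive rate between groups for a fixed set of predictions. It is not the paper's fairness-constrained ERM procedure; the function names and the toy labels, predictions, and group assignments are assumptions made up for this example.

```python
def true_positive_rate(y_true, y_pred):
    """Fraction of truly positive examples that were predicted positive."""
    positive_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not positive_preds:
        raise ValueError("group has no positive examples")
    return sum(positive_preds) / len(positive_preds)


def equal_opportunity_gap(y_true, y_pred, group):
    """Largest difference in true positive rate between any two groups."""
    rates = {}
    for g in set(group):
        idx = [i for i, gi in enumerate(group) if gi == g]
        rates[g] = true_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: group "b" has one positive example that is missed.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
group = ["a", "a", "a", "b", "b", "b"]

gap = equal_opportunity_gap(y_true, y_pred, group)
print(gap)  # 0.5: TPR is 1.0 for group "a" and 0.5 for group "b"
```

A classifier satisfies Equal Opportunity exactly when this gap is zero; a constrained learner would restrict ERM to classifiers whose gap is zero or below some tolerance.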


3 Main Approaches to Machine Learning Models - KDnuggets

#artificialintelligence

In September 2018, I published a blog about my forthcoming book on The Mathematical Foundations of Data Science. The central question we address is: How can we bridge the gap between mathematics needed for Artificial Intelligence (Deep Learning and Machine learning) with that taught in high schools (up to ages 17/18)? In this post, we present a chapter from this book called "A Taxonomy of Machine Learning Models." The book is now available for an early bird discount released as chapters. If you are interested in getting early discounted copies, please contact ajit.jaokar at feynlabs.ai.


Semi-Supervised Learning for Text Classification by Layer Partitioning

arXiv.org Machine Learning

Most recent neural semi-supervised learning algorithms rely on adding small perturbations to either the input vectors or their representations. These methods have been successful on computer vision tasks, as images form a continuous manifold, but they are not appropriate for discrete inputs such as sentences. To adapt these methods to text input, we propose to decompose a neural network $M$ into two components $F$ and $U$ so that $M = U\circ F$. The layers in $F$ are then frozen and only the layers in $U$ are updated during most of the training. In this way, $F$ serves as a feature extractor that maps the input to a high-level representation and adds systematic noise using dropout. We can then train $U$ using any state-of-the-art SSL algorithm, such as the $\Pi$-model, temporal ensembling, or mean teacher. Furthermore, this gradual unfreezing schedule also protects a pretrained model from catastrophic forgetting. The experimental results demonstrate that our approach provides improvements over state-of-the-art methods, especially on short texts.
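The decomposition $M = U\circ F$ with a frozen $F$ can be sketched in a few lines. The example below is a toy stand-in, not the authors' model: `F` is a single fixed ReLU layer followed by dropout, `U` is a linear layer trained with plain squared-error gradient steps instead of an SSL objective, and all weights, inputs, and hyperparameters are invented for illustration.

```python
import copy
import random

rng = random.Random(0)

# F: frozen feature extractor (one ReLU layer, then dropout as the noise source).
F_weights = [[0.5, -0.2], [0.1, 0.4]]
F_weights_before = copy.deepcopy(F_weights)  # kept only to show F never changes


def F(x, train):
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in F_weights]
    if train:
        keep = 0.5  # inverted dropout: rescale the units that survive
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return hidden


# U: the trainable part of the network (a single linear unit here).
U_weights = [0.0, 0.0]


def U(h):
    return sum(w * hi for w, hi in zip(U_weights, h))


def M(x, train=False):
    """The full model is the composition U(F(x))."""
    return U(F(x, train))


def train_step(x, target, lr=0.1):
    h = F(x, train=True)           # noisy features from the frozen layers
    error = U(h) - target
    for j, hj in enumerate(h):     # gradient of 0.5 * error**2 w.r.t. U_weights[j]
        U_weights[j] -= lr * error * hj


for _ in range(50):
    train_step([1.0, 1.0], target=1.0)

print(F_weights == F_weights_before)  # True: only U was updated
```

In a real setting the squared-error step would be replaced by the chosen SSL loss, and layers of `F` would be unfrozen gradually according to the schedule the abstract mentions.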


Word-Class Embeddings for Multiclass Text Classification

arXiv.org Machine Learning

Pre-trained word embeddings encode general word semantics and lexical regularities of natural language, and have proven useful across many NLP tasks, including word sense disambiguation, machine translation, and sentiment analysis, to name a few. In supervised tasks such as multiclass text classification (the focus of this article) it seems appealing to enhance word representations with ad-hoc embeddings that encode task-specific information. We propose (supervised) word-class embeddings (WCEs), and show that, when concatenated to (unsupervised) pre-trained word embeddings, they substantially facilitate the training of deep-learning models in multiclass classification by topic. We show empirical evidence that WCEs yield a consistent improvement in multiclass classification accuracy, using four popular neural architectures and six widely used and publicly available datasets for multiclass text classification. Our code that implements WCEs is publicly available at https://github.com/AlexMoreo/word-class-embeddings
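The idea of concatenating a supervised, class-aware vector to each pre-trained word vector can be sketched as follows. This is only an illustration under stated assumptions: the word-class vector here is the fraction of a word's documents that belong to each class, which is a simple stand-in and not necessarily the correlation measure used in the paper, and the documents, labels, and "pre-trained" vectors are made up. The authors' actual implementation is in the repository linked above.

```python
def word_class_embeddings(docs, labels, n_classes):
    """Map each word to the fraction of its documents that fall in each class."""
    counts = {}
    for tokens, label in zip(docs, labels):
        for word in set(tokens):  # count each word once per document
            counts.setdefault(word, [0] * n_classes)[label] += 1
    return {w: [c / sum(row) for c in row] for w, row in counts.items()}


def concatenate(pretrained, wce, n_classes):
    """Append the word-class vector (zeros if unseen) to each pre-trained vector."""
    return {w: vec + wce.get(w, [0.0] * n_classes) for w, vec in pretrained.items()}


# Hypothetical toy corpus: class 0 = sport, class 1 = politics.
docs = [["goal", "match"], ["match", "vote"], ["vote", "law"]]
labels = [0, 1, 1]
pretrained = {"goal": [0.1, 0.2], "match": [0.3, 0.1], "vote": [0.0, 0.5]}

wce = word_class_embeddings(docs, labels, n_classes=2)
enriched = concatenate(pretrained, wce, n_classes=2)
print(enriched["match"])  # [0.3, 0.1, 0.5, 0.5]
```

The enriched vectors have the pre-trained dimensions followed by one dimension per class, so a downstream network sees both general semantics and task-specific signal.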


Discriminative training of conditional random fields with probably submodular constraints

arXiv.org Machine Learning

Problems of segmentation, denoising, registration and 3D reconstruction are often addressed with the graph cut algorithm. However, solving an unconstrained graph cut problem is NP-hard. For tractable optimization, pairwise potentials have to fulfill the submodularity inequality. In our learning paradigm, pairwise potentials are created as the dot product of a learned vector w with positive feature vectors. In order to constrain such a model to remain tractable, previous approaches have enforced the weight vector to be positive for pairwise potentials in which the labels differ, and set pairwise potentials to zero in the case that the label remains the same. Such constraints are sufficient to guarantee that the resulting pairwise potentials satisfy the submodularity inequality. However, we show that such an approach unnecessarily restricts the capacity of the learned models. Guaranteeing submodularity for all possible inputs, no matter how improbable, reduces inference error to effectively zero, but increases model error. In contrast, we relax the requirement of guaranteed submodularity to solutions that are probably approximately submodular. We show that the conceptually simple strategy of enforcing submodularity on the training examples guarantees with low sample complexity that test images will also yield submodular pairwise potentials. Results are presented in the binary and multiclass settings, showing substantial improvement from the resulting increased model capacity.
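For binary labels, the submodularity inequality on a pairwise potential $\theta$ reads $\theta(0,0) + \theta(1,1) \le \theta(0,1) + \theta(1,0)$. The sketch below checks this condition and builds a potential in the restricted style of the previous approaches described above (zero when labels agree, a dot product of non-negative weights with positive features when they differ). The function names and numeric values are assumptions chosen for illustration.

```python
def is_submodular(theta):
    """Check the binary submodularity inequality for a 2x2 pairwise potential.

    theta[a][b] is the cost of assigning labels a and b to neighbouring nodes.
    """
    return theta[0][0] + theta[1][1] <= theta[0][1] + theta[1][0]


def restricted_potential(w, features):
    """Potential in the restricted form: zero cost when labels agree, and
    w . features when they differ. Non-negative w with positive features
    makes the off-diagonal cost non-negative, hence submodular."""
    differ_cost = sum(wi * fi for wi, fi in zip(w, features))
    return [[0.0, differ_cost], [differ_cost, 0.0]]


safe = restricted_potential(w=[0.5, 1.5], features=[2.0, 1.0])
print(is_submodular(safe))  # True

# An unrestricted potential may violate the inequality for some inputs.
unrestricted = [[2.0, 0.5], [0.5, 2.0]]
print(is_submodular(unrestricted))  # False
```

Enforcing the check only on potentials that arise from training examples, rather than for every conceivable input, is the relaxation the abstract describes.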


Few-Shot Knowledge Graph Completion

arXiv.org Artificial Intelligence

Knowledge graphs (KGs) serve as useful resources for various natural language processing applications. Previous KG completion approaches require a large number of training instances (i.e., head-tail entity pairs) for every relation. In practice, however, very few entity pairs are available for most relations. Existing work on one-shot learning generalizes poorly to few-shot scenarios and does not fully use the supervisory information, and few-shot KG completion has not yet been well studied. In this work, we propose a novel few-shot relation learning model (FSRL) that aims at discovering facts of new relations with few-shot references. FSRL can effectively capture knowledge from heterogeneous graph structure, aggregate representations of few-shot references, and match similar entity pairs of the reference set for every relation. Extensive experiments on two public datasets demonstrate that FSRL outperforms the state-of-the-art.

Introduction

Large-scale knowledge graphs (KGs) such as YAGO (Suchanek, Kasneci, and Weikum 2007), NELL (Carlson et al. 2010), and Wikidata (Vrandečić and Krötzsch 2014) usually represent facts in the form of relations (edges) between (head-tail) entity pairs (nodes). This kind of graph-structured knowledge is essential for many downstream applications such as search, question answering, and semantic web.


Corpus Wide Argument Mining -- a Working Solution

arXiv.org Artificial Intelligence

One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments over a wide range of topics, it must leverage a large and diverse corpus in an appropriate manner. Here we present a first end-to-end, high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high precision among top-ranked candidates.

1 Introduction

Starting with the seminal work of Mochales Palau and Moens (2009), argument mining has mainly focused on the following tasks: identifying argumentative text segments within a given document; labeling these text segments according to the type of argument and its stance; and elucidating the discourse relations among the detected arguments. Typically, the considered documents were argumentative in nature, taken from a well-defined domain, such as legal documents or student essays. More recently, some attention has been given to the corresponding retrieval task: given a controversial topic, retrieve arguments with a clear stance towards this topic. This is usually done by first retrieving, manually or automatically, documents relevant to the topic, and then using argument mining techniques to identify relevant argumentative segments therein. This documents-based approach was originally explored over Wikipedia (Levy et al. 2014; Rinott et al. 2015), and more recently over the entire Web (Stab et al. 2018). For an argument retrieval system to be of practical use, it requires: (1) high precision, and (2) wide coverage.


AnoNet: Weakly Supervised Anomaly Detection in Textured Surfaces

arXiv.org Machine Learning

Humans can easily detect a defect (anomaly) because it is different or salient when compared to the surface it resides on. Today, manual human visual inspection is still the norm because it is difficult to automate anomaly detection. Neural networks are a useful tool that can teach a machine to find defects. However, they require a lot of training examples to learn what a defect is, and it is tedious and expensive to get these samples. We tackle the problem of teaching a network with a low number of training samples with a system we call AnoNet. AnoNet's architecture is similar to CompactCNN with the exceptions that (1) it is a fully convolutional network and does not use strided convolution; (2) it is shallow and compact, which minimizes over-fitting by design; (3) the compact design constrains the size of intermediate features, which allows training to be done without image downsizing; (4) the model footprint is low, making it suitable for edge computation; and (5) the anomaly can be detected and localized despite the weak labelling. AnoNet learns to detect the underlying shape of the anomalies despite the weak annotation and preserves the spatial localization of the anomaly. Pre-seeding AnoNet with an engineered filter bank initialization technique reduces the total samples required for training and also achieves state-of-the-art performance. Compared to CompactCNN, AnoNet achieved a massive 94% reduction in network parameters, from 1.13 million to 64 thousand. Experiments were conducted on four data-sets and results were compared against CompactCNN and DeepLabv3. AnoNet improved the performance on average across all data-sets by 106% to an F1 score of 0.98 and by 13% to an AUROC value of 0.942. AnoNet can learn from a limited number of images. For one of the data-sets, AnoNet learnt to detect anomalies after a single pass through just 53 training images.
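The quoted parameter reduction can be checked with a line of arithmetic. The snippet uses only the two parameter counts stated in the abstract; the variable names are illustrative.

```python
params_compactcnn = 1_130_000  # 1.13 million parameters, as stated above
params_anonet = 64_000         # 64 thousand parameters, as stated above

reduction = 1 - params_anonet / params_compactcnn
print(f"{reduction:.1%}")  # 94.3%
```

The result rounds to the 94% figure given in the abstract.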


Pro Tips: How to deal with Class Imbalance and Missing Labels - KDnuggets

#artificialintelligence

"Any AI smart enough to pass a Turing test is smart enough to know to fail it." Suppose you are working on a high-impact yet challenging problem of malware classification. You have a large dataset at your disposal and are able to train a machine learning classifier with an accuracy of 98%. While suppressing your excitement, you convince the team to deploy the model, as who would resist a model with such an amazing performance? Quite disappointingly, the model fails to detect threats in the real world!?

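One common way a 98% accurate classifier can still miss every real threat is class imbalance. The sketch below assumes a hypothetical dataset in which only 2% of samples are malware and a degenerate model that labels everything benign; the data and the helper names are invented for illustration and are not taken from the article.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly positive samples that were predicted positive."""
    positive_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positive_preds) / len(positive_preds)


# Hypothetical imbalanced dataset: 20 malware samples (1), 980 benign (0).
y_true = [1] * 20 + [0] * 980
# A degenerate classifier that always predicts "benign".
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.98
print(recall(y_true, y_pred))    # 0.0
```

Accuracy rewards the model for the majority class alone, which is why metrics such as recall, precision, or F1 on the minority class are usually reported alongside it.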

Instance Cross Entropy for Deep Metric Learning

arXiv.org Machine Learning

Loss functions play a crucial role in deep metric learning, and thus a variety of them have been proposed. Some supervise the learning process by pairwise or tripletwise similarity constraints, while others take advantage of structured similarity information among multiple data points. In this work, we approach deep metric learning from a novel perspective. We propose instance cross entropy (ICE), which measures the difference between an estimated instance-level matching distribution and its ground-truth one. ICE has three main appealing properties. Firstly, similar to categorical cross entropy (CCE), ICE has a clear probabilistic interpretation and exploits structured semantic similarity information for learning supervision. Secondly, ICE is scalable to infinite training data, as it learns on mini-batches iteratively and is independent of the training set size. Thirdly, motivated by our relative weight analysis, seamless sample reweighting is incorporated. It rescales samples' gradients to control the differentiation degree over training examples instead of truncating them by sample mining. In addition to its simplicity and intuitiveness, extensive experiments on three real-world benchmarks demonstrate the superiority of ICE.