Inductive Learning
Unravelling Interlanguage Facts via Explainable Machine Learning
Berti, Barbara, Esuli, Andrea, Sebastiani, Fabrizio
Native language identification (NLI) is the task of training (via supervised machine learning) a classifier that guesses the native language of the author of a text. This task has been extensively researched in the last decade, and the performance of NLI systems has steadily improved over the years. We focus on a different facet of the NLI task, i.e., that of analysing the internals of an NLI classifier trained by an explainable machine learning algorithm, in order to obtain explanations of its classification decisions, with the ultimate goal of gaining insight into which linguistic phenomena "give a speaker's native language away". We use this perspective in order to tackle both NLI and a (much less researched) companion task, i.e., guessing whether a text has been written by a native or a non-native speaker. Using three datasets of different provenance (two datasets of English learners' essays and a dataset of social media posts), we investigate which kinds of linguistic traits (lexical, morphological, syntactic, and statistical) are most effective for solving our two tasks, that is, which are most indicative of a speaker's L1. We also present two case studies, one on Spanish and one on Italian learners of English, in which we analyse individual linguistic traits that the classifiers have singled out as most important for spotting these L1s. Overall, our study shows that the use of explainable machine learning can be a valuable tool for th
DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning
Chen, Qianglong, Li, Feng-Lin, Xu, Guohai, Yan, Ming, Zhang, Ji, Zhang, Yin
Although pre-trained language models (PLMs) have achieved state-of-the-art performance on various natural language processing (NLP) tasks, they have been shown to lack knowledge when dealing with knowledge-driven tasks. Despite the many efforts made to inject knowledge into PLMs, this problem remains open. To address the challenge, we propose DictBERT, a novel approach that enhances PLMs with dictionary knowledge, which is easier to acquire than a knowledge graph (KG). During pre-training, we present two novel pre-training tasks to inject dictionary knowledge into PLMs via contrastive learning: dictionary entry prediction and entry description discrimination. In fine-tuning, we use the pre-trained DictBERT as a plugin knowledge base (KB) to retrieve implicit knowledge for identified entries in an input sequence, and infuse the retrieved knowledge into the input to enhance its representation via a novel extra-hop attention mechanism. We evaluate our approach on a variety of knowledge-driven and language understanding tasks, including NER, relation extraction, CommonsenseQA, OpenBookQA and GLUE. Experimental results demonstrate that our model can significantly improve typical PLMs: it gains a substantial improvement of 0.5%, 2.9%, 9.0%, 7.1% and 3.3% on BERT-large respectively, and is also effective on RoBERTa-large.
How Self-Supervised Learning Can be Used for Fine-Grained Head Pose Estimation?
Pourmirzaei, Mahdi, Esmaili, Farzaneh, Mousavi, Ebrahim, Karamizadeh, Sasan, Shojaeilangari, Seyedehsamaneh
The cost of head pose labeling is the main challenge in improving fine-grained Head Pose Estimation (HPE). Although Self-Supervised Learning (SSL) can be a solution to the lack of large amounts of labeled data, its efficacy for fine-grained HPE is not yet fully explored. This study aims to assess the usage of SSL in fine-grained HPE based on two scenarios: (1) using SSL for weight pre-training, and (2) leveraging auxiliary SSL losses alongside HPE. We design a Hybrid Multi-Task Learning (HMTL) architecture based on the ResNet50 backbone in which both strategies are applied. Our experimental results reveal that the combination of both scenarios is the best for HPE. Together, they reduce the average error rate by up to 23.1% on AFLW2000 and 14.2% on the BIWI benchmark compared to the baseline. Moreover, it is found that some SSL methods are more suitable for transfer learning, while others may be effective when they are considered as auxiliary tasks incorporated into supervised learning. Finally, it is shown that by using the proposed HMTL architecture, the average error is reduced with different types of initial weights: random, ImageNet and SSL pre-trained weights.
Adaptive Second Order Coresets for Data-efficient Machine Learning
Pooladzandi, Omead, Davini, David, Mirzasoleiman, Baharan
Training machine learning models on massive datasets incurs substantial computational costs. To alleviate such costs, there has been a sustained effort to develop data-efficient training methods that can carefully select subsets of the training examples that generalize on par with the full training data. However, existing methods are limited in providing theoretical guarantees for the quality of the models trained on the extracted subsets, and may perform poorly in practice. We propose AdaCore, a method that leverages the geometry of the data to extract subsets of the training examples for efficient machine learning. The key idea behind our method is to dynamically approximate the curvature of the loss function via an exponentially-averaged estimate of the Hessian to select weighted subsets (coresets) that provide a close approximation of the full gradient preconditioned with the Hessian. We prove rigorous guarantees for the convergence of various first and second-order methods applied to the subsets chosen by AdaCore. Our extensive experiments show that AdaCore extracts higher-quality coresets than the baselines and speeds up training of convex and non-convex machine learning models, such as logistic regression and neural networks, by more than 2.9x compared with training on the full data and 4.5x compared with random subsets.
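The idea summarized in this abstract (an exponentially averaged curvature estimate, gradients preconditioned by it, and a subset chosen to track the full preconditioned gradient) can be illustrated with a toy sketch. This is not the AdaCore algorithm: the function names, the diagonal-Hessian simplification, and the unweighted greedy selection are all assumptions made for illustration only.

```python
# Toy sketch only; all names here are hypothetical and this is NOT the
# authors' AdaCore implementation. It assumes a diagonal Hessian estimate
# and an unweighted greedy selection rule.

def ema_update(previous, current, beta=0.9):
    """Exponential moving average of two equal-length lists, element-wise."""
    return [beta * p + (1.0 - beta) * c for p, c in zip(previous, current)]


def precondition(gradient, hessian_diag, eps=1e-8):
    """Scale each gradient coordinate by the inverse of the curvature estimate."""
    return [g / (abs(h) + eps) for g, h in zip(gradient, hessian_diag)]


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(u, v):
    """Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def greedy_subset(vectors, k):
    """Greedily pick k indices whose mean stays close to the mean of all vectors."""
    target = mean(vectors)
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(len(vectors)) if i not in chosen]
        best = min(
            candidates,
            key=lambda i: distance(mean([vectors[j] for j in chosen + [i]]), target),
        )
        chosen.append(best)
    return chosen


# Per-example gradients and a running diagonal curvature estimate (made-up numbers).
per_example_grads = [[2.0, 4.0], [4.0, 8.0], [6.0, 12.0]]
hessian_diag = [2.0, 4.0]
preconditioned = [precondition(g, hessian_diag) for g in per_example_grads]
subset = greedy_subset(preconditioned, k=1)
```

In this toy setting the single selected example is the one whose preconditioned gradient lies closest to the average over all examples. A real coreset method would also assign weights to the selected examples, as the abstract notes.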
OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning
Rizve, Mamshad Nayeem, Kardan, Navid, Khan, Salman, Khan, Fahad Shahbaz, Shah, Mubarak
Semi-supervised learning (SSL) is one of the dominant approaches to address the annotation bottleneck of supervised learning. Recent SSL methods can effectively leverage a large repository of unlabeled data to improve performance while relying on a small set of labeled data. One common assumption in most SSL methods is that the labeled and unlabeled data are from the same data distribution. However, this is hardly the case in many real-world scenarios, which limits their applicability. In this work, instead, we attempt to solve the challenging open-world SSL problem that does not make such an assumption. In the open-world SSL problem, the objective is to recognize samples of known classes, and simultaneously detect and cluster samples belonging to novel classes present in unlabeled data. This work introduces OpenLDN, which utilizes a pairwise similarity loss to discover novel classes. Using a bi-level optimization rule, this pairwise similarity loss exploits the information available in the labeled set to implicitly cluster novel class samples, while simultaneously recognizing samples from known classes. After discovering novel classes, OpenLDN transforms the open-world SSL problem into a standard SSL problem to achieve additional performance gains using existing SSL methods. Our extensive experiments demonstrate that OpenLDN outperforms the current state-of-the-art methods on multiple popular classification benchmarks while providing a better accuracy/training time trade-off.
RCA: Ride Comfort-Aware Visual Navigation via Self-Supervised Learning
Yao, Xinjie, Zhang, Ji, Oh, Jean
Under shared autonomy, wheelchair users expect vehicles to provide safe and comfortable rides while following users' high-level navigation plans. To find such a path, vehicles negotiate different terrains and assess their traversal difficulty. Most prior works model surroundings either through geometric representations or semantic classifications, which do not reflect perceived motion intensity and ride comfort in downstream navigation tasks. We propose to model ride comfort explicitly in traversability analysis using proprioceptive sensing. We develop a self-supervised learning framework to predict a traversability costmap from first-person-view images by leveraging vehicle states as training signals. Our approach estimates, based on terrain appearance, how the vehicle would feel when traversing the terrain. We then show that our navigation system provides human-preferred ride comfort through robot experiments together with a human evaluation study.
Efficient Personalized Speech Enhancement through Self-Supervised Learning
This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their enhancement function towards a particular speaker's voice, expecting to solve a narrower problem. Hence, specialists are capable of achieving better performance in addition to reducing computational complexity. However, naive personalization methods can require clean speech from the target user, which is inconvenient to acquire, e.g., due to subpar recording conditions. To this end, we pose personalization as either a zero-shot task, in which no additional clean speech of the target speaker is used for training, or a few-shot learning task, in which the goal is to minimize the duration of the clean speech used for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both zero- and few-shot personalization tasks. The proposed methods are designed to learn the personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) without knowing the corresponding clean sources. Our experiments investigate three different self-supervised learning mechanisms. The results show that self-supervised models achieve zero-shot and few-shot personalization using fewer model parameters and less clean data from the target user, achieving the data efficiency and model compression goals.
The Implications of the No-Free-Lunch Theorems for Meta-induction
The important recent book by G. Schurz appreciates that the no-free-lunch theorems (NFL) have major implications for the problem of (meta) induction. Here I review the NFL theorems, emphasizing that they do not only concern the case where there is a uniform prior: they prove that there are "as many priors" (loosely speaking) for which any induction algorithm $A$ out-generalizes some induction algorithm $B$ as vice-versa. Importantly though, in addition to the NFL theorems, there are many free lunch theorems. In particular, the NFL theorems can only be used to compare the marginal expected performance of an induction algorithm $A$ with the marginal expected performance of an induction algorithm $B$. There is a rich set of free lunches which instead concern the statistical correlations among the generalization errors of induction algorithms. As I describe, the meta-induction algorithms that Schurz advocates as a "solution to Hume's problem" are just an example of such a free lunch based on correlations among the generalization errors of induction algorithms. I end by pointing out that the prior that Schurz advocates, which is uniform over bit frequencies rather than bit patterns, is contradicted by thousands of experiments in statistical physics and by the great success of the maximum entropy procedure in inductive inference.
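A small enumeration can make the simplest NFL statement concrete. The sketch below covers only the uniform-prior special case (every binary target function equally likely), which the abstract stresses is not the whole story. The two toy algorithms and all function names are hypothetical choices made for illustration.

```python
from itertools import product

# Toy illustration of the uniform-prior special case only; the algorithms
# and names below are hypothetical, chosen just to make the point concrete.

def majority(train_labels):
    """Predict the majority training label (ties resolve to 0)."""
    return 1 if 2 * sum(train_labels) > len(train_labels) else 0


def anti_majority(train_labels):
    """Predict the opposite of the majority training label."""
    return 1 - majority(train_labels)


def average_ots_error(algorithm, n_train=2, n_test=1):
    """Average off-training-set error over every binary target function."""
    errors = 0
    predictions = 0
    for target in product([0, 1], repeat=n_train + n_test):
        train, test = target[:n_train], target[n_train:]
        guess = algorithm(train)
        errors += sum(guess != label for label in test)
        predictions += len(test)
    return errors / predictions


error_majority = average_ots_error(majority)
error_anti = average_ots_error(anti_majority)
```

Both averages come out identical, because for any fixed training labels the unseen labels are 0 and 1 equally often across all target functions. The sketch compares only marginal expected performance; it says nothing about the correlations between the two algorithms' errors, which is where the abstract locates the free lunches.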
Machine Learning for Natural Language Processing
We will cover methods from the machine learning literature that we view as an important toolset for empirical economics. We will discuss supervised learning methods for regression and classification, unsupervised learning methods, as well as text-analysis applications. Throughout the course, we highlight the intersection of ML and econometrics. We will use Python for algorithm implementation.
On Missing Labels, Long-tails and Propensities in Extreme Multi-label Classification
Schultheis, Erik, Wydmuch, Marek, Babbar, Rohit, Dembczyński, Krzysztof
The propensity model introduced by Jain et al. (2016) has become a standard approach for dealing with missing and long-tail labels in extreme multi-label classification (XMLC). In this paper, we critically re-examine this approach, showing that despite its theoretical soundness, its application in contemporary XMLC works is debatable. We exhaustively discuss the flaws of the propensity-based approach, and present several recipes, some of them related to solutions used in search engines and recommender systems, that we believe constitute promising alternatives to be followed in XMLC.
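For orientation, the propensity model referred to here is usually written in the XMLC literature as p_l = 1 / (1 + C exp(-A log(N_l + B))) with C = (log N - 1)(B + 1)^A, where N_l is the number of training points carrying label l and N is the total number of training points. The sketch below assumes that form and the commonly quoted defaults A = 0.55 and B = 1.5. The function names are hypothetical, and the abstract itself does not state the formula.

```python
import math

# Sketch under stated assumptions: the functional form and the default
# constants (a=0.55, b=1.5) are as commonly quoted in the XMLC literature,
# not taken from the abstract above. Function names are hypothetical.

def propensity(label_count, num_points, a=0.55, b=1.5):
    """Estimated probability that a truly relevant label is actually observed."""
    c = (math.log(num_points) - 1.0) * (b + 1.0) ** a
    return 1.0 / (1.0 + c * math.exp(-a * math.log(label_count + b)))


def ps_precision_at_k(ranked_labels, relevant, label_counts, num_points, k):
    """Propensity-scored precision@k: each hit is weighted by 1 / propensity."""
    score = 0.0
    for label in ranked_labels[:k]:
        if label in relevant:
            score += 1.0 / propensity(label_counts[label], num_points)
    return score / k


# Rare (tail) labels receive lower propensities and therefore larger weights.
tail = propensity(label_count=1, num_points=1000)
head = propensity(label_count=500, num_points=1000)
```

Under this form, a hit on a tail label counts for more than a hit on a head label, which is the reweighting behaviour whose use the paper calls into question.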