Inductive Learning
Long-Tail Theory under Gaussian Mixtures
Bolatov, Arman, Tezekbayev, Maxat, Melnykov, Igor, Pak, Artur, Nikoulina, Vassilina, Assylbekov, Zhenisbek
We suggest a simple Gaussian mixture model for data generation that complies with Feldman's long tail theory (2020). We demonstrate that a linear classifier cannot decrease the generalization error below a certain level in the proposed model, whereas a nonlinear classifier with a memorization capacity can. This confirms that for long-tailed distributions, rare training examples must be considered for optimal generalization to new data. Finally, we show that the performance gap between linear and nonlinear models can be lessened as the tail becomes shorter in the subpopulation frequency distribution, as confirmed by experiments on synthetic and real data.
Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data
Yuan, Hang, Chan, Shing, Creagh, Andrew P., Tong, Catherine, Clifton, David A., Doherty, Aiden
Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
Self-Supervised Learning for Audio-Based Emotion Recognition
Nimitsurachat, Peranut, Washington, Peter
Emotion recognition models using audio input data can enable the development of interactive systems with applications in mental healthcare, marketing, gaming, and social media analysis. While the field of affective computing using audio data is rich, a major barrier to achieve consistently high-performance models is the paucity of available training labels. Self-supervised learning (SSL) is a family of methods which can learn despite a scarcity of supervised labels by predicting properties of the data itself. To understand the utility of self-supervised learning for audio-based emotion recognition, we have applied self-supervised learning pre-training to the classification of emotions from the CMU- MOSEI's acoustic modality. Unlike prior papers that have experimented with raw acoustic data, our technique has been applied to encoded acoustic data. Our model is first pretrained to uncover the randomly-masked timestamps of the acoustic data. The pre-trained model is then fine-tuned using a small sample of annotated data. The performance of the final model is then evaluated via several evaluation metrics against a baseline deep learning model with an identical backbone architecture. We find that self-supervised learning consistently improves the performance of the model across all metrics. This work shows the utility of self-supervised learning for affective computing, demonstrating that self-supervised learning is most useful when the number of training examples is small, and that the effect is most pronounced for emotions which are easier to classify such as happy, sad and anger. This work further demonstrates that self-supervised learning works when applied to embedded feature representations rather than the traditional approach of pre-training on the raw input space.
Moderately Supervised Learning: Definition, Framework and Generality
Learning with supervision has achieved remarkable success in numerous artificial intelligence (AI) applications. In the current literature, by referring to the properties of the labels prepared for the training dataset, learning with supervision is categorized as supervised learning (SL) and weakly supervised learning (WSL). SL concerns the situation where the training data set is assigned with ideal (complete, exact and accurate) labels, while WSL concerns the situation where the training data set is assigned with non-ideal (incomplete, inexact or inaccurate) labels. However, various solutions for SL tasks have shown that the given labels are not always easy to learn, and the transformation from the given labels to easy-to-learn targets can significantly affect the performance of the final SL solutions. Without considering the properties of the transformation from the given labels to easy-to-learn targets, the definition of SL conceals some details that can be critical to building the appropriate solutions for specific SL tasks. Thus, for engineers in the AI application field, it is desirable to reveal these details systematically. This article attempts to achieve this goal by expanding the categorization of SL and investigating the sub-type moderately supervised learning (MSL) that concerns the situation where the given labels are ideal, but due to the simplicity in annotation, careful designs are required to transform the given labels into easy-to-learn targets. From the perspectives of the definition, framework and generality, we conceptualize MSL to present a complete fundamental basis to systematically analyse MSL tasks. At meantime, revealing the relation between the conceptualization of MSL and the mathematicians' vision, this paper as well establishes a tutorial for AI application engineers to refer to viewing a problem to be solved from the mathematicians' vision.
Learning Navigational Visual Representations with Semantic Map Supervision
Hong, Yicong, Zhou, Yang, Zhang, Ruiyi, Dernoncourt, Franck, Bui, Trung, Gould, Stephen, Tan, Hao
Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego$^2$-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
Hallucination Improves the Performance of Unsupervised Visual Representation Learning
Wu, Jing, Hobbs, Jennifer, Hovakimyan, Naira
Contrastive learning models based on Siamese structure have demonstrated remarkable performance in self-supervised learning. Such a success of contrastive learning relies on two conditions, a sufficient number of positive pairs and adequate variations between them. If the conditions are not met, these frameworks will lack semantic contrast and be fragile on overfitting. To address these two issues, we propose Hallucinator that could efficiently generate additional positive samples for further contrast. The Hallucinator is differentiable and creates new data in the feature space. Thus, it is optimized directly with the pre-training task and introduces nearly negligible computation. Moreover, we reduce the mutual information of hallucinated pairs and smooth them through non-linear operations. This process helps avoid over-confident contrastive learning models during the training and achieves more transformation-invariant feature embeddings. Remarkably, we empirically prove that the proposed Hallucinator generalizes well to various contrastive learning models, including MoCoV1&V2, SimCLR and SimSiam. Under the linear classification protocol, a stable accuracy gain is achieved, ranging from 0.3% to 3.0% on CIFAR10&100, Tiny ImageNet, STL-10 and ImageNet. The improvement is also observed in transferring pre-train encoders to the downstream tasks, including object detection and segmentation.
IRS whistleblower: 'Independent attorney' needed to fully execute Hunter Biden investigation
IRS Agent Joseph Ziegler joins'Special Report' to respond to critiques of hearing, letters from Del. prosecutor. The IRS special agent-turned-whistleblower formerly known as "Mr. X." spoke out to Fox News on Friday following an at-times contentious congressional hearing earlier this week. Joseph Ziegler, who came forward to Congress along with his colleague, Gary Shapley, said his team uncovered "a ton of evidence" that proved Hunter Biden allegedly willfully evaded or fraudulently filed his taxes, which set him apart from a typical IRS investigatory subject that would be faced with civil fines. Ziegler also told "Special Report" the ultimate reason he came forward as a whistleblower was that he saw many instances where federal officials were not following proper procedures.
Robust Visual Question Answering: Datasets, Methods, and Future Challenges
Ma, Jie, Wang, Pinghui, Kong, Dechen, Wang, Zewei, Liu, Jun, Pei, Hongbin, Zhao, Junzhou
Abstract--Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance the VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging fashion. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these datasets. Thirdly, we propose a typology that presents the development process, similarities and differences, robustness comparison, and technical features of existing debiasing methods. Furthermore, we analyze and discuss the robustness of representative vision-and-language pre-training models on VQA. Finally, through a thorough review of the available literature and experimental analysis, we discuss the key areas for future research from various viewpoints. Question Answering (VQA) aims to build intelligent machines that are able to provide a natural views. Second, a variety of VQA methods have language answer accurately given an image and a natural been proposed, which can be classified into three groups language question about the image [1].
Distribution Shift Matters for Knowledge Distillation with Webly Collected Images
Tang, Jialiang, Chen, Shuo, Niu, Gang, Sugiyama, Masashi, Gong, Chen
Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to some privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches proposed to collect training instances from the Internet. However, most of them have ignored the common distribution shift between the instances from original training data and webly collected data, affecting the reliability of the trained student network. To solve this problem, we propose a novel method dubbed ``Knowledge Distillation between Different Distributions" (KD$^{3}$), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network. Subsequently, we align both the weighted features and classifier parameters of the two networks for knowledge memorization. Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Intensive experiments on various benchmark datasets demonstrate that our proposed KD$^{3}$ can outperform the state-of-the-art data-free knowledge distillation approaches.
A picture of the space of typical learnable tasks
Ramesh, Rahul, Mao, Jialin, Griniasty, Itay, Yang, Rubing, Teoh, Han Kheng, Transtrum, Mark, Sethna, James P., Chaudhari, Pratik
We develop information geometric techniques to understand the representations learned by deep networks when they are trained on different tasks using supervised, meta-, semi-supervised and contrastive learning. We shed light on the following phenomena that relate to the structure of the space of tasks: (1) the manifold of probabilistic models trained on different tasks using different representation learning methods is effectively low-dimensional; (2) supervised learning on one task results in a surprising amount of progress even on seemingly dissimilar tasks; progress on other tasks is larger if the training task has diverse classes; (3) the structure of the space of tasks indicated by our analysis is consistent with parts of the Wordnet phylogenetic tree; (4) episodic meta-learning algorithms and supervised learning traverse different trajectories during training but they fit similar models eventually; (5) contrastive and semi-supervised learning methods traverse trajectories similar to those of supervised learning. We use classification tasks constructed from the CIFAR-10 and Imagenet datasets to study these phenomena.