Goto

Collaborating Authors

 label augmentation


InjectingDomainKnowledgefromEmpirical InteratomicPotentialstoNeuralNetworksfor PredictingMaterialProperties

Neural Information Processing Systems

This limited range of interaction gives rise to the concept of an atomicenvironment. Theconsequence ofthislocality is that an infinite system can be modeled exactly using afinite periodic cell so long as asufficient number ofperiodic images surrounding itareexplicitly accounted for. It consists of configurations for elemental Ag and Au, as well as AgAu binary alloy. The dataset is generated using an activelearning strategy. This process is iterated multiple times.


Self-Supervised GANs with Label Augmentation

Neural Information Processing Systems

Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling due to the fact that their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels to be aware of both the generator distribution and the data distribution under every transformation, and then provide the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator could converge to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data augmentation GANs on both generative modeling and representation learning across benchmark datasets.


Back to Supervision: Boosting Word Boundary Detection through Frame Classification

Carnemolla, Simone, Calcagno, Salvatore, Palazzo, Simone, Giordano, Daniela

arXiv.org Artificial Intelligence

Speech segmentation at both word and phoneme levels is crucial for various speech processing tasks. It significantly aids in extracting meaningful units from an utterance, thus enabling the generation of discrete elements. In this work we propose a model-agnostic framework to perform word boundary detection in a supervised manner also employing a labels augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and only tested on TIMIT one, using state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT), as well as convolutional and convolutional recurrent networks. Our method, with the HuBERT encoder, surpasses the performance of other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye dataset and 0.7436 on the TIMIT dataset, along with R-values of 0.8489 and 0.7807, respectively. These results establish a new state-of-the-art for both datasets. Beyond the immediate task, our approach offers a robust and efficient preprocessing method for future research in audio tokenization.


Self-Supervised GANs with Label Augmentation

Neural Information Processing Systems

Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate catastrophic forgetting in the discriminator by introducing a stationary learning environment. However, the separate self-supervised tasks in existing self-supervised GANs cause a goal inconsistent with generative modeling due to the fact that their self-supervised classifiers are agnostic to the generator distribution. To address this problem, we propose a novel self-supervised GAN that unifies the GAN task with the self-supervised task by augmenting the GAN labels (real or fake) via self-supervision of data transformation. Specifically, the original discriminator and self-supervised classifier are unified into a label-augmented discriminator that predicts the augmented labels to be aware of both the generator distribution and the data distribution under every transformation, and then provide the discrepancy between them to optimize the generator. Theoretically, we prove that the optimal generator could converge to replicate the real data distribution. Empirically, we show that the proposed method significantly outperforms previous self-supervised and data augmentation GANs on both generative modeling and representation learning across benchmark datasets.


How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

Lin, Sheng-Chieh, Asai, Akari, Li, Minghan, Oguz, Barlas, Lin, Jimmy, Mehdad, Yashar, Yih, Wen-tau, Chen, Xilun

arXiv.org Artificial Intelligence

Various techniques have been developed in recent years to improve dense retrieval (DR), such as unsupervised contrastive learning and pseudo-query generation. Existing DRs, however, often suffer from effectiveness tradeoffs between supervised and zero-shot retrieval, which some argue was due to the limited model capacity. We contradict this hypothesis and show that a generalizable DR can be trained to achieve high accuracy in both supervised and zero-shot retrieval without increasing model size. In particular, we systematically examine the contrastive learning of DRs, under the framework of Data Augmentation (DA). Our study shows that common DA practices such as query augmentation with generative models and pseudo-relevance label creation using a cross-encoder, are often inefficient and sub-optimal. We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).


Towards Data-Efficient Detection Transformers

Wang, Wen, Zhang, Jing, Cao, Yang, Shen, Yongliang, Tao, Dacheng

arXiv.org Artificial Intelligence

Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show most of them suffer from significant performance drops on small-size datasets, like Cityscapes. In other words, the detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply alternating how key and value sequences are constructed in the cross-attention layer, with minimum modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improve their performance on both small-size and sample-rich datasets. Code will be made publicly available at \url{https://github.com/encounter1997/DE-DETRs}.


Improving End-To-End Modeling for Mispronunciation Detection with Effective Augmentation Mechanisms

Lo, Tien-Hong, Sung, Yao-Ting, Chen, Berlin

arXiv.org Artificial Intelligence

Recently, end-to-end (E2E) models, which allow to take spectral vector sequences of L2 (second-language) learners' utterances as input and produce the corresponding phone-level sequences as output, have attracted much research attention in developing mispronunciation detection (MD) systems. However, due to the lack of sufficient labeled speech data of L2 speakers for model estimation, E2E MD models are prone to overfitting in relation to conventional ones that are built on DNN-HMM acoustic models. To alleviate this critical issue, we in this paper propose two modeling strategies to enhance the discrimination capability of E2E MD models, each of which can implicitly leverage the phonetic and phonological traits encoded in a pretrained acoustic model and contained within reference transcripts of the training data, respectively. The first one is input augmentation, which aims to distill knowledge about phonetic discrimination from a DNN-HMM acoustic model. The second one is label augmentation, which manages to capture more phonological patterns from the transcripts of training data. A series of empirical experiments conducted on the L2-ARCTIC English dataset seem to confirm the efficacy of our E2E MD model when compared to some top-of-the-line E2E MD models and a classic pronunciation-scoring based method built on a DNN-HMM acoustic model.


Conditional Vehicle Trajectories Prediction in CARLA Urban Environment

Buhet, Thibault, Wirbel, Emilie, Perrotton, Xavier

arXiv.org Artificial Intelligence

Imitation learning is becoming more and more successful for autonomous driving. End-to-end (raw signal to command) performs well on relatively simple tasks (lane keeping and navigation). Mid-to-mid (environment abstraction to mid-level trajectory representation) or direct perception (raw signal to performance) approaches strive to handle more complex, real life environment and tasks (e.g. In this work, we show that complex urban situations can be handled with raw signal input and mid-level representation. W e build a hybrid end-to-mid approach predicting trajectories for neighbor vehicles and for the ego vehicle with a conditional navigation goal. W e propose an original architecture inspired from social pooling LSTM taking low and mid level data as input and producing trajectories as polynomials of time. W e introduce a label augmentation mechanism to get the level of generalization that is required to control a vehicle. The performance is evaluated on CARLA 0.8 benchmark, showing significant improvements over previously published state of the art. 1. Introduction Modular pipelines [32] are the most used approach to autonomous driving. The advantage is that the modules are interpretable and relatively mature, in particular on the perception side with the success of deep learning for object detection ([13, 20] among many others). However, the complexity of the interactions in the real world causes the pipeline to be also complex, especially in the planning and decision modules.


End to End Vehicle Lateral Control Using a Single Fisheye Camera

Toromanoff, Marin, Wirbel, Emilie, Wilhelm, Frédéric, Vejarano, Camilo, Perrotton, Xavier, Moutarde, Fabien

arXiv.org Machine Learning

Abstract-- Convolutional neural networks are commonly used to control the steering angle for autonomous cars. Most of the time, multiple long range cameras are used to generate lateral failure cases. In this paper we present a novel model to generate this data and label augmentation using only one short range fisheye camera. We present our simulator and how it can be used as a consistent metric for lateral end-to-end control evaluation. Experiments are conducted on a custom dataset corresponding to more than 10000 km and 200 hours of open road driving. Finally we evaluate this model on real world driving scenarios, open road and a custom test track with challenging obstacle avoidance and sharp turns. In our simulator based on real-world videos, the final model was capable of more than 99% autonomy on urban road. The ultimate goal for autonomous vehicles is to drive in any environment without any human input. To achieve this, autonomous cars have to analyze their environment using data coming from different sensors and control the car accordingly. In the most common approach, this task is cut into different modules then fed into a rule-based control algorithm which actually drives the car.