Self-supervision


Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Neural Information Processing Systems

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because speech mixes speaker traits with content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without using any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56\% and 8.24\% average reductions in EER and minDCF, respectively. Since the framework requires neither additional model training nor extra data, it is readily applicable in practice.
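The abstract does not specify the form of its Gaussian inference layers, but the general idea of a learnable transition model over a frame sequence can be illustrated with a toy linear-Gaussian (Kalman-filter-style) pass. Everything below — the identity emission model, the isotropic noise terms, and the fixed transition matrix `A` — is an illustrative assumption, not the paper's actual architecture:

```python
import numpy as np

def kalman_filter_features(feats, A, q=0.1, r=1.0):
    """One linear-Gaussian inference pass over a feature sequence.

    feats : (T, d) observed frame features
    A     : (d, d) transition matrix (learnable in a real model; fixed here)
    q, r  : process / observation noise variances (scalar, isotropic)
    Returns the filtered state mean for every frame.
    """
    T, d = feats.shape
    m = feats[0].copy()          # initial state mean
    P = np.eye(d)                # initial state covariance
    out = np.zeros_like(feats)
    for t in range(T):
        # Predict: propagate the state through the transition model.
        if t > 0:
            m = A @ m
            P = A @ P @ A.T + q * np.eye(d)
        # Update: correct with the observed frame (identity emission).
        K = P @ np.linalg.inv(P + r * np.eye(d))   # Kalman gain
        m = m + K @ (feats[t] - m)
        P = (np.eye(d) - K) @ P
        out[t] = m
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 4))
A = 0.9 * np.eye(4)              # a stable toy transition
smoothed = kalman_filter_features(feats, A)
```

In the paper's framework, several such layers with different (learned) transition dynamics would each extract a distinct speech component; this sketch shows only the mechanics of a single layer.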


From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI

Neural Information Processing Systems

Reconstructing observed images from fMRI brain recordings is challenging. Unfortunately, acquiring enough ''labeled'' {Image, fMRI} pairs (i.e., images with their corresponding fMRI responses) to span the huge space of natural images is prohibitive for many reasons. We present a novel approach which, in addition to the scarce labeled data (training pairs), allows training fMRI-to-image reconstruction networks also on unlabeled data (i.e., images without fMRI recordings, and fMRI recordings without images). The proposed model utilizes both an Encoder network (image-to-fMRI) and a Decoder network (fMRI-to-image). Concatenating these two networks back-to-back (Encoder-Decoder & Decoder-Encoder) allows augmenting the training data with both types of unlabeled data. Importantly, it allows training on the unlabeled test-fMRI data.
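The back-to-back training objective can be illustrated with a toy linear sketch: supervised losses on the scarce pairs, plus two self-supervised cycle losses on unlabeled data. The linear maps, dimensions, and plain mean-squared-error terms here are illustrative assumptions, not the paper's actual networks or loss weighting:

```python
import numpy as np

rng = np.random.default_rng(0)
d_img, d_fmri = 16, 8

# Toy linear Encoder (image -> fMRI) and Decoder (fMRI -> image).
E = rng.normal(scale=0.1, size=(d_fmri, d_img))
D = rng.normal(scale=0.1, size=(d_img, d_fmri))

def total_loss(img_pairs, fmri_pairs, imgs_unlab, fmri_unlab):
    # Supervised terms on the scarce {image, fMRI} training pairs.
    sup_dec = np.mean((fmri_pairs @ D.T - img_pairs) ** 2)   # Decoder
    sup_enc = np.mean((img_pairs @ E.T - fmri_pairs) ** 2)   # Encoder
    # Self-supervised cycle on images without fMRI (Encoder-Decoder) ...
    cyc_img = np.mean(((imgs_unlab @ E.T) @ D.T - imgs_unlab) ** 2)
    # ... and on fMRI without images, including the unlabeled
    # test-fMRI (Decoder-Encoder).
    cyc_fmri = np.mean(((fmri_unlab @ D.T) @ E.T - fmri_unlab) ** 2)
    return sup_dec + sup_enc + cyc_img + cyc_fmri

loss = total_loss(rng.normal(size=(4, d_img)), rng.normal(size=(4, d_fmri)),
                  rng.normal(size=(32, d_img)), rng.normal(size=(16, d_fmri)))
```

The key point the sketch captures is that the two cycle terms need only one modality each, which is what lets the unlabeled (and test) data contribute to training.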


DeepUSPS: Deep Robust Unsupervised Saliency Prediction via Self-supervision

Neural Information Processing Systems

Training deep neural networks (DNNs) for salient object detection in images requires expensive high-quality labels. Alternative unsupervised approaches rely on careful selection of multiple handcrafted saliency methods to generate noisy pseudo-ground-truth labels. In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage refines the noisy pseudo labels generated by the different handcrafted methods. Each handcrafted method is substituted by a deep network that learns to generate the pseudo labels. These labels are refined incrementally over multiple iterations via our proposed self-supervision technique. In the second stage, the refined labels produced by the multiple networks, each representing one saliency method, are used to train the actual saliency detection network. We show that this self-learning procedure outperforms all existing unsupervised methods across different datasets. Results are even comparable to those of fully-supervised state-of-the-art approaches.
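The first-stage idea — iteratively pulling each method's noisy pseudo labels toward a cross-method consensus — can be sketched with a minimal numpy toy. Using a pixel-wise majority vote as the self-supervision target and a simple averaging update is an illustrative assumption; the paper refines the labels with deep networks, not this closed-form rule:

```python
import numpy as np

def refine_pseudo_labels(maps, n_iters=3):
    """maps : (M, H, W) noisy binary saliency maps from M handcrafted methods.
    Each iteration moves every map toward the current cross-method
    consensus, mimicking an inter-method self-supervision signal.
    """
    maps = maps.astype(float)
    for _ in range(n_iters):
        consensus = maps.mean(axis=0)             # pixel-wise agreement
        target = (consensus > 0.5).astype(float)  # majority-vote pseudo-GT
        maps = 0.5 * maps + 0.5 * target          # pull each map to target
    return (maps.mean(axis=0) > 0.5).astype(np.uint8)

rng = np.random.default_rng(0)
clean = (rng.random((8, 8)) > 0.7).astype(float)
# Five noisy copies of the clean map, each with ~15% flipped pixels.
noisy = np.stack([np.abs(clean - (rng.random((8, 8)) > 0.85))
                  for _ in range(5)])
refined = refine_pseudo_labels(noisy)
```

The refined map is then what the second stage would treat as training targets for the actual saliency network.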


Adversarially Robust 3D Point Cloud Recognition Using Self-Supervisions

Neural Information Processing Systems

The robustness of 3D deep learning models against adversarial attacks is a major consideration. In this paper, we systematically study the impact of various self-supervised learning proxy tasks on different architectures and threat models for 3D point clouds with adversarial training. Specifically, we study MLP-based (PointNet), convolution-based (DGCNN), and transformer-based (PCT) 3D architectures. Through extensive experimentation, we demonstrate that appropriate applications of self-supervision can significantly enhance the robustness of 3D point cloud recognition, achieving considerable improvements over the standard adversarial training baseline. Our analysis reveals that local feature learning is desirable for adversarial robustness in point clouds, since it limits the adversarial propagation between point-level input perturbations and the model's final output. This insight also explains the success of DGCNN and the jigsaw proxy task in achieving stronger 3D adversarial robustness.
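The combination the abstract describes — adversarial training plus a self-supervised proxy loss — can be sketched with a toy FGSM-style perturbation of a point cloud against a linear scorer. The mean-pooling "model", the squared loss, and the rotation-prediction proxy term below are all illustrative stand-ins, not the paper's PointNet/DGCNN/PCT setups or its PGD attack:

```python
import numpy as np

rng = np.random.default_rng(0)

def fgsm_points(points, w, y, eps=0.05):
    """One FGSM-style perturbation of a point cloud (N, 3) against a
    toy scorer: score = mean(points) . w, squared loss to label y."""
    feat = points.mean(axis=0)                  # crude global pooling
    score = feat @ w
    # loss = (score - y)^2 ; each point contributes 1/N to the mean,
    # so every point shares the same gradient direction.
    grad = 2 * (score - y) * w / len(points)
    return points + eps * np.sign(np.broadcast_to(grad, points.shape))

def proxy_rotation_loss(angle_pred, true_angle):
    # Self-supervised proxy task: predict the rotation applied to
    # the input cloud (a stand-in for jigsaw-style proxies).
    return (angle_pred - true_angle) ** 2

pts = rng.normal(size=(64, 3))
w = rng.normal(size=3)
adv = fgsm_points(pts, w, y=1.0)
# Joint adversarial-training objective: task loss on the adversarial
# cloud plus a weighted proxy-task loss.
total_loss = ((adv.mean(axis=0) @ w - 1.0) ** 2
              + 0.5 * proxy_rotation_loss(angle_pred=0.1, true_angle=0.0))
```

The sketch shows the training-signal structure only; the paper's finding is about *which* proxy tasks and architectures (local feature learners like DGCNN with jigsaw) make this combination effective.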


From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI

Neural Information Processing Systems

We thank the reviewers for their comments and endorsements. Below are our answers to the main questions/concerns. R1: Training on test-fMRI samples - not convinced the approach is valid. We understand the reviewer's concern. Note, however, that our "training on test data" refers only to training on the unlabeled test-fMRI recordings, never on their corresponding test images. We realize that the distinction between the test-fMRI (which is the input to the network) and the test images (the ground truth) is confusing, and we will clarify it.


MSDA: Combining Pseudo-labeling and Self-Supervision for Unsupervised Domain Adaptation in ASR

Damianos, Dimitrios, Paraskevopoulos, Georgios, Potamianos, Alexandros

arXiv.org Artificial Intelligence

In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving results that significantly outperform previous state-of-the-art methods and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of a cascading approach when combining self-supervision with self-training.
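The cascaded pseudo-labeling idea — a source-trained teacher labels unlabeled target-domain data, a student refits to those pseudo labels and becomes the next teacher — can be shown with a deliberately tiny 1-D toy. The threshold "models" and the Gaussian domain shift below are illustrative assumptions only; MSDA operates on ASR models, not threshold classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stand-in: a source-domain model, here a decision threshold
# fit on labeled source data.
src_x = rng.normal(size=200)
src_y = (src_x > 0).astype(int)
teacher_thr = src_x[src_y == 1].min()   # crude learned threshold

# Stage 2: adapt to a shifted, unlabeled target domain via
# iterated pseudo-labeling (self-training).
tgt_x = rng.normal(loc=0.5, size=500)   # unlabeled target data
for _ in range(3):
    pseudo = (tgt_x > teacher_thr).astype(int)
    # The "student" refits its threshold to the pseudo-labeled target
    # data, then becomes the next iteration's teacher (the cascade).
    teacher_thr = 0.5 * (tgt_x[pseudo == 1].mean()
                         + tgt_x[pseudo == 0].mean())
```

The abstract's ablation point — that self-supervision and self-training must be combined in stages rather than jointly — is about *when* each signal is applied in such a loop, which this sketch only gestures at.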


Learning to Edit Visual Programs with Self-Supervision

Neural Information Processing Systems

We design a system that learns how to edit visual programs. Our edit network consumes a complete input program and a visual target. From this input, we task our network with predicting a local edit operation that could be applied to the input program to improve its similarity to the target. In order to apply this scheme for domains that lack program annotations, we develop a self-supervised learning approach that integrates this edit network into a bootstrapped finetuning loop along with a network that predicts entire programs in one-shot. Our joint finetuning scheme, when coupled with an inference procedure that initializes a population from the one-shot model and evolves members of this population with the edit network, helps to infer more accurate visual programs.
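The inference procedure — initialize a population from the one-shot model, then evolve its members with the edit network — can be sketched as a toy evolutionary loop. Here "programs" are integers, the "visual target" is a number, and `one_shot`/`edit` are hypothetical stand-ins for the two learned networks:

```python
import random

def one_shot(target, rng):
    # Stand-in for the one-shot network: a rough initial program guess.
    return target + rng.randint(-10, 10)

def edit(program, target):
    # Stand-in for the edit network: one local edit operation that
    # should improve the program's similarity to the target.
    if program < target:
        return program + 1
    if program > target:
        return program - 1
    return program

def infer(target, pop_size=8, steps=15, seed=0):
    rng = random.Random(seed)
    # Initialize a population from the one-shot model ...
    population = [one_shot(target, rng) for _ in range(pop_size)]
    for _ in range(steps):
        # ... and evolve every member with the edit network,
        # ranked by similarity to the target.
        population = sorted((edit(p, target) for p in population),
                            key=lambda p: abs(p - target))
    return min(population, key=lambda p: abs(p - target))

best = infer(target=42)
```

In the real system the edit network also has to *choose* among many possible local edits and is trained jointly with the one-shot model in a bootstrapped finetuning loop; the toy only shows the population-plus-edits search structure.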


Reviews: From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI

Neural Information Processing Systems

The paper's writing and figures are of very high clarity and quality. The method is novel, and the basic innovation is the new objective function, whose encoder-decoder dynamics are intriguing. The area of research tackles the difficult problem of reconstructing images from human brain activity with recent machine learning and neural network techniques, which is a strong fit for the NeurIPS conference. The results in Figure 4(e) are impressive and look like a convincing improvement over Shen et al. 2019, as they do not need a generative model prior at all but instead train an end-to-end architecture. The only ImageNet statistics in their network are pretrained low-level AlexNet features (thus also further lowering the potential influence of category set statistics).


Reviews: From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI

Neural Information Processing Systems

Dear authors, congratulations on the acceptance. This paper was discussed extensively, and the reviewers provided multiple comments and feedback; please take the feedback and requests of all the reviewers into account when preparing your final manuscript. In particular, it would be important to clearly describe in what settings (i.e.

