Yu, Ha-Jin
MR-RawNet: Speaker verification system with multiple temporal resolutions for variable duration utterances using raw waveforms
Kim, Seung-bin, Lim, Chan-yeong, Heo, Jungwoo, Kim, Ju-ho, Shin, Hyun-seo, Koo, Kyo-Won, Yu, Ha-Jin
In speaker verification systems, short utterances present a persistent challenge: performance degrades primarily because they carry insufficient phonetic information to characterize the speakers. To overcome this obstacle, we propose a novel structure, MR-RawNet, designed to enhance the robustness of speaker verification systems against variable duration utterances using raw waveforms. MR-RawNet extracts time-frequency representations from raw waveforms via a multi-resolution feature extractor that optimally adjusts both temporal and spectral resolutions simultaneously. Furthermore, we apply a multi-resolution attention block that focuses on diverse and extensive temporal contexts, ensuring robustness against changes in utterance length. The experimental results on the VoxCeleb1 dataset demonstrate that MR-RawNet handles utterances of variable duration better than other raw waveform-based systems.
NM-FlowGAN: Modeling sRGB Noise with a Hybrid Approach based on Normalizing Flows and Generative Adversarial Networks
Han, Young Joo, Yu, Ha-Jin
Modeling and synthesizing real sRGB noise is crucial for various low-level vision tasks. The distribution of real sRGB noise is highly complex and affected by a multitude of factors, making its accurate modeling extremely challenging. Therefore, recent studies have proposed methods that employ data-driven generative models, such as generative adversarial networks (GAN) and Normalizing Flows. These studies achieve more accurate modeling of sRGB noise compared to traditional noise modeling methods. However, there are performance limitations due to the inherent characteristics of each generative model. To address this issue, we propose NM-FlowGAN, a hybrid approach that exploits the strengths of both GAN and Normalizing Flows. We simultaneously employ a pixel-wise noise modeling network based on Normalizing Flows, and spatial correlation modeling networks based on GAN. In our experiments, our NM-FlowGAN outperforms other baselines on the sRGB noise synthesis task. Moreover, the denoising neural network, trained with synthesized image pairs from our model, also shows superior performance compared to other baselines. Our code is available at: https://github.com/YoungJooHan/NM-FlowGAN
HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods
Shin, Hyun-seo, Heo, Jungwoo, Kim, Ju-ho, Lim, Chan-yeong, Kim, Wonbin, Yu, Ha-Jin
Audio deepfake detection (ADD) is the task of detecting spoofing attacks generated by text-to-speech or voice conversion systems. Spoofing evidence, which helps to distinguish between spoofed and bona-fide utterances, might exist either locally or globally in the input features. The Conformer, which combines Transformer and CNN components, has a structure suited to capturing both. However, since the Conformer was designed for sequence-to-sequence tasks, its direct application to ADD tasks may be sub-optimal. To tackle this limitation, we propose HM-Conformer by adopting two components: (1) a hierarchical pooling method that progressively reduces the sequence length to eliminate duplicated information, and (2) a multi-level classification token aggregation method that uses classification tokens to gather information from different blocks. Owing to these components, HM-Conformer can efficiently detect spoofing evidence by processing various sequence lengths and aggregating them. In experiments on the ASVspoof 2021 Deepfake dataset, HM-Conformer achieved a 15.71% EER, showing competitive performance compared to recent systems.
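The two components can be pictured with a toy sketch. It is not the HM-Conformer implementation: the Conformer blocks between pooling steps are omitted, average pooling with a factor of 2 is an assumption, and `mean_token` is a hypothetical stand-in for a learned classification token. It only shows the sequence shrinking level by level while one summary per level is collected and concatenated.

```python
from typing import List, Tuple

Frame = List[float]


def avg_pool(seq: List[Frame], factor: int = 2) -> List[Frame]:
    """Average non-overlapping groups of `factor` frames (assumed pooling type)."""
    pooled = []
    for start in range(0, len(seq), factor):
        group = seq[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def mean_token(seq: List[Frame]) -> Frame:
    """Hypothetical stand-in for a classification token: the mean over time."""
    dim = len(seq[0])
    return [sum(f[d] for f in seq) / len(seq) for d in range(dim)]


def hierarchical_summary(seq: List[Frame], levels: int = 3) -> Tuple[List[int], Frame]:
    """Collect one token per level, shortening the sequence after each level."""
    lengths, tokens = [], []
    for _ in range(levels):
        lengths.append(len(seq))
        tokens.append(mean_token(seq))
        seq = avg_pool(seq)
    # Multi-level aggregation, here simply a concatenation of the per-level tokens.
    aggregated = [value for token in tokens for value in token]
    return lengths, aggregated


frames = [[float(i), float(2 * i)] for i in range(8)]
lengths, aggregated = hierarchical_summary(frames)
print(lengths)          # sequence length seen at each level
print(len(aggregated))  # levels * feature dimension
```

In a real system each level would also run its own encoder block before pooling, so the per-level tokens would differ from one another rather than repeat the same mean.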
SS-BSN: Attentive Blind-Spot Network for Self-Supervised Denoising with Nonlocal Self-Similarity
Han, Young-Joo, Yu, Ha-Jin
Recently, numerous studies have been conducted on supervised learning-based image denoising methods. However, these methods rely on large-scale noisy-clean image pairs, which are difficult to obtain in practice. To address this limitation, self-supervised denoising methods that can be trained with only noisy images have been proposed. These methods are based on the convolutional neural network (CNN) and have shown promising performance. However, CNN-based methods do not exploit the nonlocal self-similarities that are essential in traditional methods, which can limit their performance. To solve this problem, this paper presents self-similarity attention (SS-Attention), a novel self-attention module that can capture nonlocal self-similarities. We focus on designing a lightweight self-attention module that operates in a pixel-wise manner, which is nearly impossible to implement with the classic self-attention module because its complexity increases quadratically with spatial resolution. Furthermore, we integrate SS-Attention into a blind-spot network, yielding the self-similarity-based blind-spot network (SS-BSN). We conduct experiments on real-world image denoising tasks. The proposed method quantitatively and qualitatively outperforms state-of-the-art methods in self-supervised denoising on the Smartphone Image Denoising Dataset (SIDD) and Darmstadt Noise Dataset (DND) benchmark datasets.
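The quadratic cost mentioned above is easy to make concrete. The sketch below only counts query-key pairs for classic pixel-wise self-attention, where every pixel attends to every other pixel. The function name and the image sizes are illustrative, and nothing here describes how SS-Attention itself reduces the cost.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Number of query-key pairs when every pixel attends to every pixel."""
    pixels = height * width
    return pixels * pixels


for side in (64, 128, 256):
    # Doubling the side length quadruples the pixel count and
    # multiplies the number of attention pairs by 16.
    print(side, full_attention_pairs(side, side))
```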
AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks
Jung, Jee-weon, Heo, Hee-Soo, Tak, Hemlata, Shim, Hye-jin, Chung, Joon Son, Lee, Bong-Jin, Yu, Ha-Jin, Evans, Nicholas
Artefacts that differentiate spoofed from bona-fide utterances can reside in spectral or temporal domains. Their reliable detection usually depends upon computationally demanding ensemble systems where each subsystem is tuned to some specific artefacts. We seek to develop an efficient, single system that can detect a broad range of different spoofing attacks without score-level ensembles. We propose a novel heterogeneous stacking graph attention layer which models artefacts spanning heterogeneous temporal and spectral domains with a heterogeneous attention mechanism and a stack node. With a new max graph operation that involves a competitive mechanism and an extended readout scheme, our approach, named AASIST, outperforms the current state-of-the-art by 20% relative. Even a lightweight variant, AASIST-L, with only 85K parameters, outperforms all competing systems.
Attentive Max Feature Map for Acoustic Scene Classification with Joint Learning considering the Abstraction of Classes
Shim, Hye-jin, Kim, Ju-ho, Jung, Jee-weon, Yu, Ha-Jin
The attention mechanism has been widely adopted in acoustic scene classification. However, we find that, because attention exclusively emphasizes certain information, it tends to discard information excessively even though it improves performance. We propose a mechanism, referred to as the attentive max feature map, which combines two effective techniques, attention and the max feature map, to refine the attention mechanism and mitigate this phenomenon. Furthermore, we explore various joint learning methods that utilize additional labels originally generated for subtask B (3 classes) on top of the existing labels for subtask A (10 classes) of the DCASE2020 challenge. We expect that using the two kinds of labels simultaneously would be helpful because the labels of the two subtasks differ in their degree of abstraction. Applying the two proposed techniques, our system achieves state-of-the-art performance among single systems on subtask A. In addition, because the model has a complexity comparable to subtask B's requirement, it shows the possibility of developing a system that fulfills the requirements of both subtasks: generalization across multiple devices and low complexity.
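For readers unfamiliar with the max feature map (MFM), a minimal sketch of the plain operation follows: the channels are split into two halves and the element-wise maximum of each pair is kept, so the output has half as many channels. The abstract does not specify how MFM is combined with attention, so that part is not shown. The list-of-lists layout is an assumption made for illustration.

```python
from typing import List


def max_feature_map(channels: List[List[float]]) -> List[List[float]]:
    """Element-wise max between the first and second half of the channels."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


# Four channels with two values each; channel i competes with channel i + 2.
features = [[1.0, 5.0], [2.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
print(max_feature_map(features))
```

Unlike a multiplicative attention weight, which rescales every value, MFM makes a hard choice between two competing channels at each position.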
Self-supervised pre-training with acoustic configurations for replay spoofing detection
Shim, Hye-jin, Heo, Hee-Soo, Jung, Jee-weon, Yu, Ha-Jin
Large datasets are well known to be a key to the recent advances in deep learning. However, dataset construction, especially for replay spoofing detection, requires the physical process of playing an utterance and re-recording it, which hinders the construction of large-scale datasets. To compensate for the limited availability of replay spoofing datasets, in this study, we propose a method for pre-training on acoustic configurations using external data unrelated to replay attacks. Here, acoustic configurations refer to the variables present in the process of a voice being uttered by a speaker and recorded through a microphone. Specifically, we select pairs of audio segments and train the network to determine whether the acoustic configurations of the two segments are identical. We conducted experiments using the ASVspoof 2019 physical access dataset, and the results revealed that our proposed method reduced the error rate by over 37% relative to the baseline.
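The pair-selection step might look like the sketch below. It assumes, as an illustration only, that two segments cut from the same recording share an acoustic configuration and that segments from different recordings do not. The abstract does not say how identical configurations are determined, and the segment and recording identifiers are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Segment = Tuple[str, str]  # (segment_id, recording_id), both hypothetical names


def make_pairs(segments: List[Segment]) -> List[Tuple[str, str, int]]:
    """Label each segment pair 1 if both come from the same recording, else 0."""
    return [
        (a_id, b_id, int(a_rec == b_rec))
        for (a_id, a_rec), (b_id, b_rec) in combinations(segments, 2)
    ]


segments = [("s1", "r1"), ("s2", "r1"), ("s3", "r2")]
for pair in make_pairs(segments):
    print(pair)
```

A network pre-trained on such binary labels needs no replay-specific data, which is the point of using external corpora.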
Cosine similarity-based adversarial process
Heo, Hee-Soo, Jung, Jee-weon, Shim, Hye-jin, Yang, IL-Ho, Yu, Ha-Jin
An adversarial process between two deep neural networks is a promising approach to train a robust model. In this paper, we propose an adversarial process using cosine similarity, whereas conventional adversarial processes are based on inverted categorical cross entropy (CCE). When used for training an identification model, the adversarial process induces competition between two discriminative models: one for a primary task such as speaker identification or image recognition, and the other for a subsidiary task such as channel identification or domain identification. In particular, the adversarial process degrades the performance of the subsidiary model by eliminating the subsidiary information in the input, which is assumed to degrade the performance of the primary model. Conventional adversarial processes maximize the CCE of the subsidiary model to degrade its performance. We have studied a framework for training robust discriminative models by eliminating channel or domain information (subsidiary information) through such an adversarial process. However, we found through experiments that maximizing the CCE does not guarantee the performance degradation of the subsidiary model. In contrast, in the proposed adversarial process using cosine similarity, the performance of the subsidiary model can be degraded more efficiently by searching for a feature space orthogonal to the subsidiary model. The experiments on speaker identification and image recognition show that we found features that make the outputs of the subsidiary models independent of the input, and that the performance of the primary models is improved.
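The geometric idea behind the orthogonality argument can be sketched in a few lines. This is an illustration rather than the paper's training procedure: the method is described as an adversarial process between networks, whereas the sketch uses a closed-form projection, and the vectors `feature` and `subsidiary_weight` are hypothetical. It shows that once a feature has zero cosine similarity with a subsidiary weight vector, the corresponding dot-product output is zero whatever the input was.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u: List[float], v: List[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def remove_component(feature: List[float], direction: List[float]) -> List[float]:
    """Subtract the projection of `feature` onto `direction`."""
    scale = dot(feature, direction) / dot(direction, direction)
    return [f - scale * d for f, d in zip(feature, direction)]


feature = [3.0, 4.0]            # hypothetical embedding
subsidiary_weight = [1.0, 0.0]  # hypothetical weight vector of a subsidiary output

print(cosine_similarity(feature, subsidiary_weight))
cleaned = remove_component(feature, subsidiary_weight)
print(cleaned, cosine_similarity(cleaned, subsidiary_weight))
```

Driving the cosine similarity toward zero has a clear fixed point, while maximizing cross entropy can simply push the subsidiary model toward confidently wrong but still informative outputs.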
Replay attack spoofing detection system using replay noise by multi-task learning
Shim, Hye-Jin, Jung, Jee-weon, Heo, Hee-Soo, Yoon, Sunghyun, Yu, Ha-Jin
In this paper, we propose a spoofing detection system for replay attacks that uses replay noise. In many previous studies across various domains, the aim has been to reduce noise. However, for replay attacks, we hypothesize that noise is the prominent feature that differs from the original signal and can be one of the keys to determining whether a signal has been spoofed. We define the noise caused by a replay attack as replay noise. Specifically, replay noise includes the noise of playback devices, recording environments, and recording devices. We explore the effectiveness of training a deep neural network simultaneously for replay attack spoofing detection and replay noise classification. Multi-task learning is exploited to embed spoofing detection and replay noise classification in the code layer. The experimental results on the ASVspoof2017 datasets demonstrate that our proposed system achieves a 30% relative improvement on the evaluation set.
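Training one network for both tasks usually means optimizing a combined objective. The sketch below assumes the common form of a weighted sum of two cross-entropy losses. The weight `noise_weight`, the class counts, and the probability values are illustrative assumptions, since the abstract does not give the actual loss.

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target])


def multi_task_loss(
    spoof_probs: List[float],
    spoof_label: int,
    noise_probs: List[float],
    noise_label: int,
    noise_weight: float = 1.0,  # assumed trade-off between the two tasks
) -> float:
    """Spoofing-detection loss plus a weighted replay-noise classification loss."""
    return cross_entropy(spoof_probs, spoof_label) + noise_weight * cross_entropy(
        noise_probs, noise_label
    )


# Hypothetical softmax outputs from two heads that share the same code layer.
loss = multi_task_loss([0.5, 0.5], 1, [0.25, 0.25, 0.5], 2)
print(loss)
```

Because both heads backpropagate through the shared layer, the learned embedding is pushed to encode replay-noise cues as well as the spoofing decision.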