AITopics | Lee, Kong Aik

Collaborating Authors

Lee, Kong Aik

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

On the Generation and Removal of Speaker Adversarial Perturbation for Voice-Privacy Protection

Guo, Chenyang, Chen, Liping, Li, Zhuhai, Lee, Kong Aik, Ling, Zhen-Hua, Guo, Wu

arXiv.org Artificial IntelligenceDec-12-2024

Neural networks are commonly known to be vulnerable to adversarial attacks mounted through subtle perturbation on the input data. Recent development in voice-privacy protection has shown the positive use cases of the same technique to conceal speaker's voice attribute with additive perturbation signal generated by an adversarial network. This paper examines the reversibility property where an entity generating the adversarial perturbations is authorized to remove them and restore original speech (e.g., the speaker him/herself). A similar technique could also be used by an investigator to deanonymize a voice-protected speech to restore criminals' identities in security and forensic analysis. In this setting, the perturbation generative module is assumed to be known in the removal process. To this end, a joint training of perturbation generation and removal modules is proposed. Experimental results on the LibriSpeech dataset demonstrated that the subtle perturbations added to the original speech can be predicted from the anonymized speech while achieving the goal of privacy protection. By removing these perturbations from the anonymized sample, the original speech can be restored. Audio samples can be found in \url{https://voiceprivacy.github.io/Perturbation-Generation-Removal/}.

artificial intelligence, machine learning, perturbation, (18 more...)

arXiv.org Artificial Intelligence

2412.09195

Country: Asia > China (0.47)

Genre: Research Report > New Finding (0.69)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

NTU-NPU System for Voice Privacy 2024 Challenge

Kuzmin, Nikita, Luong, Hieu-Thi, Yao, Jixun, Xie, Lei, Lee, Kong Aik, Chng, Eng Siong

arXiv.org Artificial IntelligenceOct-3-2024

B3 The baseline system B3 uses a Wasserstein generative adversarial In this work, we describe our submissions for the Voice Privacy network with Quadratic Transport Cost (WGAN-QC) [6] Challenge 2024. Rather than proposing a novel speech to generate artificial pseudo-speaker embeddings, anonymizing anonymization system, we enhance the provided baselines to the speaker's identity through four main steps: meet all required conditions and improve evaluated metrics. Specifically, we implement emotion embedding and experiment 1. Phonetic Transcriptions Extraction: Phonetic transcriptions with WavLM and ECAPA2 speaker embedders for the B3 baseline.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/SPSC.2024-13

2410.02371

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)

Add feedback

Cosine Scoring with Uncertainty for Neural Speaker Embedding

Wang, Qiongqiong, Lee, Kong Aik

arXiv.org Artificial IntelligenceMar-10-2024

Uncertainty modeling in speaker representation aims to learn the variability present in speech utterances. While the conventional cosine-scoring is computationally efficient and prevalent in speaker recognition, it lacks the capability to handle uncertainty. To address this challenge, this paper proposes an approach for estimating uncertainty at the speaker embedding front-end and propagating it to the cosine scoring back-end. Experiments conducted on the VoxCeleb and SITW datasets confirmed the efficacy of the proposed method in handling uncertainty arising from embedding estimation. It achieved improvement with 8.5% and 9.8% average reductions in EER and minDCF compared to the conventional cosine similarity. It is also computationally efficient in practice.

artificial intelligence, machine learning, recognition, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/LSP.2024.3375080

2403.06404

Country: Asia > China > Hong Kong (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis

Lin, Weiwei, He, Chenhang, Mak, Man-Wai, Lian, Jiachen, Lee, Kong Aik

arXiv.org Artificial IntelligenceMar-1-2024

Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence. Although significant progress has been made in recent years, the mainstream of speech synthesis models still relies on supervised speaker modeling and explicit reference utterances. However, there are many aspects of human voice, such as emotion, intonation, and speaking style, for which it is hard to obtain accurate labels. In this paper, we propose VoxGenesis, a novel unsupervised speech synthesis framework that can discover a latent speaker manifold and meaningful voice editing directions without supervision. VoxGenesis is conceptually simple. Instead of mapping speech features to waveforms deterministically, VoxGenesis transforms a Gaussian distribution into speech distributions conditioned and aligned by semantic tokens. This forces the model to learn a speaker distribution disentangled from the semantic content. During the inference, sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics. More importantly, the exploration of latent space uncovers human-interpretable directions associated with specific speaker characteristics such as gender attributes, pitch, tone, and emotion, allowing for voice editing by manipulating the latent codes along these identified directions. We conduct extensive experiments to evaluate the proposed VoxGenesis using both subjective and objective metrics, finding that it produces significantly more diverse and realistic speakers with distinct characteristics than the previous approaches. We also show that latent space manipulation produces consistent and human-identifiable effects that are not detrimental to the speech quality, which was not possible with previous approaches. Audio samples of VoxGenesis can be found at: \url{https://bit.ly/VoxGenesis}.

artificial intelligence, machine learning, speaker encoder, (12 more...)

arXiv.org Artificial Intelligence

2403.00529

Country:

North America > United States > California (0.14)
North America > Canada (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Add feedback

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Liu, Xuechen, Sahidullah, Md, Lee, Kong Aik, Kinnunen, Tomi

arXiv.org Artificial IntelligenceJan-27-2024

It is now well-known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to counteract ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input either as a bonafide, or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-targets) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protections and more economic computations. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.

artificial intelligence, machine learning, speaker verification, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TASLP.2024.3358056

2401.11156

Country: Asia > India > West Bengal (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

An Empirical Bayes Framework for Open-Domain Dialogue Generation

Lee, Jing Yang, Lee, Kong Aik, Gan, Woon-Seng

arXiv.org Artificial IntelligenceNov-17-2023

To engage human users in meaningful conversation, open-domain dialogue agents are required to generate diverse and contextually coherent dialogue. Despite recent advancements, which can be attributed to the usage of pretrained language models, the generation of diverse and coherent dialogue remains an open research problem. A popular approach to address this issue involves the adaptation of variational frameworks. However, while these approaches successfully improve diversity, they tend to compromise on contextual coherence. Hence, we propose the Bayesian Open-domain Dialogue with Empirical Bayes (BODEB) framework, an empirical bayes framework for constructing an Bayesian open-domain dialogue agent by leveraging pretrained parameters to inform the prior and posterior parameter distributions. Empirical results show that BODEB achieves better results in terms of both diversity and coherence compared to variational frameworks.

computational linguistic, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2311.10945

Country:

Europe (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.88)

Add feedback

Partially Randomizing Transformer Weights for Dialogue Response Diversity

Lee, Jing Yang, Lee, Kong Aik, Gan, Woon-Seng

arXiv.org Artificial IntelligenceNov-17-2023

Despite recent progress in generative open-domain dialogue, the issue of low response diversity persists. Prior works have addressed this issue via either novel objective functions, alternative learning approaches such as variational frameworks, or architectural extensions such as the Randomized Link (RL) Transformer. However, these approaches typically entail either additional difficulties during training/inference, or a significant increase in model size and complexity. Hence, we propose the \underline{Pa}rtially \underline{Ra}ndomized trans\underline{Former} (PaRaFormer), a simple extension of the transformer which involves freezing the weights of selected layers after random initialization. Experimental results reveal that the performance of the PaRaformer is comparable to that of the aforementioned approaches, despite not entailing any additional training difficulty or increase in model complexity.

computational linguistic, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2311.10943

Country:

Europe (1.00)
North America > United States > Texas (0.14)
North America > United States > Louisiana (0.14)
North America > United States > California (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Liu, Tianchi, Lee, Kong Aik, Wang, Qiongqiong, Li, Haizhou

arXiv.org Artificial IntelligenceNov-1-2023

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with the use of three Gaussian inference layers, each consisting of a learnable transition model that extracts distinct speech components. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor data is specifically needed, it is easily applicable in practical use.

disentangling voice and content, self-supervision, speaker recognition

arXiv.org Artificial Intelligence

2310.01128

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Speech Recognition (0.60)

Add feedback

t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Kinnunen, Tomi, Lee, Kong Aik, Tak, Hemlata, Evans, Nicholas, Nautsch, Andreas

arXiv.org Artificial IntelligenceSep-21-2023

Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliablity in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a \emph{set} of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the \emph{concurrent} t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.

artificial intelligence, false alarm rate, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TPAMI.2023.3313648

2309.12237

Country:

Europe > Finland (0.14)
Europe > Switzerland (0.14)
Europe > Germany (0.14)
(3 more...)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
(2 more...)

Add feedback

Towards single integrated spoofing-aware speaker verification embeddings

Mun, Sung Hwan, Shim, Hye-jin, Tak, Hemlata, Wang, Xin, Liu, Xuechen, Sahidullah, Md, Jeong, Myeonghun, Han, Min Hyun, Todisco, Massimiliano, Lee, Kong Aik, Yamagishi, Junichi, Evans, Nicholas, Kinnunen, Tomi, Kim, Nam Soo, Jung, Jee-weon

arXiv.org Artificial IntelligenceJun-1-2023

This study aims to develop a single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single embedding solutions by a large margin in the SASV2022 challenge. We analyze that the inferior performance of single SASV embeddings comes from insufficient amount of training data and distinct nature of ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.

artificial intelligence, machine learning, speaker verification, (19 more...)

arXiv.org Artificial Intelligence

2305.19051

Country:

Asia > South Korea (0.14)
North America > United States (0.14)

Genre: Research Report (0.70)

Industry: Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Speech > Acoustic Processing (0.85)

Add feedback