
Collaborating Authors

 Liu, Tianchi


Audio-FLAN: A Preliminary Release

arXiv.org Artificial Intelligence

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
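The abstract notes the dataset is hosted on HuggingFace. As a quick orientation, a minimal streaming-load sketch with the `datasets` library follows; the repository id and the field names (`instruction`, `output`) are placeholders rather than the release's confirmed schema.

```python
# Minimal sketch: streaming a large instruction-tuning dataset such as
# Audio-FLAN from the HuggingFace Hub. REPO_ID and the field names are
# placeholders; consult the official HuggingFace page for the real schema.
from datasets import load_dataset

REPO_ID = "<org>/Audio-FLAN"  # placeholder; see the paper's HuggingFace link

ds = load_dataset(REPO_ID, split="train", streaming=True)
for example in ds.take(3):
    # typical instruction-tuning fields: a task instruction, an input
    # (audio or a path to it), and the target output
    print(example.get("instruction"), "->", example.get("output"))
```

Streaming avoids materializing all 100 million-plus instances on disk before inspecting them, which matters at this scale.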


ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification

arXiv.org Artificial Intelligence

In speaker verification, we use computational methods to verify whether an utterance matches the identity of an enrolled speaker. This task parallels manual forensic voice comparison, where linguistic analysis is combined with auditory measurements to compare and evaluate voice samples. Despite much success, we have yet to develop a speaker verification system that offers explainable results comparable to those from manual forensic voice comparison. In this paper, we propose the Explainable Phonetic Trait-Oriented (ExPO) network, a novel approach that introduces the speaker's phonetic traits, which describe the speaker's characteristics at the phonetic level, resembling what forensic comparison does. ExPO not only generates utterance-level speaker embeddings but also allows for fine-grained analysis and visualization of phonetic traits, offering an explainable speaker verification process. Furthermore, we investigate phonetic traits from within-speaker and between-speaker variation perspectives to determine which traits are most effective for speaker verification, marking an important step towards explainable speaker verification. Our code is available at https://github.com/mmmmayi/ExPO.
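To make the phonetic-trait idea concrete, here is a rough sketch: pool frame-level speaker features per phoneme (frame-to-phoneme alignments from a forced aligner are assumed) to obtain one trait vector per phoneme, then compare traits within and between speakers. This is a simplified stand-in for ExPO's trait extraction, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def phonetic_trait_embeddings(frames: torch.Tensor,
                              phone_ids: torch.Tensor,
                              num_phones: int) -> torch.Tensor:
    """Mean-pool frame embeddings per phoneme class.
    frames:    (T, D) frame-level speaker features
    phone_ids: (T,) phoneme label per frame (from a forced aligner)
    returns:   (num_phones, D), one L2-normalized trait vector per phoneme
    """
    traits = torch.zeros(num_phones, frames.size(1))
    for p in range(num_phones):
        mask = phone_ids == p
        if mask.any():
            traits[p] = frames[mask].mean(dim=0)
    return F.normalize(traits, dim=-1)

# Within- vs. between-speaker comparison for a phoneme p:
#   sim_within  = F.cosine_similarity(traits_A_utt1[p], traits_A_utt2[p], dim=0)
#   sim_between = F.cosine_similarity(traits_A_utt1[p], traits_B[p], dim=0)
# Traits with high within-speaker and low between-speaker similarity are
# the most useful for verification.
```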


MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond

arXiv.org Artificial Intelligence

This technical report describes the MERaLiON-SpeechEncoder, a foundation model designed to support a wide range of downstream speech applications. Developed as part of Singapore's National Multimodal Large Language Model Programme, the MERaLiON-SpeechEncoder is tailored to address the speech processing needs of Singapore and the surrounding Southeast Asian region. The model currently supports mainly English, including the variety spoken in Singapore, and we are actively expanding our datasets to gradually cover other languages in subsequent releases. The MERaLiON-SpeechEncoder was pre-trained from scratch on 200,000 hours of unlabelled speech data using a self-supervised learning approach based on masked language modelling. We describe our training procedure and hyperparameter tuning experiments in detail in this report. Our evaluation demonstrates improvements on spontaneous and Singapore speech benchmarks for speech recognition, while remaining competitive with other state-of-the-art speech encoders across ten other speech tasks. We commit to releasing our model to support broader research endeavours, both in Singapore and beyond.
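The masked-language-modelling objective mentioned above can be illustrated with a toy masked-prediction module in PyTorch. The architecture and the discrete targets (e.g., k-means cluster ids, as in HuBERT-style training) are illustrative assumptions, not MERaLiON's actual configuration.

```python
import torch
import torch.nn as nn

class MaskedSpeechMLM(nn.Module):
    """Toy BERT-style masked prediction over speech frames; a sketch of
    the general objective, not the MERaLiON-SpeechEncoder itself."""
    def __init__(self, dim: int = 768, num_targets: int = 500):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_emb = nn.Parameter(torch.randn(dim))  # learned [MASK] vector
        self.head = nn.Linear(dim, num_targets)

    def forward(self, feats, targets, mask):
        # feats: (B, T, D) frame features; targets: (B, T) discrete ids
        # (e.g., k-means cluster labels); mask: (B, T) bool, True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, feats)
        logits = self.head(self.encoder(x))
        # the loss is computed only on masked positions
        return nn.functional.cross_entropy(logits[mask], targets[mask])
```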


Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection

arXiv.org Artificial Intelligence

Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rates (EERs) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, enabling computational efficiency by utilizing only a partial pre-trained model.
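A simplified sketch of the merging idea follows: score each layer's output, softmax across layers, and take the weighted sum of the stacked hidden states. The mean-pool-then-score choice is an assumption; the paper's attentive merging may differ in detail.

```python
import torch
import torch.nn as nn

class AttentiveMerge(nn.Module):
    """Learned weighting over the stacked hidden states of a pre-trained
    speech encoder; a simplified take on attentive merging."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # relevance score per layer

    def forward(self, hidden_states):
        # hidden_states: (L, B, T, D) outputs of all L transformer layers
        pooled = hidden_states.mean(dim=2)            # (L, B, D) pool over time
        w = torch.softmax(self.score(pooled), dim=0)  # (L, B, 1) layer weights
        return (w.unsqueeze(2) * hidden_states).sum(dim=0)  # (B, T, D)

# Usage with WavLM via `transformers` (large model dim = 1024):
#   out = wavlm(waveform, output_hidden_states=True)
#   merged = AttentiveMerge(dim=1024)(torch.stack(out.hidden_states))
```

Since the finding is that early layers dominate, the same module can be applied to only the first few layers' states, skipping the rest of the forward pass to save computation.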


How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?

arXiv.org Artificial Intelligence

Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artifacts of transition regions created when concatenating bona fide and spoofed audio. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Our further investigation explains the varying nature of CMs' focus when making correct or incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation for interpretability in the field of partially spoofed audio detection, which has not been well explored previously.
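For readers unfamiliar with Grad-CAM, a generic sketch over a convolutional countermeasure follows. The model, the chosen layer, and the spectrogram input are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, spec, target_class):
    """Plain Grad-CAM: weight a conv layer's activations by its
    spatially averaged gradients and ReLU the sum.
    spec: (1, 1, F, T) spectrogram input; model returns (1, C) logits."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model(spec)[0, target_class].backward()
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1) weights
    cam = F.relu((w * acts["a"]).sum(dim=1))        # (1, F', T') saliency map
    return cam / cam.max().clamp_min(1e-8)
```

Overlaying the upsampled map on the spectrogram shows where the CM looks; per the paper's finding, CMs trained on partially spoofed audio light up around concatenation boundaries.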


Disentangling Voice and Content with Self-Supervision for Speaker Recognition

arXiv.org Artificial Intelligence

For speaker recognition, it is difficult to extract an accurate speaker representation from speech because speech mixes speaker traits with content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech. It is realized with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without using any labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively. Since neither additional model training nor extra data is required, the framework is readily applicable in practice.
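As a conceptual illustration of a Gaussian inference layer with a learnable transition model, consider the minimal linear-Gaussian state-space sketch below: a transition network propagates a latent component across time, and a data-dependent gate (playing the role of a Kalman gain) blends in each new frame. The paper's strengthened transition model is more elaborate; this only conveys the predict-update structure.

```python
import torch
import torch.nn as nn

class GaussianInferenceLayer(nn.Module):
    """Minimal predict-update recurrence over frames; a sketch of the
    idea of a learnable transition model, not the paper's design."""
    def __init__(self, dim: int):
        super().__init__()
        self.transition = nn.Linear(dim, dim)  # learnable dynamics
        self.gate = nn.Linear(2 * dim, dim)    # data-dependent "Kalman gain"

    def forward(self, x):
        # x: (B, T, D) frame features; returns one extracted component
        B, T, D = x.shape
        h = x.new_zeros(B, D)
        outs = []
        for t in range(T):
            pred = self.transition(h)                          # predict
            k = torch.sigmoid(self.gate(torch.cat([pred, x[:, t]], dim=-1)))
            h = pred + k * (x[:, t] - pred)                    # update
            outs.append(h)
        return torch.stack(outs, dim=1)

# The framework stacks three such layers, each intended to capture a
# distinct component (e.g., speaker traits vs. content).
```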