AITopics | Lai, Cheng-I

Collaborating Authors

Lai, Cheng-I

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Large-Scale Evaluation of Speech Foundation Models

Yang, Shu-wen, Chang, Heng-Jui, Huang, Zili, Liu, Andy T., Lai, Cheng-I, Wu, Haibin, Shi, Jiatong, Chang, Xuankai, Tsai, Hsiang-Sheng, Huang, Wen-Chin, Feng, Tzu-hsun, Chi, Po-Han, Lin, Yist Y., Chuang, Yung-Sung, Huang, Tzu-Hsien, Tseng, Wei-Cheng, Lakhotia, Kushal, Li, Shang-Wen, Mohamed, Abdelrahman, Watanabe, Shinji, Lee, Hung-yi

arXiv.org Artificial IntelligenceMay-29-2024

The foundation model paradigm leverages a shared foundation model to achieve state-of-the-art (SOTA) performance for various tasks, requiring minimal downstream-specific modeling and data annotation. This approach has proven crucial in the field of Natural Language Processing (NLP). However, the speech processing community lacks a similar setup to explore the paradigm systematically. In this work, we establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the paradigm for speech. We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads. Combining our results with community submissions, we verify that the foundation model paradigm is promising for speech, and our multi-tasking framework is simple yet effective, as the best-performing foundation model shows competitive generalizability across most SUPERB tasks. For reproducibility and extensibility, we have developed a long-term maintained platform that enables deterministic benchmarking, allows for result sharing via an online leaderboard, and promotes collaboration through a community-driven benchmark database to support new development cycles. Finally, we conduct a series of analyses to offer an in-depth understanding of SUPERB and speech foundation models, including information flows across tasks inside the models, the correctness of the weighted-sum benchmarking protocol and the statistical significance and robustness of the benchmark.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2404.09385

Country:

Asia (0.67)
North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

Tseng, Yuan, Lai, Cheng-I, Lee, Hung-yi

arXiv.org Artificial IntelligenceMay-9-2023

Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) direct training an unsupervised parser on continuous word-level speech representations. This is done by first splitting utterances into sequences of word-level segments, and aggregating self-supervised speech representations within segments to obtain segment embeddings. We find that separately training a parser on the unpaired text and directly applying it on ASR transcripts for inference produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.

artificial intelligence, natural language, transcript, (18 more...)

arXiv.org Artificial Intelligence

2303.08809

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

Towards Semi-Supervised Semantics Understanding from Speech

Lai, Cheng-I, Cao, Jin, Bodapati, Sravan, Li, Shang-Wen

arXiv.org Artificial IntelligenceNov-10-2020

Much recent work on Spoken Language Understanding (SLU) falls short in at least one of three ways: models were trained on oracle text input and neglected the Automatics Speech Recognition (ASR) outputs, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. We proposed a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed speech to address these. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus. In parallel, we identified two inadequate settings under which SLU models have been tested: noise-robustness and E2E semantics evaluation. We tested the proposed framework under realistic environmental noises and with a new metric, the slots edit F1 score, on two public SLU corpora. Experiments show that our SLU framework with speech as input can perform on par with those with oracle text as input in semantics understanding, while environmental noises are present, and a limited amount of labeled semantics data is available.

artificial intelligence, arxiv preprint arxiv, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2011.06195

Country: North America > United States > Pennsylvania (0.14)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Attentive Filtering Networks for Audio Replay Attack Detection

Lai, Cheng-I, Abad, Alberto, Richmond, Korin, Yamagishi, Junichi, Dehak, Najim, King, Simon

arXiv.org Machine LearningOct-30-2018

ABSTRACT An attacker may use a variety of techniques to fool an automatic speaker verification system into accepting them as a genuine user. Anti-spoofing methods meanwhile aim to make the system robust against such attacks. The ASVspoof 2017 Challenge focused specifically on replay attacks, with the intention of measuring the limits of replay attack detection as well as developing countermeasures against them. In this work, we propose our replay attacks detection system - Attentive Filtering Network, which is composed of an attention-based filtering mechanism that enhances feature representations in both the frequency and time domains, and a ResNet-based classifier. We show that the network enables us to visualize the automatically acquired feature representations that are helpful for spoofing detection. Index Terms-- ASVspoof, Anti-Spoofing, Spoofing Attack, Replay Attacks, Automatic Speaker Verification 1. INTRODUCTION Automatic speaker verification (ASV) systems have become increasingly widespread in recent years with the advent of voice assistant and smart home devices.

asvspoof 2017, deep learning, speech recognition, (19 more...)

arXiv.org Machine Learning

1810.13048

Country: Europe > Portugal (0.14)

Genre: Research Report (0.83)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback