AITopics | Tseng, Liang-Hsuan

Collaborating Authors

Tseng, Liang-Hsuan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

Yang, Chih-Kai, Fu, Yu-Kuan, Li, Chen-An, Lin, Yi-Cheng, Lin, Yu-Xiang, Chen, Wei-Chih, Chung, Ho Lam, Kuan, Chun-Yi, Huang, Wei-Ping, Lu, Ke-Han, Lin, Tzu-Quan, Wang, Hsiu-Hsuan, Hu, En-Pei, Hsu, Chan-Jan, Tseng, Liang-Hsuan, Chiu, I-Hsiang, Sanga, Ulin, Chen, Xuanjun, Hsu, Po-chun, Yang, Shu-wen, Lee, Hung-yi

arXiv.org Artificial IntelligenceDec-27-2024

This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2411.07111

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.67)
Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

Tseng, Liang-Hsuan, Chen, Zih-Ching, Chang, Wei-Shun, Lee, Cheng-Kuang, Huang, Tsung-Ren, Lee, Hung-yi

arXiv.org Artificial IntelligenceJul-15-2024

Recent advances in automatic speech recognition (ASR) often rely on large speech foundation models for generating high-quality transcriptions. However, these models can be impractical due to limited computing resources. The situation is even more severe in terms of more realistic or difficult scenarios, such as code-switching ASR (CS-ASR). To address this, we present a framework for developing more efficient models for CS-ASR through knowledge distillation using realistic speech-only data. Our proposed method, Leave No Knowledge Behind During Knowledge Distillation (K$^2$D), leverages both the teacher model's knowledge and additional insights from a small auxiliary model. We evaluate our approach on two in-domain and two out-domain datasets, demonstrating that K$^2$D is effective. By conducting K$^2$D on the unlabeled realistic data, we have successfully obtained a 2-time smaller model with 5-time faster generation speed while outperforming the baseline methods and the teacher model on all the testing sets. We have made our model publicly available on Hugging Face (https://huggingface.co/andybi7676/k2d-whisper.zh-en).

knowledge distillation, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2407.10603

Country: Asia > Taiwan (0.15)

Genre: Research Report (0.82)

Industry: Education (0.94)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR

Tseng, Liang-Hsuan, Hu, En-Pei, Chiang, Cheng-Han, Tseng, Yuan, Lee, Hung-yi, Lee, Lin-shan, Sun, Shao-Hua

arXiv.org Artificial IntelligenceFeb-6-2024

Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is a segmental structure segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.

artificial intelligence, segmentation model, speech recognition, (12 more...)

arXiv.org Artificial Intelligence

2402.03988

Country: Asia > Taiwan (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Add feedback

Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation

Fu, Yu-Kuan, Tseng, Liang-Hsuan, Shi, Jiatong, Li, Chen-An, Hsu, Tsu-Yuan, Watanabe, Shinji, Lee, Hung-yi

arXiv.org Artificial IntelligenceMay-12-2023

Most of the speech translation models heavily rely on parallel data, which is hard to collect especially for low-resource languages. To tackle this issue, we propose to build a cascaded speech translation system without leveraging any kind of paired data. We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS. The results show that our work is comparable with some other early supervised methods in some language pairs. While cascaded systems always suffer from severe error propagation problems, we proposed denoising back-translation (DBT), a novel approach to building robust unsupervised neural machine translation (UNMT). DBT successfully increases the BLEU score by 0.7--0.9 in all three translation directions. Moreover, we simplified the pipeline of our cascaded system to reduce inference latency and conducted a comprehensive analysis of every part of our work. We also demonstrate our unsupervised speech translation results on the established website.

artificial intelligence, natural language, translation, (16 more...)

arXiv.org Artificial Intelligence

2305.07455

Genre:

Research Report > New Finding (0.54)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Introducing Semantics into Speech Encoders

Xu, Derek, Dong, Shuyan, Wang, Changhan, Kim, Suyoun, Lin, Zhaojiang, Shrivastava, Akshat, Li, Shang-Wen, Tseng, Liang-Hsuan, Baevski, Alexei, Lin, Guan-Ting, Lee, Hung-yi, Sun, Yizhou, Wang, Wei

arXiv.org Artificial IntelligenceNov-15-2022

Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio transcriptions, which is expensive and time-consuming to obtain. We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. By introducing semantics, we improve existing speech encoder spoken language understanding performance by over 10\% on intent classification, with modest gains in named entity resolution and slot filling, and spoken question answering FF1 score by over 2\%. Our unsupervised approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts, demonstrating the feasibility of unsupervised semantic augmentations to existing speech encoders.

artificial intelligence, natural language, text processing, (16 more...)

arXiv.org Artificial Intelligence

2211.08402

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback