pinyin
Chain of Correction for Full-text Speech Recognition with Large Language Models
Tang, Zhiyuan, Wang, Dong, Zhou, Zhikai, Liu, Yong, Huang, Shen, Shang, Shidong
Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC's performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
Deng, Wei, Zhou, Siyi, Shu, Jingchen, Wang, Jinchao, Wang, Lu
Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become the mainstream in the industry due to their high naturalness and powerful zero-shot voice cloning capabilities.Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise model. We add some novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also performed a comparative analysis of the Vector Quantization (VQ) with Finite-Scalar Quantization (FSQ) for codebook utilization of acoustic speech tokens. To further enhance the effect and stability of voice cloning, we introduce a conformer-based speech conditional encoder and replace the speechcode decoder with BigVGAN2. Compared with XTTS, it has achieved significant improvements in naturalness, content consistency, and zero-shot voice cloning. As for the popular TTS systems in the open-source, such as Fish-Speech, CosyVoice2, FireRedTTS and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.
Learning from Mistakes: Self-correct Adversarial Training for Chinese Unnatural Text Correction
Feng, Xuan, Gu, Tianlong, Liu, Xiaoli, Chang, Liang
Unnatural text correction aims to automatically detect and correct spelling errors or adversarial perturbation errors in sentences. Existing methods typically rely on fine-tuning or adversarial training to correct errors, which have achieved significant success. However, these methods exhibit poor generalization performance due to the difference in data distribution between training data and real-world scenarios, known as the exposure bias problem. In this paper, we propose a self-correct adversarial training framework for \textbf{L}earn\textbf{I}ng from \textbf{MI}s\textbf{T}akes (\textbf{LIMIT}), which is a task- and model-independent framework to correct unnatural errors or mistakes. Specifically, we fully utilize errors generated by the model that are actively exposed during the inference phase, i.e., predictions that are inconsistent with the target. This training method not only simulates potential errors in real application scenarios, but also mitigates the exposure bias of the traditional training process. Meanwhile, we design a novel decoding intervention strategy to maintain semantic consistency. Extensive experimental results on Chinese unnatural text error correction datasets show that our proposed method can correct multiple forms of errors and outperforms the state-of-the-art text correction methods. In addition, extensive results on Chinese and English datasets validate that LIMIT can serve as a plug-and-play defense module and can extend to new models and datasets without further training.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (7 more...)
Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs
Yuhang, Yang, Yizhou, Peng, Chng, Eng Siong, Zhong, Xionghu
The integration of large language models (LLMs) with pre-trained speech models has opened up new avenues in automatic speech recognition (ASR). While LLMs excel in multimodal understanding tasks, effectively leveraging their capabilities for ASR remains a significant challenge. This paper presents a novel training approach to enhance LLM performance in ASR tasks. We propose pre-training LLMs on Pinyin embedding sequences, which represent pronunciation features, to generate corresponding Chinese characters. This step enables the LLM to adapt to generating text from pronunciation features before encountering real speech data. Furthermore, we fine-tune the LoRA parameters to enhance the LLM's understanding of speech modality information. In AISHELL-1 corpus, our approach yields a 9.5% relative improvement in ASR tasks compared to the baseline without Pinyi-to-Character pre-training. Additionally, incorporating auxiliary text data for Pinyi-to-Character pre-training further boosts performance, achieving a 19.0% relative improvement.
Large Language Model Should Understand Pinyin for Chinese ASR Error Correction
Li, Yuang, Qiao, Xiaosong, Zhao, Xiaofeng, Zhao, Huan, Tang, Wei, Zhang, Min, Yang, Hao
Large language models can enhance automatic speech recognition systems through generative error correction. In this paper, we propose Pinyin-enhanced GEC, which leverages Pinyi, the phonetic representation of Mandarin Chinese, as supplementary information to improve Chinese ASR error correction. Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference. Additionally, we introduce a multitask training approach involving conversion tasks between Pinyin and text to align their feature spaces. Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input. More importantly, we provide intuitive explanations for the effectiveness of PY-GEC and multitask training from two aspects: 1) increased attention weight on Pinyin features; and 2) aligned feature space between Pinyin and text hidden states.
- South America > Paraguay > Asunción > Asunción (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- (3 more...)
Can Pre-trained Language Models Understand Chinese Humor?
Chen, Yuyan, Li, Zhixu, Liang, Jiaqing, Xiao, Yanghua, Liu, Bang, Chen, Yunwen
Humor understanding is an important and challenging research in natural language processing. As the popularity of pre-trained language models (PLMs), some recent work makes preliminary attempts to adopt PLMs for humor recognition and generation. However, these simple attempts do not substantially answer the question: {\em whether PLMs are capable of humor understanding?} This paper is the first work that systematically investigates the humor understanding ability of PLMs. For this purpose, a comprehensive framework with three evaluation steps and four evaluation tasks is designed. We also construct a comprehensive Chinese humor dataset, which can fully meet all the data requirements of the proposed evaluation framework. Our empirical study on the Chinese humor dataset yields some valuable observations, which are of great guiding value for future optimization of PLMs in humor understanding and generation.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore > Central Region > Singapore (0.05)
- Asia > China > Shanghai > Shanghai (0.05)
- (10 more...)
Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models
Tang, Zhiyuan, Wang, Dong, Huang, Shen, Shang, Shidong
Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypotheses-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which contains a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs when compared with those without regularization. The dataset is available on the website.
MFE-NER: Multi-feature Fusion Embedding for Chinese Named Entity Recognition
In Chinese Named Entity Recognition, character substitution is a complicated linguistic phenomenon. Some Chinese characters are quite similar as they share the same components or have similar pronunciations. People replace characters in a named entity with similar characters to generate a new collocation but referring to the same object. As a result, it always leads to unrecognizable or mislabeling errors in the NER task. In this paper, we propose a lightweight method, MFE-NER, which fuses glyph and phonetic features, to help pre-trained language models handle the character substitution problem in the NER task with limited extra cost. Basically, in the glyph domain, we disassemble Chinese characters into Five-Stroke components to represent structure features. In the phonetic domain, an improved phonetic system is proposed in our work, making it reasonable to describe phonetic similarity among Chinese characters. Experiments demonstrate that our method performs especially well in detecting character substitutions while slightly improving the overall performance of Chinese NER.
- Asia > China > Shanghai > Shanghai (0.05)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Exploring Chinese Humor Generation: A Study on Two-Part Allegorical Sayings
Humor, a culturally nuanced aspect of human language, poses challenges for computational understanding and generation, especially in Chinese humor, which remains relatively unexplored in the NLP community. This paper investigates the capability of state-of-the-art language models to comprehend and generate Chinese humor, specifically focusing on training them to create allegorical sayings. We employ two prominent training methods: fine-tuning a medium-sized language model and prompting a large one. Our novel fine-tuning approach incorporates fused Pinyin embeddings to consider homophones and employs contrastive learning with synthetic hard negatives to distinguish humor elements. Human-annotated results show that these models can generate humorous allegorical sayings, with prompting proving to be a practical and effective method. However, there is still room for improvement in generating allegorical sayings that match human creativity.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- North America > United States > Nevada > Clark County > Las Vegas (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
External Knowledge Augmented Polyphone Disambiguation Using Large Language Model
One of the key issues in Mandarin Chinese text-to-speech (TTS) systems is polyphone disambiguation when doing grapheme-to-phoneme (G2P) conversion. In this paper, we introduce a novel method to solve the problem as a generation task. Following the trending research of large language models (LLM) and prompt learning, the proposed method consists of three modules. Retrieval module incorporates external knowledge which is a multi-level semantic dictionary of Chinese polyphonic characters to format the sentence into a prompt. Generation module adopts the decoder-only Transformer architecture to induce the target text. Postprocess module corrects the generated text into a valid result if needed. Experimental results show that our method outperforms the existing methods on a public dataset called CPP. We also empirically study the impacts of different templates of the prompt, different sizes of training data, and whether to incorporate external knowledge.