Ou, Zhijian
Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training
Dong, Lukuan, Qin, Donghong, Bai, Fengbo, Song, Fanhua, Liu, Yan, Xu, Chen, Ou, Zhijian
The mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that the annotated speech is very limited; in our practice, it takes non-trivial effort to collect and transcribe even less than 10 hours of Iu Mien speech. With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares the three approaches for Iu Mien speech recognition.
Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision
Yusuyin, Saierdaer, Ma, Te, Huang, Hao, Zhao, Wenbo, Ou, Zhijian
There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pre-training with phonetic transcription, supervised pre-training with graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments are conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at https://github.com/thu-spmi/CAT upon publication.
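To make the weak phonetic supervision concrete, here is a minimal sketch of turning text transcripts into IPA phoneme targets. The dictionary-based mapping and its entries are toy stand-ins invented for illustration; the actual pipeline uses the LanguageNet G2P models, which are full grapheme-to-phoneme systems rather than per-character lookups.

```python
# Sketch: deriving weak IPA phoneme supervision from text via a toy G2P
# lexicon. Entries are hypothetical; real G2P handles context, digraphs, etc.

TOY_G2P = {
    "a": ["a"], "b": ["b"], "c": ["k"], "o": ["o"], "t": ["t"],
}

def word_to_ipa(word, g2p=TOY_G2P):
    """Map each grapheme to its IPA phoneme(s); unknown graphemes become <unk>."""
    phones = []
    for ch in word.lower():
        phones.extend(g2p.get(ch, ["<unk>"]))
    return phones

def transcript_to_phonemes(text):
    """Flatten a transcript into one phoneme sequence (a CTC-style target)."""
    seq = []
    for word in text.split():
        seq.extend(word_to_ipa(word))
    return seq

print(transcript_to_phonemes("cat bot"))  # ['k', 'a', 't', 'b', 'o', 't']
```

Because the targets are IPA phonemes shared across languages, the same output inventory serves all pre-training languages, which is the information-sharing advantage the abstract refers to.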
Elastic CRFs for Open-ontology Slot Filling
Dai, Yinpei, Zhang, Yichi, Liu, Hong, Ou, Zhijian, Huang, Yi, Feng, Junlan
Slot filling is a crucial component in task-oriented dialog systems that is used to parse (user) utterances into semantic concepts called slots. An ontology is defined by the collection of slots and the values that each slot can take. The most widely used practice of treating slot filling as a sequence labeling task suffers from two main drawbacks. First, the ontology is usually pre-defined and fixed, so the model is not able to detect new labels for unseen slots. Second, the one-hot encoding of slot labels ignores the correlations between slots with similar semantics, which makes it difficult to share knowledge learned across different domains. To address these problems, we propose a new model called elastic conditional random field (eCRF), where each slot is represented by the embedding of its natural language description and modeled by a CRF layer. New slot values can be detected by eCRF whenever a language description is available for the slot. In our experiments, we show that eCRFs outperform existing models in both in-domain and cross-domain tasks, especially in predicting unseen slots and values.
The 2nd FutureDial Challenge: Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG)
Cai, Yucheng, Chen, Si, Huang, Yi, Feng, Junlan, Ou, Zhijian
Developing intelligent dialog systems has been one of the longest-running goals in AI. In recent years, significant progress has been made in building dialog systems with the breakthrough of deep learning methods and the large amount of conversational data made available for system development (Budzianowski et al., 2018; Ou et al., 2022a; Ouyang et al., 2022; Achiam et al., 2023). Still, many challenges remain toward building future dialog systems. The first FutureDial challenge, which focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDial-SereTOD) (Ou et al., 2022a;b), was successfully held at the EMNLP 2022 SereTOD workshop. However, problems like hallucination and fabrication (Alkaissi & McFarlane, 2023) still hinder the use of such systems in real-life applications like customer service, which require pinpoint accuracy. Retrieval augmented generation (RAG) (Lewis et al., 2020; Guu et al., 2020) has been introduced to enhance dialog systems with information retrieved from external knowledge bases and has attracted increasing interest.
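The RAG pattern referenced above can be sketched in a few lines: retrieve the most relevant knowledge snippets for the dialog history, then condition the generator on them. The word-overlap scoring and the toy knowledge base below stand in for a learned dense retriever and a real customer-service knowledge base.

```python
# Minimal retrieval-augmented generation loop: retrieve top-k knowledge
# snippets by word overlap with the dialog history, then prepend them to
# the generator's input. Scoring and KB contents are toy stand-ins.

def retrieve(query, knowledge_base, k=1):
    q = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_generator_input(history, knowledge_base, k=1):
    passages = retrieve(history, knowledge_base, k)
    return "knowledge: " + " | ".join(passages) + " dialog: " + history

kb = ["the data plan costs 30 yuan", "roaming is disabled by default"]
print(build_generator_input("how much is the data plan", kb))
```

Grounding the generator on retrieved passages, rather than on its parameters alone, is what the challenge targets as a remedy for hallucination and fabrication.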
Energy-Based Models with Applications to Speech and Language Processing
Ou, Zhijian
Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). Over the past years, EBMs have attracted increasing interest not only from the core machine learning community, but also from application domains such as speech, vision, natural language processing (NLP) and so on, due to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and needs a different treatment from processing fixed-dimensional data (e.g., images). Therefore, the purpose of this monograph is to present a systematic introduction to energy-based models, including both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and various learning methods from the classic learning algorithms to the most advanced ones. Then, the application of EBMs in three different scenarios is presented, i.e., for modeling marginal, conditional and joint distributions, respectively. 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, and their applications in semi-supervised learning and calibrated natural language understanding.
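The un-normalized character of EBMs can be illustrated with a tiny discrete example: an energy function defines p(x) proportional to exp(-E(x)), and probabilities only become normalized once the partition function Z is computed. Z is exactly summable here because the space is tiny; for realistic models it is intractable, which motivates the sampling and learning methods the monograph surveys. The Ising-style chain energy is a hypothetical choice for illustration.

```python
# Toy EBM over binary sequences of length n: p(x) = exp(-E(x)) / Z.
import itertools, math

def energy(x):
    # Hypothetical energy: penalize adjacent disagreements (Ising-style chain).
    return sum(1.0 for a, b in zip(x, x[1:]) if a != b)

def partition_function(n):
    # Exact Z by brute-force enumeration -- only feasible for tiny spaces.
    return sum(math.exp(-energy(x)) for x in itertools.product([0, 1], repeat=n))

def prob(x):
    return math.exp(-energy(x)) / partition_function(len(x))

total = sum(prob(x) for x in itertools.product([0, 1], repeat=3))
print(round(total, 6))  # normalized probabilities sum to 1
```

Note that `energy` alone already ranks configurations (smooth sequences score higher); only absolute probabilities require Z, which is why many applications can work with un-normalized scores.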
Prompt Pool based Class-Incremental Continual Learning for Dialog State Tracking
Liu, Hong, Cai, Yucheng, Zhou, Yuan, Ou, Zhijian, Huang, Yi, Feng, Junlan
Continual learning is crucial for dialog state tracking (DST) in dialog systems, since requirements from users for new functionalities are often encountered. However, most existing continual learning methods for DST require task identities during testing, which is a severe limitation in real-world applications. In this paper, we aim to address continual learning of DST in the class-incremental scenario (namely, the task identity is unknown in testing). Inspired by the recently emerging prompt tuning method that performs well on dialog systems, we propose to use the prompt pool method, where we maintain a pool of key-value paired prompts and select prompts from the pool according to the distance between the dialog history and the prompt keys. The proposed method can automatically identify tasks and select appropriate prompts during testing. We conduct experiments on the Schema-Guided Dialog (SGD) dataset and another dataset collected from a real-world dialog application. Experiment results show that the prompt pool method achieves much higher joint goal accuracy than the baseline. After combining with a rehearsal buffer, the model performance can be further improved.
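The prompt-pool selection step can be sketched directly: keep (key, prompt) pairs and pick the prompts whose keys are most similar to the dialog-history embedding, so no task identity is needed at test time. The hand-set vectors below stand in for encoder outputs.

```python
# Sketch of prompt-pool selection: rank prompt keys by cosine similarity
# to the dialog-history embedding and return the top-k prompts.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def select_prompts(history_emb, pool, top_k=1):
    """pool: list of (key_vector, prompt); returns the top_k closest prompts."""
    ranked = sorted(pool, key=lambda kv: cosine(history_emb, kv[0]), reverse=True)
    return [prompt for _, prompt in ranked[:top_k]]

pool = [([1.0, 0.0], "restaurant-task prompt"), ([0.0, 1.0], "hotel-task prompt")]
print(select_prompts([0.9, 0.1], pool))  # ['restaurant-task prompt']
```

The selection is the implicit task identification: a history close to the restaurant key retrieves restaurant prompts without any task label being supplied.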
UniPCM: Universal Pre-trained Conversation Model with Task-aware Automatic Prompt
Cai, Yucheng, Ma, Wentao, Wu, Yuchuan, Si, Shuzheng, Shao, Yuan, Ou, Zhijian, Li, Yongbin
Recent research has shown that multi-task pre-training greatly improves the model's robustness and transfer ability, which is crucial for building a high-quality dialog system. However, most previous works on multi-task pre-training rely heavily on human-defined input formats or prompts, which are suboptimal in both quality and quantity. In this work, we propose Task-based Automatic Prompt generation (TAP) to automatically generate high-quality prompts. Using the high-quality prompts generated, we scale the corpus of the pre-trained conversation model to 122 datasets from 15 dialog-related tasks, resulting in the Universal Pre-trained Conversation Model (UniPCM), a powerful foundation model for various conversational tasks and different dialog systems. Extensive experiments have shown that UniPCM is robust to input prompts and capable of various dialog-related tasks. Moreover, UniPCM has strong transfer ability and excels in low-resource scenarios, achieving SOTA results on 9 different datasets ranging from task-oriented dialog to open-domain conversation. Notably, we find that TAP can generate prompts on par with those collected via crowdsourcing. The code is released with the paper.
Exploring Energy-based Language Models with Different Architectures and Training Methods for Speech Recognition
Liu, Hong, Lv, Zhaobiao, Ou, Zhijian, Zhao, Wenbo, Xiao, Qing
Energy-based language models (ELMs) parameterize an unnormalized distribution for natural sentences and are radically different from popular autoregressive language models (ALMs). As an important application, ELMs have been successfully used as a means for calculating sentence scores in speech recognition, but they have all used less-modern CNN or LSTM networks. The recent progress in Transformer networks and large pretrained models such as BERT and GPT2 opens new possibilities for further advancing ELMs. In this paper, we explore different architectures of energy functions and different training methods to investigate the capabilities of ELMs in rescoring for speech recognition, all using large pretrained models as backbones.
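The rescoring use of ELMs can be sketched as follows: since only the relative ranking of hypotheses matters, the unnormalized ELM score (negative energy) can be interpolated with the acoustic score without ever computing a partition function. All scores and the interpolation weight below are illustrative numbers, not outputs of a real recognizer.

```python
# Sketch of n-best rescoring with an energy-based LM: combined score is
# acoustic score plus a weighted LM score, where the LM score is -energy.
# No normalization is needed because only the ranking matters.

def rescore_nbest(nbest, lm_energy, lm_weight=1.0):
    """nbest: list of (hypothesis, acoustic_score). Returns the best hypothesis."""
    def combined(item):
        hyp, am_score = item
        return am_score - lm_weight * lm_energy(hyp)  # negative energy = LM score
    return max(nbest, key=combined)[0]

# Toy energies: the grammatical sentence gets lower energy (higher LM score).
toy_energy = {"i saw a cat": 1.0, "eye saw a cat": 5.0}
nbest = [("eye saw a cat", 10.2), ("i saw a cat", 10.0)]
print(rescore_nbest(nbest, toy_energy.get, lm_weight=0.5))  # 'i saw a cat'
```

Here the LM overturns the acoustic ranking: the homophone hypothesis wins acoustically but loses once its higher energy is weighed in.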
Knowledge-Retrieval Task-Oriented Dialog Systems with Semi-Supervision
Cai, Yucheng, Liu, Hong, Ou, Zhijian, Huang, Yi, Feng, Junlan
Most existing task-oriented dialog (TOD) systems track dialog states in terms of slots and values and use them to query a database to get relevant knowledge to generate responses. In real-life applications, user utterances are noisier, and thus it is more difficult to accurately track dialog states and correctly secure relevant knowledge. Recent progress in question answering and document-grounded dialog systems involves retrieval-augmented methods with a knowledge retriever. Inspired by such progress, we propose a retrieval-based method to enhance knowledge selection in TOD systems, which significantly outperforms the traditional database query method for real-life dialogs. Further, we develop latent variable model based semi-supervised learning, which can work with the knowledge retriever to leverage both labeled and unlabeled dialog data. The Joint Stochastic Approximation (JSA) algorithm is employed for semi-supervised model training, and the whole system is referred to as JSA-KRTOD. Experiments are conducted on a real-life dataset from China Mobile Customer-Service, called MobileCS, and show that JSA-KRTOD achieves superior performance in both labeled-only and semi-supervised settings.
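A toy contrast between the traditional database query and retrieval-based knowledge selection illustrates why retrieval helps with noisy utterances: exact matching on a mis-tracked slot value returns nothing, while similarity retrieval over the raw utterance can still find the right record. The knowledge base and overlap scoring below are hypothetical stand-ins for the paper's learned retriever.

```python
# Exact DB query (traditional TOD pipeline) vs. similarity-based retrieval.

def db_query(kb, slot, value):
    """Exact match on a tracked slot value; brittle under noisy tracking."""
    return [rec for rec in kb if rec.get(slot) == value]

def retrieve(kb, utterance, k=1):
    """Rank records by word overlap with the raw utterance."""
    words = set(utterance.lower().split())
    def score(rec):
        return len(words & set(" ".join(rec.values()).lower().split()))
    return sorted(kb, key=score, reverse=True)[:k]

kb = [{"name": "gold plan", "price": "30 yuan"},
      {"name": "family plan", "price": "50 yuan"}]
print(db_query(kb, "name", "golden plan"))          # [] -- noisy value misses
print(retrieve(kb, "tell me about the gold plan"))  # recovers the gold plan record
```

This robustness to surface noise is the motivation for replacing the hard database lookup with a (trainable) knowledge retriever.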
Persistently Trained, Diffusion-assisted Energy-based Models
Zhang, Xinwei, Tan, Zhiqiang, Ou, Zhijian
Maximum likelihood (ML) learning for energy-based models (EBMs) is challenging, partly due to non-convergence of Markov chain Monte Carlo. Several variations of ML learning have been proposed, but existing methods all fail to achieve both post-training image generation and proper density estimation. We propose to introduce diffusion data and learn a joint EBM, called diffusion-assisted EBMs, through persistent training (i.e., using persistent contrastive divergence) with an enhanced sampling algorithm to properly sample from complex, multimodal distributions. We present results from a 2D illustrative experiment and image experiments and demonstrate that, for the first time for image data, persistently trained EBMs can {\it simultaneously} achieve long-run stability, post-training image generation, and superior out-of-distribution detection.
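Persistent training (persistent contrastive divergence) can be sketched on a one-parameter EBM over {0, 1} with E(x) = -theta * x: the negative-sample chain persists across gradient steps instead of restarting from data each time. This toy setup omits the paper's diffusion data and enhanced sampler; all step counts and rates are illustrative numbers.

```python
# Minimal persistent contrastive divergence (PCD) on a 1-parameter EBM.
import math, random

def mh_step(theta, x):
    """One Metropolis flip on the persistent chain for E(x) = -theta * x."""
    x_new = 1 - x
    # Accept with min(1, exp(-(E(x_new) - E(x)))) = min(1, exp(theta*(x_new - x)))
    accept = min(1.0, math.exp(theta * (x_new - x)))
    return x_new if random.random() < accept else x

def pcd(data, steps=3000, lr=0.05, seed=0):
    random.seed(seed)
    theta, chain_x = 0.0, 0
    data_mean = sum(data) / len(data)
    for _ in range(steps):
        chain_x = mh_step(theta, chain_x)  # persistent negative sample
        # ML gradient: E_data[x] - E_model[x], with chain_x estimating the model term
        theta += lr * (data_mean - chain_x)
    return theta

theta = pcd([1, 1, 1, 1, 0] * 20)   # data mean 0.8
print(round(theta, 2))              # hovers around log(0.8/0.2), roughly 1.39
```

The key ingredient is that `chain_x` carries over between parameter updates; the non-convergence issue the abstract mentions arises when such chains fail to track the model distribution, which the diffusion-assisted construction is designed to mitigate.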