Mandarin Chinese


Unsupervised Learning and Representation of Mandarin Tonal Categories by a Generative CNN

arXiv.org Artificial Intelligence

This paper outlines the methodology for modeling tonal learning in fully unsupervised models of human language acquisition. Tonal patterns are among the computationally most complex learning objectives in language. We argue that a realistic generative model of human language (ciwGAN) can learn to associate its categorical variables with Mandarin Chinese tonal categories without any labeled data. All three trained models showed statistically significant differences in F0 across categorical variables. The model trained solely on male tokens consistently encoded tone. Our results suggest that not only does the model learn Mandarin tonal contrasts, but it learns a system that corresponds to a stage of acquisition in human language learners. We also outline methodology for tracing tonal representations in internal convolutional layers, which shows that linguistic tools can contribute to interpretability of deep learning and can ultimately be used in neural experiments.


Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models

arXiv.org Artificial Intelligence

This study explores the use of self-supervised learning (SSL) models for tone recognition in three low-resource languages from North Eastern India: Angami, Ao, and Mizo. We evaluate four Wav2vec2.0 base models that were pre-trained on both tonal and non-tonal languages. We analyze tone-wise performance across the layers for all three languages and compare the different models. Our results show that tone recognition works best for Mizo and worst for Angami. The middle layers of the SSL models are the most important for tone recognition, regardless of the pre-training language, i.e. tonal or non-tonal. We have also found that the tone inventory, tone types, and dialectal variations affect tone recognition. These findings provide useful insights into the strengths and weaknesses of SSL-based embeddings for tonal languages and highlight the potential for improving tone recognition in low-resource settings. The source code is available on GitHub.


Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

arXiv.org Artificial Intelligence

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations. With experiments conducted in multiple datasets in Mandarin Chinese and Korean, we show that PET models consistently improve speech recognition accuracy compared to conventional Transducers. Our investigation also uncovers a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones. Our analysis shows that PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one. Our implementation will be open-sourced with the NeMo toolkit.


Seamless: Multilingual Expressive and Streaming Speech Translation

arXiv.org Artificial Intelligence

Large-scale automatic speech translation systems today lack key features that help machine-mediated communication feel seamless when compared to human-to-human dialogue. In this work, we introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion. First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model: SeamlessM4T v2. This newer model, incorporating an updated UnitY2 framework, was trained on more low-resource language data. SeamlessM4T v2 provides the foundation on which our next two models are initiated. SeamlessExpressive enables translation that preserves vocal styles and prosody. Compared to previous efforts in expressive speech research, our work addresses certain underexplored aspects of prosody, such as speech rate and pauses, while also preserving the style of one's voice. As for SeamlessStreaming, our model leverages the Efficient Monotonic Multihead Attention mechanism to generate low-latency target translations without waiting for complete source utterances. As the first of its kind, SeamlessStreaming enables simultaneous speech-to-speech/text translation for multiple source and target languages. To ensure that our models can be used safely and responsibly, we implemented the first known red-teaming effort for multimodal machine translation, a system for the detection and mitigation of added toxicity, a systematic evaluation of gender bias, and an inaudible localized watermarking mechanism designed to dampen the impact of deepfakes. Consequently, we bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time. The contributions to this work are publicly released and accessible at https://github.com/facebookresearch/seamless_communication


The Impact of Familiarity on Naming Variation: A Study on Object Naming in Mandarin Chinese

arXiv.org Artificial Intelligence

Different speakers often produce different names for the same object or entity (e.g., "woman" vs. "tourist" for a female tourist). The reasons behind variation in naming are not well understood. We create a Language and Vision dataset for Mandarin Chinese that provides an average of 20 names for 1319 naturalistic images, and investigate how familiarity with a given kind of object relates to the degree of naming variation it triggers across subjects. We propose that familiarity influences naming variation in two competing ways: increasing familiarity can either expand vocabulary, leading to higher variation, or promote convergence on conventional names, thereby reducing variation. We find evidence for both factors being at play. Our study illustrates how computational resources can be used to address research questions in Cognitive Science.


Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation

arXiv.org Artificial Intelligence

Added toxicity in the context of translation refers to producing a translation output that contains more toxicity than the input. In this paper, we present MinTox, a novel pipeline that identifies and mitigates added toxicity at inference time. MinTox uses a toxicity detection classifier which is multimodal (speech and text) and works in languages at scale. The mitigation method is applied to languages at scale and directly in text outputs. MinTox is applied to SEAMLESSM4T, which is the latest multimodal and massively multilingual machine translation system. For this system, MinTox achieves significant added toxicity mitigation across domains, modalities and language directions. MinTox filters out approximately 25% to 95% of added toxicity (depending on the modality and domain) while keeping translation quality.


Back-Translation-Style Data Augmentation for Mandarin Chinese Polyphone Disambiguation

arXiv.org Artificial Intelligence

Chinese grapheme-to-phoneme (G2P) conversion plays an important role in Mandarin Chinese Text-To-Speech (TTS) systems, where one of the biggest challenges is the task of polyphone disambiguation. Most previous polyphone disambiguation models are trained on manually annotated datasets, and publicly available datasets for polyphone disambiguation are scarce. In this paper we propose a simple back-translation-style data augmentation method for Mandarin Chinese polyphone disambiguation, utilizing a large amount of unlabeled text data. Inspired by the back-translation technique proposed in the field of machine translation, we build a Grapheme-to-Phoneme (G2P) model to predict the pronunciation of polyphonic characters, and a Phoneme-to-Grapheme (P2G) model to convert pronunciation back into text. Meanwhile, a window-based matching strategy and a multi-model scoring strategy are proposed to judge the correctness of the pseudo-label. We design a data balance strategy to improve the accuracy of some typical polyphonic characters in the training set with imbalanced distribution or data scarcity. Experimental results show the effectiveness of the proposed back-translation-style data augmentation method.


Chinese Learners' Phonetic Transfer of /i/ from Mandarin Chinese to General American English: A Case Study of a Chinese Learner with Advanced English

arXiv.org Artificial Intelligence

The current paper concerns language transfer at the phonetic level and concentrates on the transfer phenomenon in an advanced English language learner's acquisition of the English vowel /i/ and its lax counterpart. By determining whether the Chinese English-language learner (ELL), named Vanya, can accurately distinguish between /i/ and its lax counterpart, and pronounce them precisely in General American English (GAE), this paper serves as a reference for further studying language transfer among Chinese ELLs. There were two objectives: first, the learner's perceptual ability to distinguish between the vowel /i/ and its lax counterpart was examined; second, the effect of the phonetic transfer was determined. Two perception tests and a production test were used to attain these two objectives. The results of the two perception tests demonstrated Vanya's perceptual competence in distinguishing between /i/ and its lax counterpart and laid a solid foundation for the validity of the subsequent production test. Given that the F1 and F2 values of /i/ in Vanya's production were highly similar across his first language (Mandarin Chinese) and second language (GAE) and that both values were lower than the typical values for common /i/ in GAE, with an especially prominent disparity between the F2 values, it is reasonable to conclude that a phonetic transfer occurred. The participant's high perceptual competence as an advanced-level ELL did not noticeably moderate the effect of phonetic transfer.


Language: People across cultures agree the word 'bouba' sounds round while 'kiki' sounds pointy

Daily Mail - Science & tech

From English to Chinese, Hungarian and Zulu, people who speak different languages make the same links between sounds and shapes, a new study shows. An international research team has conducted the largest cross-cultural test of the 'bouba-kiki effect' – the tendency to associate made-up words 'bouba' with a round shape and 'kiki' with a spiky shape. The researchers surveyed 917 speakers of 25 different languages representing nine language families and 10 writing systems. They found the effect exists independently of the language that a person speaks or the writing system that they use, whether it's the Roman alphabet (A, B, C), the Greek alphabet (alpha, beta, gamma) or Chinese characters (北, 方, 话). Such universally-meaningful vocalisations may form a global basis for the creation of new words, such as terms that circulate on social media.


Visually Grounded Reasoning across Languages and Cultures

arXiv.org Artificial Intelligence

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.