reduplication
Distinguishing Repetition Disfluency from Morphological Reduplication in Bangla ASR Transcripts: A Novel Corpus and Benchmarking Analysis
Arpa, Zaara Zabeen, Apurbo, Sadnam Sakib, Oishee, Nazia Karim Khan, Abrar, Ajwad
Automatic Speech Recognition (ASR) is now integral to digital interaction, driving applications from virtual assistants to automated subtitles on platforms like YouTube (Dykes et al. 2023). Despite its wide adoption and significant advancements, ASR performance remains imperfect, particularly in Large Vocabulary Continuous Speech Recognition (LVCSR) and under real-world conditions like background noise or diverse accents (Errattahi et al. 2018; Romana et al. 2024). This often results in a significant Word Error Rate (WER). A major, persistent source of error stems from speech disfluencies, which are interruptions in the smooth flow of speech, including filled pauses, hesitations, self-corrections, and most relevantly, repetitions (Romana et al. 2024). Disfluencies are natural and frequent; one study found a 50% probability of a disfluency in a 10-13 word sentence (Jamshid Lou and Johnson 2020b). Their presence creates noisy transcripts that are difficult to read and detrimental to downstream Natural Language Processing (NLP) tasks such as machine translation or information extraction (Romana et al. 2024; Jamshid Lou and Johnson 2020a).
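To make the link between repetitions and WER concrete, here is a minimal sketch of the conventional WER definition: substitutions, deletions, and insertions divided by the number of reference words, computed with word-level edit distance. The function name and the example sentences are illustrative assumptions, not taken from the paper or its corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a transcribed repetition ("i i") counts as one insertion.
clean_reference = "i want to go home"
disfluent_transcript = "i i want to go home"
wer = word_error_rate(clean_reference, disfluent_transcript)
```

In this toy case a single repeated word in a five-word reference is scored as one insertion, so even a short repetition is counted as a transcription error against the fluent reference.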
Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication
Ahmad, Arif, Khyathi, Mothika Gayathri, Bhattacharyya, Pushpak
Reduplication and repetition, though similar in form, serve distinct linguistic purposes. Reduplication is a deliberate morphological process used to express grammatical, semantic, or pragmatic nuances, while repetition is often unintentional and indicative of disfluency. This paper presents the first large-scale study of reduplication and repetition in speech using computational linguistics. We introduce IndicRedRep, a new publicly available dataset containing Hindi, Telugu, and Marathi text annotated with reduplication and repetition at the word level. We evaluate transformer-based models for multi-class reduplication and repetition token classification, utilizing the Reparandum-Interregnum-Repair structure to distinguish between the two phenomena. Our models achieve macro F1 scores of up to 85.62% in Hindi, 83.95% in Telugu, and 84.82% in Marathi for reduplication-repetition classification.
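The macro F1 scores reported above average per-class F1 without weighting by class frequency. A minimal sketch of that metric for word-level labels follows; the label names ("O", "REDUP", "REP") and the toy sequences are assumptions for illustration and may not match the IndicRedRep annotation scheme.

```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 over token-level labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    per_label_f1 = []
    for label in labels:
        pairs = list(zip(gold, pred))
        true_pos = sum(g == label and p == label for g, p in pairs)
        false_pos = sum(g != label and p == label for g, p in pairs)
        false_neg = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * true_pos + false_pos + false_neg
        # A label with no gold and no predicted tokens gets F1 = 0 here.
        per_label_f1.append(2 * true_pos / denominator if denominator else 0.0)
    return sum(per_label_f1) / len(per_label_f1)


# Hypothetical label set and toy sequences (not from the dataset).
LABELS = ["O", "REDUP", "REP"]
gold_tags = ["O", "REP", "REP", "O", "REDUP", "REDUP"]
pred_tags = ["O", "REP", "O", "O", "REDUP", "REDUP"]
score = macro_f1(gold_tags, pred_tags, LABELS)
```

Because each class contributes equally, a model that handles the frequent "O" class well but confuses repetition with reduplication is penalized more than it would be under accuracy.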
Using Model-Theoretic Approaches to Uncover Linguistic Organization
Various scholars have proposed the idea that there are different ways for a form-meaning pairing to be iconic, and that these different types of iconicity may interact with one another (Buchler, 1986; Reiger, 1998; Rozhansky, 2015). As a way of formalizing this idea, Lǐ and Ponsford (2018) identify five features pertaining to the form of fully reduplicated words that are in an iconic relationship with some aspect of a meaning that was found to be marked by total reduplication. Based on these formal features, they propose the following five dimensions of iconicity ('iconicities' in Lǐ and Ponsford (2018)) that can be manifested by reduplication patterns: (1) Balinese pluractional markers: keplug 'explode'; keplug~keplug 'explode repeatedly'; pa-keplug 'X (plural) explode simultaneously' (Arka and Dalrymple, 2017). Notice that the repeated-explosion event is marked by a form that repeats keplug, while the event where all of the explosions happen at once (no repetition) is marked by a form that does not involve any repetition. Viewed through this lens, the Balinese pluractional prefix pa- is not entirely arbitrary, because it highlights the distinction between two types of pluractionality that are marked in Balinese. This is a case of iconicity because a property of the form (repetition or non-repetition) is also a property of the associated meaning. In this paper, we consider pluractional markers in Kaqchikel, Karuk, and Yurok.
More than Just Statistical Recurrence: Human and Machine Unsupervised Learning of Māori Word Segmentation across Morphological Processes
Varatharaj, Ashvini, Todd, Simon
Non-Māori-speaking New Zealanders (NMS) are able to segment Māori words in a highly similar way to fluent speakers (Panther et al., 2024). This ability is assumed to derive through the identification and extraction of statistically recurrent forms. We examine this assumption by asking how NMS segmentations compare to those produced by Morfessor, an unsupervised machine learning model that operates based on statistical recurrence, across words formed by a variety of morphological processes. Both NMS and Morfessor succeed in segmenting words formed by concatenative processes (compounding and affixation without allomorphy), but NMS also succeed for words that invoke templates (reduplication and allomorphy) and other cues to morphological structure, implying that their learning process is sensitive to more than just statistical recurrence.
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks
Beguš, Gašper, Lu, Thomas, Wang, Zili
Computational models of syntax are predominantly text-based. Here we propose that basic syntax can be modeled directly from raw speech in a fully unsupervised way. We focus on one of the most ubiquitous and basic properties of syntax -- concatenation. We introduce spontaneous concatenation: a phenomenon where convolutional neural networks (CNNs) trained on acoustic recordings of individual words start generating outputs with two or even three words concatenated without ever accessing data with multiple words in the input. Additionally, networks trained on two words learn to embed words into novel unobserved word combinations. To our knowledge, this is a previously unreported property of CNNs trained on raw speech in the Generative Adversarial Network setting and has implications both for our understanding of how these architectures learn as well as for modeling syntax and its evolution from raw acoustic inputs.
Interpreting intermediate convolutional layers of CNNs trained on raw speech
This paper presents a technique to interpret and visualize intermediate layers in CNNs trained on raw speech data in an unsupervised manner. We show that averaging over feature maps after ReLU activation in each convolutional layer yields interpretable time-series data. The proposed technique enables acoustic analysis of intermediate convolutional layers. To uncover how meaningful representation in speech gets encoded in intermediate layers of CNNs, we manipulate individual latent variables to marginal levels outside of the training range. We train and probe internal representations on two models -- a bare WaveGAN architecture and a ciwGAN extension which forces the Generator to output informative data and results in the emergence of linguistically meaningful representations. Interpretation and visualization are performed for three basic acoustic properties of speech: periodic vibration (corresponding to vowels), aperiodic noise vibration (corresponding to fricatives), and silence (corresponding to stops). We also argue that the proposed technique allows acoustic analysis of intermediate layers that parallels the acoustic analysis of human speech data: we can extract F0, intensity, duration, formants, and other acoustic properties from intermediate layers in order to test where and how CNNs encode various types of information. The models are trained on two speech processes with different degrees of complexity: a simple presence of [s] and a computationally complex presence of reduplication (copied material). Observing the causal effect between interpolation and the resulting changes in intermediate layers can reveal how individual variables get transformed into spikes in activation in intermediate layers. Using the proposed technique, we can analyze how linguistically meaningful units in speech get encoded in different convolutional layers.
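The core operation described above, averaging over feature maps after ReLU to obtain a time series, can be sketched in a few lines. This is only an illustration under stated assumptions: the layer output is represented as a plain list of channels, each a list of activations over time, and the function name and toy values are hypothetical rather than the authors' implementation.

```python
def relu_channel_average(feature_maps):
    """Apply ReLU to every activation, then average across channels per time step.

    feature_maps: list of channels, each a list of floats of equal length
    (one value per time step). Returns one averaged value per time step.
    """
    if not feature_maps:
        raise ValueError("feature_maps must contain at least one channel")
    n_steps = len(feature_maps[0])
    if any(len(channel) != n_steps for channel in feature_maps):
        raise ValueError("all channels must have the same number of time steps")

    n_channels = len(feature_maps)
    return [
        sum(max(0.0, channel[t]) for channel in feature_maps) / n_channels
        for t in range(n_steps)
    ]


# Toy layer output: 2 channels x 3 time steps (made-up values).
toy_layer = [
    [1.0, -2.0, 3.0],
    [-1.0, 4.0, 1.0],
]
time_series = relu_channel_average(toy_layer)
```

The resulting list has one value per time step, so it can be plotted or analyzed like a waveform-aligned signal, which is what makes the acoustic-style analysis described in the abstract possible.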
Identity-Based Patterns in Deep Convolutional Networks: Generative Adversarial Phonology and Reduplication
Identity-based patterns, for which a computational model needs to output some feature together with a copy of that feature, are computationally challenging, but pose no problems to human learners and are common in the world's languages. In this paper, we test whether a neural network can learn an identity-based pattern in speech called reduplication. To our knowledge, this is the first attempt to test identity-based patterns in deep convolutional networks trained on raw continuous data. Unlike existing proposals, we test learning in an unsupervised manner and we train the network on raw acoustic data. We use the ciwGAN architecture (Beguš 2020; arXiv:2006.02951) in which learning of meaningful representations in speech emerges from a requirement that the deep convolutional network generates informative data. Based on four generative tests, we argue that a deep convolutional network learns to represent an identity-based pattern in its latent space; by manipulating only two categorical variables in the latent space, we can actively turn an unreduplicated form into a reduplicated form with no other changes to the output in the majority of cases. We also argue that the network extends the identity-based pattern to unobserved data: when reduplication is forced in the output with the proposed technique for latent space manipulation, the network generates reduplicated data (e.g., it copies an [s] in [si-siju] for [siju], although it never sees any reduplicated forms containing an [s] in the input). Comparison with human outputs of reduplication shows a high degree of similarity. Exploration of how meaningful representations of identity-based patterns emerge and how the latent space variables outside of the training range correlate with identity-based patterns in the output has general implications for neural network interpretability.
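For readers unfamiliar with the pattern, the [siju] to [si-siju] mapping can be written as a symbolic copy operation. The sketch below is a string-level illustration only: it assumes the copied material is everything up to and including the first vowel, inferred from the single example in the abstract, and it says nothing about how the network represents the pattern acoustically. The function name and vowel set are hypothetical.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this toy example


def cv_reduplicate(form: str) -> str:
    """Prefix a copy of the initial consonant(s) plus first vowel to the form."""
    for index, segment in enumerate(form):
        if segment in VOWELS:
            copied_material = form[: index + 1]
            return f"{copied_material}-{form}"
    raise ValueError("form contains no vowel to anchor the copy")


reduplicated = cv_reduplicate("siju")
```

The output depends on the identity of the input segments rather than on any fixed affix, which is the property that makes such patterns hard for models that do not represent copying.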