Goto

Collaborating Authors

 syllable


SAND Challenge: Four Approaches for Dysartria Severity Classification

Deshpande, Gauri, Battula, Harish, Panda, Ashish, Kopparapu, Sunil Kumar

arXiv.org Artificial Intelligence

This paper presents a unified study of four distinct modeling approaches for classifying dysarthria severity in the Speech Analysis for Neurodegenerative Diseases (SAND) challenge. All models tackle the same five class classification task using a common dataset of speech recordings. We investigate: (1) a ViT-OF method leveraging a Vision Transformer on spectrogram images, (2) a 1D-CNN approach using eight 1-D CNN's with majority-vote fusion, (3) a BiLSTM-OF approach using nine BiLSTM models with majority vote fusion, and (4) a Hierarchical XGBoost ensemble that combines glottal and formant features through a two stage learning framework. Each method is described, and their performances on a validation set of 53 speakers are compared. Results show that while the feature-engineered XGBoost ensemble achieves the highest macro-F1 (0.86), the deep learning models (ViT, CNN, BiLSTM) attain competitive F1-scores (0.70) and offer complementary insights into the problem.


Lips-Jaw and Tongue-Jaw Articulatory Tradeoff in DYNARTmo

Kröger, Bernd J.

arXiv.org Artificial Intelligence

This paper investigates how the dynamic articulatory model DYNARTmo accounts for articulatory tradeoffs between primary and secondary articulators, with a focus on lips-jaw and tongue-jaw coordination. While DYNARTmo does not implement full task-dynamic second-order biomechanics, it adopts first-order task-space gesture specifications comparable to those used in articulatory phonology and integrates a simplified mechanism for distributing articulatory effort across multiple articulators. We first outline the conceptual relationship between task dynamics and DYNARTmo, emphasizing the distinction between high-level task-space trajectories and their low-level articulatory execution. We then present simulation results for a set of CV syllables that illustrate how jaw displacement varies as a function of both place of articulation (labial, apical, dorsal) and vowel context (/a/, /i/, /u/). The model reproduces empirically attested patterns of articulatory synergy, including jaw-supported apical closures, lower-lip elevation in bilabial stops, tongue-jaw co-movement, and saturation effects in labial constrictions. These results demonstrate that even with computationally simplified assumptions, DYNARTmo can generate realistic spatio-temporal movement patterns that capture key aspects of articulatory tradeoff and synergy across a range of consonant-vowel combinations.


The probabilistic framework we developed is intended to provide a set of tools for understanding the relationship

Neural Information Processing Systems

We would like to thank the reviewers for their thoughtful comments and questions. We address reviewer concerns for these four tasks separately. We agree with this observation, and note that the use of the CAE is not critical to downstream analyses. ARHMM as a prior model of behavior which is then incorporated into the Bayesian decoder. Though we do not further pursue this variability in the manuscript (e.g.


Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

Ghaffari, Houtan, Rauch, Lukas, Devos, Paul

arXiv.org Artificial Intelligence

Many bioacoustics, neuroscience, and linguistics research utilize birdsongs as proxy models to acquire knowledge in diverse areas. Developing models generally requires precisely annotated data at the level of syllables. Hence, automated and data-efficient methods that reduce annotation costs are in demand. This work presents a lightweight, yet performant neural network architecture for birdsong annotation called Residual-MLP-RNN. Then, it presents a robust three-stage training pipeline for developing reliable deep birdsong syllable detectors with minimal expert labor. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentations to create a robust model for frame-level syllable detection. The third stage is semi-supervised post-training, which leverages the unlabeled data again. However, unlike the initial phase, this time it is aligned with the downstream task. The performance of this data-efficient approach is demonstrated for the complex song of the Canary in extreme label-scarcity scenarios. Canary has one of the most difficult songs to annotate, which implicitly validates the method for other birds. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.



Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition

Wang, Chao, Cai, Yuqing, Duojie, Renzeng, Zhang, Jin, Liu, Yutong, Tashi, Nyima

arXiv.org Artificial Intelligence

ABSTRACT In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Atten-tion architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.


The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures

Kröger, Bernd J.

arXiv.org Artificial Intelligence

The neural generation and control of speech utterances is a complex process that is still not fully understood. However, several neurobiologically inspired models have been proposed that describe the hierarchical control concept of utterance generation (e.g., Hickok and Poeppel (2012); Bohland et al. (2010); Kröger et al. (2020); Parrell et al. (2018)). This process begins with the neural activation of the cognitive-linguistic representation of an utterance, followed by a higher-level premotor representation, leading to neuromuscular activation patterns, and finally to the articulatory-acoustic realization of the utterance (cf.


Real-time and Zero-footprint Bag of Synthetic Syllables Algorithm for E-mail Spam Detection Using Subject Line and Short Text Fields

Selitskiy, Stanislav

arXiv.org Artificial Intelligence

Contemporary e-mail services have high availability expectations from the customers and are resource-strained because of the high-volume throughput and spam attacks. Deep Machine Learning architectures, which are resource hungry and require off-line processing due to the long processing times, are not acceptable at the front line filters. On the other hand, the bulk of the incoming spam is not sophisticated enough to bypass even the simplest algorithms. While the small fraction of the intelligent, highly mutable spam can be detected only by the deep architectures, the stress on them can be unloaded by the simple near real-time and near zero-footprint algorithms such as the Bag of Synthetic Syllables algorithm applied to the short texts of the e-mail subject lines and other short text fields. The proposed algorithm creates a circa 200 sparse dimensional hash or vector for each e-mail subject line that can be compared for the cosine or euclidean proximity distance to find similarities to the known spammy subjects. The algorithm does not require any persistent storage, dictionaries, additional hardware upgrades or software packages. The performance of the algorithm is presented on the one day of the real SMTP traffic.


IASC: Interactive Agentic System for ConLangs

Taguchi, Chihiro, Sproat, Richard

arXiv.org Artificial Intelligence

We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. https://github.com/SakanaAI/IASC


Declarative Techniques for NL Queries over Heterogeneous Data

Khabiri, Elham, Kephart, Jeffrey O., Heath, Fenno F. III, Jayaraman, Srideepika, Tipu, Fateh A., Li, Yingjie, Shah, Dhruv, Fokoue, Achille, Bhamidipaty, Anu

arXiv.org Artificial Intelligence

In many industrial settings, users wish to ask questions in natural language, the answers to which require assembling information from diverse structured data sources. With the advent of Large Language Models (LLMs), applications can now translate natural language questions into a set of API calls or database calls, execute them, and combine the results into an appropriate natural language response. However, these applications remain impractical in realistic industrial settings because they do not cope with the data source heterogeneity that typifies such environments. In this work, we simulate the heterogeneity of real industry settings by introducing two extensions of the popular Spider benchmark dataset that require a combination of database and API calls. Then, we introduce a declarative approach to handling such data heterogeneity and demonstrate that it copes with data source heterogeneity significantly better than state-of-the-art LLM-based agentic or imperative code generation systems. Our augmented benchmarks are available to the research community.