AITopics

Country: Asia (0.28)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing Systems, Feb-19-2026, 10:54:39 GMT

ea159dc9788ffac311592613b7f71fbb-Supplemental.pdf

phoneme, text data, unlabeled text data, (17 more...)

Country: Europe > Germany > Saxony > Leipzig (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing Systems, Feb-19-2026, 05:47:34 GMT

8a9c8ac001d3ef9e4ce39b1177295e03-Paper.pdf

Dubbing is a post-production process of re-recording actors' dialogue, which is extensively used in filmmaking and video production.

artificial intelligence, machine learning, speech, (18 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Neural Information Processing Systems, Feb-17-2026, 17:27:49 GMT

SSDM: Scalable Speech Dysfluency Modeling

However, there are three challenges.

large language model, machine learning, natural language, (20 more...)

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Netherlands > Gelderland > Nijmegen (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(6 more...)

Genre: Research Report > Experimental Study (0.93)

Industry:

Information Technology (0.92)
Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Neural Information Processing Systems, Feb-14-2026, 17:45:37 GMT

Untangling in Invariant Speech Recognition

Cory Stephenson, Jenelle Feather, Suchismita Padhy, Oguz Elibol, Hanlin Tang, Josh McDermott, SueYeon Chung

Meanwhile, deep neural networks have also achieved impressive performance in audio processing applications, both as sub-components of larger systems and as complete end-to-end systems by themselves.

artificial intelligence, machine learning, manifold, (19 more...)

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.05)
South America > Paraguay > Asunción > Asunción (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Neural Information Processing Systems, Feb-11-2026, 10:54:52 GMT

0cbed40c0d920b94126eaf5e707be1f5-AuthorFeedback.pdf

inference, phoneme, prediction, (14 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.30)

arXiv.org Artificial Intelligence, Dec-8-2025

Decoding inner speech with an end-to-end brain-to-text neural interface

Zhang, Yizi, He, Linyang, Fan, Chaofei, Liu, Tingkai, Yu, Han, Le, Trung, Li, Jingyuan, Linderman, Scott, Duncker, Lea, Willett, Francis R, Mesgarani, Nima, Paninski, Liam

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
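The word error rate (WER) quoted in the abstract above is conventionally defined as the word-level edit distance between a reference and a decoded sentence, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, the reported drop from 24.69% to 10.22% is a relative WER reduction of roughly 58.6%, since (24.69 - 10.22) / 24.69 ≈ 0.586.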

large language model, machine learning, natural language, (19 more...)

2511.2174

Country: North America > United States (0.47)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Gafos, Adamantios I., Kuberski, Stephan R.

Limit cycles for speech

arXiv.org Artificial Intelligence, Dec-5-2025

Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive -- but no less principled than the conventional -- representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.

artificial intelligence, oscillator, speech, (15 more...)

2512.04642

Country: Europe > Germany (0.29)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology: Information Technology > Artificial Intelligence (1.00)

Kucukmanisa, Ayhan, Gelmez, Derya, Calik, Sukru Selim, Kilimci, Zeynep Hilal

Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition

arXiv.org Artificial Intelligence, Nov-24-2025

Recent advances in multimodal deep learning have greatly enhanced the capability of systems for speech analysis and pronunciation assessment. Accurate pronunciation detection remains a key challenge in Arabic, particularly in the context of Quranic recitation, where subtle phonetic differences can alter meaning. Addressing this challenge, the present study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection that combines acoustic and textual representations to achieve higher precision and robustness. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions, creating a unified representation that captures both phonetic detail and linguistic context. To determine the most effective integration strategy, early, intermediate, and late fusion methods were implemented and evaluated on two datasets containing 29 Arabic phonemes, including eight hafiz sounds, articulated by 11 native speakers. Additional speech samples collected from publicly available YouTube recordings were incorporated to enhance data diversity and generalization. Model performance was assessed using standard evaluation metrics: accuracy, precision, recall, and F1-score, allowing a detailed comparison of the fusion strategies. Experimental findings show that the UniSpeech-BERT multimodal configuration provides strong results and that fusion-based transformer architectures are effective for phoneme-level mispronunciation detection. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems, offering a practical step toward technology-supported Quranic pronunciation training and broader speech-based educational applications.
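The four evaluation metrics named in the abstract above have standard definitions for a binary detection task. The sketch below assumes labels of 1 for "mispronounced" and 0 for "correct"; that encoding, the function name `binary_metrics`, and the toy labels are illustrative assumptions and are not taken from the study.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against empty denominators by reporting 0.0 in those cases.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    print(binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

Reporting precision and recall alongside accuracy matters when mispronounced samples are rare, because accuracy alone can look high for a detector that seldom flags anything.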

artificial intelligence, machine learning, natural language, (15 more...)

2511.17477

Genre: Research Report > New Finding (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Loakman, Tyler, James, Joseph, Lin, Chenghua

Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

arXiv.org Artificial Intelligence, Nov-18-2025

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
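The abstract above states that the 3 distractor transcriptions are chosen by phonemic edit distance to the ground truth. One way this could work is sketched below, under the assumption that the closest candidates are preferred; the abstract does not give the exact selection rule. The toy lexicon, its ARPAbet-style symbols, and the names `edit_distance` and `pick_distractors` are all hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, y in enumerate(b, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            )
        prev = curr
    return prev[-1]


def pick_distractors(target: str, lexicon: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Return the k words whose phonemes are closest to the target's.

    Ties are broken alphabetically so the result is deterministic.
    """
    target_phones = lexicon[target]
    candidates = [word for word in lexicon if word != target]
    candidates.sort(key=lambda w: (edit_distance(target_phones, lexicon[w]), w))
    return candidates[:k]


# Toy lexicon invented for illustration; not the benchmark's data.
TOY_LEXICON = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
    "catalog": ["K", "AE", "T", "AH", "L", "AO", "G"],
}

if __name__ == "__main__":
    print(pick_distractors("cat", TOY_LEXICON))
```

With four options per item (one correct transcription plus 3 distractors), chance performance on the multiple-choice task is 25%.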

computational linguistic, large language model, natural language, (15 more...)

2511.13225

Country:

Asia (1.00)
North America > United States (0.47)
Europe > United Kingdom > England (0.28)

Genre:

Overview (0.47)
Research Report (0.40)

Industry: Education (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)