transcription
Model Details
We decreased the confidence threshold to 0.1 to increase article and headline detections. The following specifications were used: { resolution: 256, learning rate: 2e-3 }. This limit is binding for common words, e.g., "the". The recognizer is a MobileNetV3 (Small) model, developed in [2] for character recognition, with an encoder pre-trained on ImageNet1k and sourced from the timm library [19]. It is trained with the Supervised Contrastive ("SupCon") loss function [7]; in particular, we use the "outside" SupCon loss formulation with a temperature of 0.1. Inputs are center-cropped to avoid destroying too much information. If multiple article bounding boxes satisfy these rules for a given headline, then we take the highest.
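As a minimal sketch of the "outside" SupCon formulation referenced above (L_out in Khosla et al. [7], where the average over positives sits outside the log), computed on toy embeddings with the stated temperature of 0.1 — the example vectors and labels are illustrative, not from the paper:

```python
import math

def supcon_outside_loss(embeddings, labels, temperature=0.1):
    """SupCon loss, 'outside' formulation: the 1/|P(i)| average over
    positives is applied outside the log term."""
    n = len(embeddings)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    total = 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        # Denominator sums over all anchors except i itself.
        denom = sum(math.exp(dot(embeddings[i], embeddings[a]) / temperature)
                    for a in range(n) if a != i)
        inner = sum(math.log(math.exp(dot(embeddings[i], embeddings[p]) / temperature) / denom)
                    for p in positives)
        total += -inner / len(positives)
    return total / n

# Two classes of unit embeddings: same-class pairs identical, cross-class orthogonal.
z = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
y = [0, 0, 1, 1]
loss = supcon_outside_loss(z, y)  # near zero: positives already dominate the softmax
```

With the labels shuffled so that positives are orthogonal and negatives identical, the same function returns a much larger loss, which is the gradient signal the contrastive objective exploits.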
- North America > United States (0.14)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Law (1.00)
- Information Technology (1.00)
- Government (1.00)
- Asia > China > Hong Kong (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- (9 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Education (1.00)
4 Best AI Notetakers (2026), Tested and Reviewed
A growing collection of pocket-sized gadgets lets you record your meetings and extract value from them. Whether in class, a meeting, or an interview, I've never been fond of taking notes, and I'm far from alone. Not only does the process of scribbling something down cause me to miss what was said immediately after, but I also suffer from awful handwriting, meaning that I can rarely read the notes anyway. Recording interviews has long been a solution, but transcribing interviews is another step (with extra cost) that can leave you with thousands of words of material to sift through, much of it irrelevant. AI notetakers, massively popular at CES 2026, have emerged to offer a new way of making IRL notetaking easier and faster, putting the power of AI into (or at least adjacent to) a portable device that evokes the microcassette recorder of yesteryear.
- North America > United States > California (0.04)
- Europe > Slovakia (0.04)
- Europe > Czechia (0.04)
- Asia > China (0.04)
REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data. In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. REBORN alternates between (1) training a segmentation model that predicts the boundaries of the segmental structures in speech signals and (2) training the phoneme prediction model, whose input is a segmental structure segmented by the segmentation model, to predict a phoneme transcription. Since supervised data for training the segmentation model is not available, we use reinforcement learning to train the segmentation model to favor segmentations that yield phoneme sequence predictions with a lower perplexity. We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech. We comprehensively analyze why the boundaries learned by REBORN improve the unsupervised ASR performance.
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription.
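The distinction the abstract draws — rescoring can only pick an existing hypothesis, while generative correction can produce a transcription outside the N-best list — can be illustrated with a toy example. The majority-vote "correction" below is a crude ROVER-style stand-in for an LLM corrector, and the scoring function is invented for the demo:

```python
from collections import Counter

def rescore(nbest, lm_score):
    """Traditional LM rescoring: constrained to return one of the N-best."""
    return max(nbest, key=lm_score)

def generative_correction(nbest):
    """Stand-in for LLM correction: position-wise majority vote over the
    N-best tokens, which can yield a string in none of the hypotheses."""
    split = [h.split() for h in nbest]
    out = []
    for i in range(max(len(s) for s in split)):
        votes = Counter(s[i] for s in split if i < len(s))
        out.append(votes.most_common(1)[0][0])
    return " ".join(out)

nbest = ["the cat sat on a mat",
         "the bat sat on the mat",
         "the cat sad on the mat"]
# Toy scorer penalizing implausible words; picks the first hypothesis, errors and all.
picked = rescore(nbest, lambda h: -h.count("sad") - h.count("bat"))
corrected = generative_correction(nbest)   # "the cat sat on the mat", not in nbest
```

Each hypothesis contains one error, but the votes across the list recover the correct transcription — the "informative elements" the benchmark hands to the LLM.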
Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual Speech Recognition Evaluation
Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Adel Moumen, Sanchit Gandhi
Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including a dedicated multilingual track. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
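The two headline metrics of the leaderboard — word error rate and inverse real-time factor — are straightforward to define; a minimal self-contained implementation (standard word-level Levenshtein distance; the example numbers are illustrative):

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

def rtfx(audio_seconds, processing_seconds):
    """Inverse real-time factor: seconds of audio transcribed per second
    of compute. Higher is faster."""
    return audio_seconds / processing_seconds

w = wer("the quick brown fox", "the quick brown fox jumps")  # 1 insertion / 4 words = 0.25
speed = rtfx(3600.0, 90.0)  # one hour of audio in 90 s -> RTFx of 40
```

The accuracy-efficiency trade-off the abstract describes is exactly a comparison along these two axes: LLM-decoder systems lower `wer` while CTC/TDT decoders raise `rtfx`.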
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
KidSpeak: A General Multi-purpose LLM for Kids' Speech Recognition and Screening
Rohan Sharma, Dancheng Liu, Jingchen Sun, Shijie Zhou, Jiayu Qin, Jinjun Xiong, Changyou Chen
With the rapid advancement of conversational and diffusion-based AI, there is a growing adoption of AI in educational services, ranging from grading and assessment tools to personalized learning systems that provide targeted support for students. However, this adaptability has yet to fully extend to the domain of children's speech, where existing models often fail due to their reliance on datasets designed for clear, articulate adult speech. Children, particularly those in early developmental stages or with speech and language pathologies, present unique challenges that current AI models and datasets are ill-equipped to handle. To address this, we introduce KidSpeak, a multi-task speech-enhanced Foundation Model capable of both generative and discriminative tasks specifically tailored to children's speech patterns. Our framework employs a two-stage training process that incorporates phonetic knowledge into the speech encoder, achieving an average accuracy of 87% across four separate tasks. Furthermore, recognizing the limitations of scalable human annotation and existing speech alignment tools, we propose the Flexible and Automatic Speech Aligner (FASA) and leverage the method to construct high-quality datasets for training and evaluation. This novel alignment tool significantly improves the quality of aligned children's speech from noisy data, enhancing data quality by 13.6 compared to human annotations, as demonstrated on the CHILDES dataset. To the best of our knowledge, KidSpeak and FASA represent the first comprehensive solution designed for speech and language therapy in children, offering both a multi-purpose speech LLM and a robust alignment tool.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Waltham (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area (0.46)
- Education > Educational Technology > Educational Software (0.34)
Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR
This paper investigates the impact of using morphologically-informed tokenizers to aid and streamline the interlinear gloss annotation of an audio corpus of Yoloxóchitl Mixtec (YM) using a combination of ASR and text-based sequence-to-sequence tools, with the goal of improving efficiency while reducing the workload of a human annotator. We present two novel tokenization schemes that separate words in a nonlinear manner, preserving information about tonal morphology as much as possible. One of these approaches, a Segment and Melody tokenizer, simply extracts the tones without predicting segmentation. The other, a Sequence of Processes tokenizer, predicts segmentation for the words, which could allow an end-to-end ASR system to produce segmented and unsegmented transcriptions in a single pass. We find that these novel tokenizers are competitive with BPE and Unigram models, and the Segment-and-Melody model outperforms traditional tokenizers in terms of word error rate but does not reach the same character error rate. In addition, we analyze tokenizers on morphological and information-theoretic metrics to find predictive correlations with downstream performance. Our results suggest that nonlinear tokenizers designed specifically for the non-concatenative morphology of a language are competitive with conventional BPE and Unigram models for ASR. Further research will be necessary to determine the applicability of these tokenizers in downstream processing tasks.
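The "Segment and Melody" idea — splitting a word into a segmental tier and a tonal melody, rather than left-to-right substrings — can be sketched on a toy orthography with inline tone digits. The spelling convention below is an illustrative assumption, not the paper's actual Yoloxóchitl Mixtec transcription scheme:

```python
def segment_and_melody(word):
    """Toy nonlinear tokenization: separate the segmental tier (consonants
    and vowels) from the tonal melody (tone digits), so that tonal
    morphology is preserved as its own token stream."""
    segments = "".join(ch for ch in word if not ch.isdigit())
    melody = "".join(ch for ch in word if ch.isdigit())
    return segments, melody

# Hypothetical word 'ka3ta14' with tones written as inline digits.
seg, mel = segment_and_melody("ka3ta14")
# seg == "kata" (segments), mel == "314" (melody)
```

A BPE or Unigram tokenizer would instead emit contiguous substrings like `ka3` and `ta14`, interleaving tone with segmental material — which is exactly the linearity assumption the paper's tokenizers are designed to avoid.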
- Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- Europe > Germany > Saxony > Leipzig (0.04)
- Asia > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)