Speech Recognition
Grok AI will be in Teslas as of next week, Elon Musk confirms
Hours after launching Grok 4, Elon Musk confirmed that the chatbot would soon be in Tesla vehicles. The Tesla CEO and founder of xAI shared this on X after the Grok 4 livestream. "Grok is coming to Tesla vehicles very soon. Next week at the latest," posted Musk. Musk's confirmation followed a late-night livestream announcing xAI's Grok 4, where he claimed the new AI model was smarter than "almost all graduate students, in all disciplines, simultaneously," and could "discover new physics." With all those accolades you never would have guessed that Musk's companies were having a rough week, starting with Grok's anti-semitic rant.
ElevenLabs' new AI voice assistant can automate your favorite tasks - and you can try it for free
AI assistants are cropping up faster than we can keep track of them, promising to automate our mundane daily tasks and skyrocket productivity. AI audio company ElevenLabs wants to take those promises a step further: from words to actions. Also: How AI can save us from our'infinite' workdays, according to Microsoft On Monday, the company launched 11ai, a voice assistant powered by the company's library of more than 5,000 voices. What sets this assistant apart, ElevenLabs claims, is its integration with Anthropic's Model Context Protocol (MCP), which is quickly becoming an industry standard for seamlessly connecting AI systems, especially agents, with proprietary data. MCP has even been adopted by Anthropic competitors, including Meta, Google, and OpenAI.
Garmin Forerunner 970 review: the new benchmark for running watches
Garmin's new top running watch, the Forerunner 970, has very big shoes to fill as it attempts to replace one of the best training and race companions available. Can a built-in torch, a software revamp and voice control really make a difference? The Guardian's journalism is independent. We will earn a commission if you buy something through an affiliate link. The new top-of-the-line Forerunner takes the body of the outgoing Forerunner 965 and squeezes in a much brighter display, useful new running analytics and more of the advanced tech from Garmin's flagship adventure watch the Fenix 8. These upgrades come at a steep cost of 630 ( 750/ 750/A 1,399) โ 30 more than its predecessor โ placing it right at the top of the running and triathlon watch pile, although less than the 780 Fenix 8.
A Experiment Details
A.1 Data Table A1 summarizes the datasets used in this paper, which are all licensed under CC BY-NC-ND or CC BY and have been used extensively by the research communities. Speech datasets are sourced from interviews, TED talks, and audiobooks, which are not expected to contain offensive content. A.3 Speech recognition decoding Beam search decoding is used with a length weight ฮฑ, which searches for the hypothesis z For the results in the main paper, we do grid search from beam size {1, 5, 10, 15, 20, 25} and ฮฑ {0, 0.5, 1.0, 1.5}. For the ablation studies in the appendix a beam size of 10 and ฮฑ = 1.0 is used. B.1 Impact of fine-tuning hyperparameters We conduct the ablation studies in this section with the models pre-trained on multimodal LRS3 and VC2-En.
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Matthew Le Bowen Shi Brian Karrer
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization.
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers, and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features which are given to a classifier that is trained on frame classification into phones.
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors to different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and quantitatively, its ability to outperform an i-vector baseline for speaker verification and reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
Research in auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR, respectively) has traditionally been conducted independently. Even recent self-supervised studies addressing two or all three tasks simultaneously tend to yield separate models, leading to disjoint inference pipelines with increased memory requirements and redundancies. This paper proposes unified training strategies for these systems. We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance, overcoming typical optimisation challenges when training from scratch. Moreover, we introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples, addressing shortcomings in related self-supervised methods. Finally, we develop a self-supervised pretraining method within our framework, proving its effectiveness alongside our semi-supervised approach. Despite using a single model for all tasks, our unified approach achieves state-of-the-art performance compared to recent methods on LRS3 and LRS2 for ASR, VSR, and AVSR, as well as on the newly released WildVSR dataset. Code and models are available at https://github.com/
Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR
Unsupervised automatic speech recognition (ASR) aims to learn the mapping between the speech signal and its corresponding textual transcription without the supervision of paired speech-text data. A word/phoneme in the speech signal is represented by a segment of speech signal with variable length and unknown boundary, and this segmental structure makes learning the mapping between speech and text challenging, especially without paired data.