Speech Recognition
Microsoft will let you clone your voice for Teams calls, powered by AI
Microsoft Teams users will soon be able to use cloned versions of their voices to speak and translate conversations in real time, as the company unveils its new, AI-powered Interpreter tool. Announced at the annual Microsoft Ignite conference and reported by TechCrunch, the new feature allows users to create digital replicas of their voices that can then be used to translate their speech into various languages. "Imagine being able to sound just like you in a different language. Interpreter in Teams provides real-time speech-to-speech translation during meetings, and you can opt to have it simulate your speaking voice for a more personal and engaging experience," wrote Microsoft CMO Jared Spataro in a blog post shared with the publication. The feature will only be available to Microsoft 365 subscribers, and will launch initially for English, French, German, Italian, Japanese, Korean, Portuguese, Mandarin Chinese, and Spanish. Microsoft's Interpreter has the potential to make the business of remote work and digital socialization more accessible to a wider array of non-English speakers, though it's not yet as dynamic as a live human translator.
Google now offers a standalone Gemini app on iPhone
Google now offers a dedicated Gemini AI app on iPhone. First spotted by MacRumors, the free app is available to download in Australia, India, the US, and the UK following a soft launch in the Philippines earlier this week. Before today, iPhone users could access Gemini only through the Google app, which came with some notable limitations. The dedicated app, for instance, includes Google's Gemini Live feature, which lets users interact with the AI agent from their iPhone's Dynamic Island and Lock Screen. As a result, you don't need to have the app open on your phone's screen to use Gemini.
Avoiding Siri slipups and apologies for butt dials
Voice assistants may cause confusion across devices. Tech expert Kurt Knutsson offers some solutions to fix it. When it comes to using voice assistants across multiple devices, things can get a bit tricky. "Mike" from St. George, Utah, found himself in a comical yet frustrating situation with his personal and work iPhones. Let's dive into his predicament and explore some solutions.
This Qualcomm-Google partnership may give us the in-car voice assistants we've been waiting for
If there's one thing we've learned over the past year, it's that generative AI is playing a pivotal role in technological advancements, including in a place where people spend much of their day: the automobile. Today, Google and Qualcomm officially partnered to leverage their chips and generative AI capabilities to help developers create richer automotive experiences. On Tuesday, at the chipmaker's Snapdragon Summit, Qualcomm announced a multi-year collaboration with Google that will utilize both companies' latest technologies, including the Snapdragon Digital Chassis, Android Automotive OS, and Google Cloud, to advance digital transformation in cars and develop generative AI-enabled digital cockpits. According to the press release, Google will bring its AI expertise to the collaboration, allowing for the development of intuitive generative AI experiences that anticipate users' needs, such as more advanced voice assistants and immersive mapping, making driving as a whole less burdensome. Meanwhile, Qualcomm will supply its Snapdragon heterogeneous edge AI systems-on-chip (SoCs) and its Qualcomm AI Hub, a platform developers can use to deploy and manage AI models on Qualcomm-powered devices to run these experiences and the underlying vision, audio, and speech models.
Star Temporal Classification: Sequence Modeling with Partially Labeled Data
We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition.
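The core idea lends itself to a small dynamic-programming illustration. Below is a toy numpy sketch, not the WFST/GTN implementation from the paper: a special star position in a partial target accepts any vocabulary token, so alignments with missing labels still receive probability mass. The `STAR` token name, the absence of a blank symbol, the one-frame-per-known-token constraint, and the requirement that the star consume at least one frame are all simplifications made here.

```python
import numpy as np
from scipy.special import logsumexp

STAR = "<star>"  # placeholder for "one or more unknown tokens" in this sketch

def stc_like_forward(log_probs, target, vocab):
    """Sum the log-probabilities of all alignments consistent with a
    partially labeled target.

    log_probs: (T, V) array of frame-level log-probabilities.
    target: partial label sequence, e.g. ["c", STAR, "t"].
    """
    T, _ = log_probs.shape
    L = len(target)
    idx = {tok: i for i, tok in enumerate(vocab)}

    def emit(t, s):
        # A known token must be emitted at frame t; <star> accepts any token.
        if target[s] == STAR:
            return logsumexp(log_probs[t])
        return log_probs[t, idx[target[s]]]

    alpha = np.full((T, L), -np.inf)  # alpha[t, s]: log-prob of all prefixes
    alpha[0, 0] = emit(0, 0)
    for t in range(1, T):
        for s in range(L):
            stay = alpha[t - 1, s] if target[s] == STAR else -np.inf  # star self-loop
            step = alpha[t - 1, s - 1] if s > 0 else -np.inf          # advance in target
            alpha[t, s] = np.logaddexp(stay, step) + emit(t, s)
    return alpha[T - 1, L - 1]
```

For example, with vocab ["a", "c", "t"] and target ["c", STAR, "t"], every frame sequence that starts with "c" and ends with "t" contributes to the score, regardless of what the missing middle tokens are.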
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
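For readers who want to try a fine-tuned wav2vec 2.0 model, here is a minimal inference sketch using the Hugging Face `transformers` implementation; the checkpoint name and the 16 kHz mono-audio assumption are ours, not part of the paper.

```python
# Greedy CTC decoding with a publicly released fine-tuned wav2vec 2.0 checkpoint.
# Assumes `waveform_16khz` is a 16 kHz mono float array.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

def transcribe(waveform_16khz):
    inputs = processor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits  # (1, frames, vocab)
    ids = torch.argmax(logits, dim=-1)              # greedy per-frame decoding
    return processor.batch_decode(ids)[0]           # collapse repeats and blanks
```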
TVLT: Textless Vision-Language Transformer
In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR). TVLT is trained by reconstructing masked patches of continuous video frames and audio spectrograms (masked autoencoding) and contrastive modeling to align video and audio. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without assuming the prior existence of text. Our code and checkpoints are available at: https://github.com/zinengtang/TVLT
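The masked-autoencoding objective can be illustrated with a short toy sketch: mask a random subset of flattened audio-spectrogram (or video-frame) patches, reconstruct them, and compute the loss only on the masked positions. The patch size, mask ratio, and tiny encoder below are placeholders rather than TVLT's actual architecture, and masked tokens are simply zeroed here instead of being dropped from the encoder.

```python
import torch
import torch.nn as nn

patch_embed = nn.Linear(16 * 16, 192)   # toy embedding of 16x16 patches
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=4, batch_first=True), num_layers=2
)
decoder = nn.Linear(192, 16 * 16)       # toy reconstruction head

def masked_autoencode_loss(patches, mask_ratio=0.75):
    """patches: (B, N, 256) flattened spectrogram or video patches."""
    B, N, _ = patches.shape
    mask = torch.rand(B, N) < mask_ratio                    # True = masked patch
    tokens = patch_embed(patches)
    tokens = torch.where(mask.unsqueeze(-1), torch.zeros_like(tokens), tokens)
    recon = decoder(encoder(tokens))
    # Reconstruction loss only on masked patches, as in masked autoencoding.
    return ((recon - patches) ** 2)[mask].mean()
```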
Speech-T: Transducer for Text to Speech and Beyond
Neural Transducer (e.g., RNN-T) has been widely used in automatic speech recognition (ASR) due to its capabilities of efficiently modeling monotonic alignments between input and output sequences and naturally supporting streaming inputs. Considering that monotonic alignments are also critical to text to speech (TTS) synthesis and streaming TTS is also an important application scenario, in this work we explore the possibility of applying Transducer to TTS and beyond. However, this is challenging because it is difficult to trade off the emission (continuous mel-spectrogram prediction) probability and the transition (the ASR Transducer predicts a blank token to indicate transition to the next input) probability when calculating the output probability lattice, and it is not easy to learn the alignments between text and speech through the output probability lattice. We propose SpeechTransducer (Speech-T for short), a Transformer-based Transducer model that 1) uses a new forward algorithm to separate the transition prediction from the continuous mel-spectrogram prediction when calculating the output probability lattice, and uses a diagonal constraint in the probability lattice to help alignment learning; 2) supports both full-sentence and streaming TTS by adjusting the look-ahead context; and 3) further supports TTS and ASR together for the first time, which enjoys several advantages including fewer parameters as well as streaming synthesis and recognition in a single model. Experiments on the LJSpeech dataset demonstrate that Speech-T 1) is more robust than attention-based autoregressive TTS models due to its inherent monotonic alignments between text and speech; 2) naturally supports streaming TTS with good voice quality; and 3) enjoys the benefit of jointly modeling TTS and ASR in a single network.
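A hedged sketch of the decoding loop helps make the transducer-for-TTS idea concrete: at each lattice step the model produces a continuous mel frame together with a separate transition probability, and the decoder either emits the frame for the current text token or advances to the next token. The `model.step(...)` interface, the `mel_dim` attribute, and the 0.5 threshold below are illustrative assumptions, not Speech-T's actual implementation.

```python
import torch

def transducer_tts_decode(model, text_tokens, max_frames=1000):
    """Greedy streaming decode for a Transducer-style TTS model.

    Assumes a hypothetical `model.step(token, prev_frame)` that returns
    (mel_frame, transition_prob): a continuous mel prediction and the
    probability of moving on to the next text token.
    """
    mel_frames = []
    prev_frame = torch.zeros(model.mel_dim)
    u = 0                                   # position along the text axis
    while u < len(text_tokens) and len(mel_frames) < max_frames:
        mel_frame, p_transition = model.step(text_tokens[u], prev_frame)
        if p_transition > 0.5:
            u += 1                          # advance to the next text token
        else:
            mel_frames.append(mel_frame)    # emit a frame for the current token
            prev_frame = mel_frame
    if not mel_frames:
        return torch.empty(0, model.mel_dim)
    return torch.stack(mel_frames)
```

Because decoding only ever moves forward along the text axis, the alignment between text and speech stays monotonic by construction, which is the property the abstract highlights.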
Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
Neural networks have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models within more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to predict text directly from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features, which are then given to a classifier trained to recognize phones at the frame level.
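The probing setup is straightforward to reproduce in outline: run the pre-trained CTC model, collect hidden activations for each frame, and fit a simple classifier on frame-level phone labels. The `extract_frame_features` callable and the logistic-regression probe below are our placeholders; the paper's exact classifier and choice of layers may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def probe_phone_information(extract_frame_features, utterances, frame_phone_labels):
    """extract_frame_features: callable mapping an utterance to an
    (n_frames, hidden_dim) array taken from one layer of the pre-trained model.
    frame_phone_labels: list of (n_frames,) integer phone labels per utterance.
    """
    X = np.concatenate([extract_frame_features(u) for u in utterances])
    y = np.concatenate(frame_phone_labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    # Frame-level probe: higher held-out accuracy suggests the chosen layer
    # encodes more phonetic information.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```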
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
We present a factorized hierarchical variational autoencoder, which learns disentangled and interpretable representations from sequential data without supervision. Specifically, we exploit the multi-scale nature of information in sequential data by formulating it explicitly within a factorized hierarchical graphical model that imposes sequence-dependent priors and sequence-independent priors on different sets of latent variables. The model is evaluated on two speech corpora to demonstrate, qualitatively, its ability to transform speakers or linguistic content by manipulating different sets of latent variables; and, quantitatively, its ability to outperform an i-vector baseline for speaker verification and to reduce the word error rate by as much as 35% in mismatched train/test scenarios for automatic speech recognition tasks.
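The factorization can be illustrated with a minimal latent-variable sketch: one set of latent variables gets a standard, sequence-independent Gaussian prior, while the other gets a prior centered on a learned per-sequence mean, which is what encourages sequence-level factors (e.g. speaker identity) to collect in that set. The latent dimensions and the diagonal-Gaussian-only treatment below are simplifications of the actual model.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )

# Per-segment posterior parameters produced by an encoder (placeholders here).
mu_z1, logvar_z1 = torch.zeros(32), torch.zeros(32)   # "segment-level" latent set
mu_z2, logvar_z2 = torch.zeros(16), torch.zeros(16)   # "sequence-level" latent set
mu2_seq = torch.zeros(16, requires_grad=True)         # learned per-sequence prior mean

# Sequence-independent prior on z1, sequence-dependent prior on z2:
kl_z1 = gaussian_kl(mu_z1, logvar_z1, torch.zeros(32), torch.zeros(32))
kl_z2 = gaussian_kl(mu_z2, logvar_z2, mu2_seq, torch.zeros(16))
```

The asymmetry between the two KL terms is the key structural choice: only z2 is pulled toward a mean shared across all segments of the same sequence.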