Speech
New video games for players with disabilities coming, courtesy of Electronic Arts
EA's Patent Pledge is a commitment to provide royalty-free IP to the gaming industry at large, along with a promise not to enforce its patents on the accessible technologies the company offers. The 23 new technologies include six audio and sound technologies, as well as a newly open-sourced photosensitivity analysis plug-in for Unreal Engine 5. The plug-in allows designers to test their games using EA's IRIS tech in real time -- IRIS makes gameplay easier for people with epilepsy or other photosensitivities. The audio patents include tech for improved and simplified speech recognition and more personalized speech options, including text-to-speech in the voice of video game characters and machine-learning-powered voice aging. These options make in-game expression more inclusive for gamers.
Microsoft will let you clone your voice for Teams calls, powered by AI
Microsoft Teams users will soon be able to use cloned versions of their voices to speak and translate conversations in real time, as the company unveils its new AI-powered Interpreter tool. Announced at the annual Microsoft Ignite conference and reported by TechCrunch, the new feature allows users to create digital replicas of their voices that can then be used to translate their speech into various languages. "Imagine being able to sound just like you in a different language. Interpreter in Teams provides real-time speech-to-speech translation during meetings, and you can opt to have it simulate your speaking voice for a more personal and engaging experience," wrote Microsoft CMO Jared Spataro in a blog post shared with the publication. The feature will only be available to Microsoft 365 subscribers, and will launch initially for English, French, German, Italian, Japanese, Korean, Portuguese, Mandarin Chinese, and Spanish. Microsoft's Interpreter has the potential to make the business of remote work and digital socialization more accessible to a wider array of non-English speakers, though it's not yet as dynamic as a live, human translator.
Google now offers a standalone Gemini app on iPhone
Google now offers a dedicated Gemini AI app on iPhone. First spotted by MacRumors, the free software is available to download in Australia, India, the US and the UK following a soft launch in the Philippines earlier this week. Before today, iPhone users could access Gemini through the Google app, though there were some notable limitations. For instance, the dedicated app includes Google's Gemini Live feature, which allows users to interact with the AI agent from their iPhone's Dynamic Island and Lock Screen. As a result, you don't need to have the app open on your phone's screen to use Gemini.
Avoiding Siri slipups and apologies for butt dials
Voice assistants may cause confusion across devices. Tech expert Kurt Knutsson offers some solutions to fix it. When it comes to using voice assistants across multiple devices, things can get a bit tricky. "Mike" from St. George, Utah, found himself in a comical yet frustrating situation with his personal and work iPhones. Let's dive into his predicament and explore some solutions.
This Qualcomm-Google partnership may give us the in-car voice assistants we've been waiting for
If there's one thing we've learned over the past year, it's that generative AI is playing a pivotal role in technological advancements, including where people spend much of their day -- automobiles. Today, Google and Qualcomm have officially partnered to leverage their chips and generative AI capabilities to help developers create more enriched automotive experiences. On Tuesday, at the chipmaker's Snapdragon Summit, Qualcomm announced a multi-year collaboration with Google that will utilize both companies' latest technologies, including Snapdragon Digital Chassis, Android Automotive OS, and Google Cloud, to advance digital transformation in cars and develop generative AI-enabled digital cockpits. According to the press release, Google will bring its AI expertise to the collaboration, allowing for the development of intuitive generative AI experiences that anticipate users' needs, such as more advanced voice assistants and immersive mapping, making driving as a whole less burdensome. Meanwhile, Qualcomm will supply its Snapdragon heterogeneous edge AI system-on-chips (SoCs) and its Qualcomm AI Hub, a platform that developers can use to deploy and manage AI models on Qualcomm-powered devices to run the experiences and the underlying vision, audio, and speech models.
Multimodal and Multilingual Embeddings for Large-Scale Speech Mining
We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several language pairs.
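As a rough sketch of how mining in a shared embedding space can work, the snippet below pairs speech and text embeddings by mutual nearest neighbours under cosine similarity. The plain cosine criterion and fixed threshold are simplifications (LASER-style mining typically uses a margin-based score), and the encoder outputs here are assumed to be precomputed arrays rather than the authors' released models.

```python
import numpy as np

def l2_normalize(x):
    # L2-normalize rows so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mine_pairs(speech_emb, text_emb, threshold=0.75):
    """Return (speech_idx, text_idx, score) for mutual nearest neighbours
    whose cosine similarity clears the threshold."""
    s = l2_normalize(speech_emb)       # (n_audio, d) speech encoder outputs
    t = l2_normalize(text_emb)         # (n_text, d) LASER-style text embeddings
    sim = s @ t.T                      # all pairwise cosine similarities
    best_text = sim.argmax(axis=1)     # nearest sentence for each audio clip
    best_audio = sim.argmax(axis=0)    # nearest audio clip for each sentence
    pairs = []
    for i, j in enumerate(best_text):
        if best_audio[j] == i and sim[i, j] >= threshold:  # mutual-NN filter
            pairs.append((i, j, float(sim[i, j])))
    return pairs

# Example with random stand-in embeddings; real use would load encoder outputs.
speech = np.random.randn(100, 1024)
text = np.random.randn(5000, 1024)
print(mine_pairs(speech, text)[:3])
```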
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.
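As a hedged illustration of the adversarial ingredient described above, the sketch below wires a frozen WavLM model (loaded via Hugging Face transformers) into a small real/fake scoring head. The checkpoint name, the head architecture, and the mean pooling are assumptions for illustration only; StyleTTS 2's actual discriminator and training loop differ.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    """Scores waveforms as real/synthetic using frozen SLM features."""
    def __init__(self, name="microsoft/wavlm-base-plus"):
        super().__init__()
        self.slm = WavLMModel.from_pretrained(name)
        self.slm.requires_grad_(False)            # keep the SLM frozen
        hidden = self.slm.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.LeakyReLU(0.2),
                                  nn.Linear(256, 1))

    def forward(self, wav):                       # wav: (batch, samples) at 16 kHz
        feats = self.slm(wav).last_hidden_state   # (batch, frames, hidden)
        return self.head(feats).mean(dim=1)       # one real/fake score per clip

disc = SLMDiscriminator()
scores = disc(torch.randn(2, 16000))              # one second of audio per item
```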
HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis
This paper presents HierSpeech, a high-quality end-to-end text-to-speech (TTS) system based on a hierarchical conditional variational autoencoder (VAE) utilizing self-supervised speech representations. Recently, single-stage TTS systems, which directly generate the raw speech waveform from text, have been gaining interest thanks to their ability to generate high-quality audio within a fully end-to-end training pipeline. However, there is still room for improvement in conventional TTS systems. Since it is challenging to infer both linguistic and acoustic attributes directly from text, some attribute details, specifically linguistic information, are inevitably lost, which results in mispronunciation and over-smoothing problems in the synthetic speech. To address this problem, we leverage self-supervised speech representations as additional linguistic representations to bridge the information gap between text and speech.
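To make the "bridging" idea concrete, here is a minimal sketch of a conditional-VAE step in which a posterior encoder over self-supervised speech features is regularized toward a text-conditioned prior via a KL term. The module sizes, the 768-dimensional feature assumption, and the single-level latent are illustrative simplifications of the hierarchical model described above, not the HierSpeech implementation.

```python
import torch
import torch.nn as nn

class LatentEncoder(nn.Module):
    """Maps an input vector to the mean and log-variance of a Gaussian latent."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * z_dim))
    def forward(self, x):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        return mu, logvar

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions.
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1).sum(-1)

posterior = LatentEncoder(in_dim=768, z_dim=64)   # self-supervised speech features
prior     = LatentEncoder(in_dim=256, z_dim=64)   # text/phoneme encoder output

speech_feats = torch.randn(8, 768)                # dummy frame-level features
text_feats   = torch.randn(8, 256)

mu_q, lv_q = posterior(speech_feats)
mu_p, lv_p = prior(text_feats)
z = mu_q + torch.randn_like(mu_q) * (0.5 * lv_q).exp()   # reparameterization trick
loss_kl = kl_divergence(mu_q, lv_q, mu_p, lv_p).mean()   # ties posterior to prior
```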
Star Temporal Classification: Sequence Modeling with Partially Labeled Data
We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition.
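The paper expresses STC as a composition of WFSTs computed with GTN; below is a much-simplified sketch of that composition idea, assuming the gtn Python bindings' Graph/compose/forward_score API. The tiny alphabet, the zero-weight emissions, and the way the star wildcard is wired into the label graph are illustrative assumptions, not the authors' STC implementation.

```python
import gtn

T, V = 4, 3                                   # 4 frames, 3 tokens (0, 1, 2)

# Emissions graph: a linear chain over frames; each frame may emit any token.
emissions = gtn.Graph(False)                  # calc_grad=False for this toy
for t in range(T + 1):
    emissions.add_node(t == 0, t == T)        # start at frame 0, accept at frame T
for t in range(T):
    for v in range(V):
        emissions.add_arc(t, t + 1, v, v, 0.0)  # zero log-prob placeholder weights

# Label graph for the partial transcript [1, ?]: token 1 is observed, then a
# wildcard self-loop accepts any run of tokens where labels might be missing.
labels = gtn.Graph(False)
labels.add_node(True)                         # start node
labels.add_node(False, True)                  # accept node
labels.add_arc(0, 1, 1)                       # the observed token
for v in range(V):
    labels.add_arc(1, 1, v)                   # "star": any token may follow

alignments = gtn.compose(emissions, labels)   # alignments consistent with both graphs
score = gtn.forward_score(alignments)         # log-sum over those alignments
print(score.item())
```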
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
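For intuition about the style-eliminating normalization, the sketch below mixes per-utterance feature statistics across a batch around a layer norm, in the spirit of Mix-Style Layer Normalization. The Beta-sampled mixing coefficient and the placement of the mixing follow the generic MixStyle recipe and are assumptions here, not GenerSpeech's released module.

```python
import torch
import torch.nn as nn

class MixStyleLayerNorm(nn.Module):
    """Layer norm whose output is re-styled with statistics mixed across the batch,
    encouraging downstream layers to become robust to (i.e. ignore) style."""
    def __init__(self, dim, alpha=0.1, p=0.5):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.alpha, self.p = alpha, p

    def forward(self, x):                                 # x: (batch, time, dim)
        if not self.training or torch.rand(1).item() > self.p:
            return self.norm(x)
        mu = x.mean(dim=(1, 2), keepdim=True)             # per-utterance "style" stats
        sigma = x.std(dim=(1, 2), keepdim=True) + 1e-6
        perm = torch.randperm(x.size(0))                  # a shuffled partner per item
        lam = torch.distributions.Beta(self.alpha, self.alpha).sample((x.size(0), 1, 1))
        mix_mu = lam * mu + (1 - lam) * mu[perm]          # blended statistics
        mix_sigma = lam * sigma + (1 - lam) * sigma[perm]
        return self.norm(x) * mix_sigma + mix_mu          # re-style the normalized content

layer = MixStyleLayerNorm(80)
layer.train()
out = layer(torch.randn(4, 120, 80))                      # (batch, frames, mel bins)
```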