ContextASR-Bench: A Massive Contextual Speech Recognition Benchmark

Wang, He, Ma, Linhan, Guo, Dake, Wang, Xiong, Xie, Lei, Xu, Jin, Lin, Junyang

arXiv.org Artificial Intelligence

Automatic Speech Recognition (ASR) has been extensively investigated, yet prior benchmarks have largely focused on assessing the acoustic robustness of ASR models, leaving evaluations of their linguistic capabilities relatively underexplored. This stems largely from the limited parameter sizes and training corpora of conventional ASR models, which give them insufficient world knowledge, a capability crucial for accurately recognizing named entities across diverse domains, such as drug and treatment names in medicine or specialized technical terms in engineering. Recent breakthroughs in Large Language Models (LLMs) and corresponding Large Audio Language Models (LALMs) have markedly advanced context modeling and general artificial intelligence capabilities. Leveraging LLMs, we envision a unified system capable of robust speech recognition across diverse real-world domains, yet existing benchmarks are inadequate for evaluating this objective. To address this gap, we propose ContextASR-Bench: a comprehensive, large-scale benchmark designed to assess the linguistic competence of ASR systems using corpora that feature numerous named entities across multiple domains. It encompasses up to 40,000 data entries with more than 300,000 named entities spanning over 10 domains. Beyond the audio and its transcription, each sample provides the domain it belongs to and a list of the named entities it contains, referred to as the context. Based on this, we introduce three evaluation modes to assess how effectively models can exploit such context to improve ASR accuracy. Extensive evaluation on ContextASR-Bench shows that LALMs outperform conventional ASR models by a large margin thanks to the strong world knowledge and context modeling of LLMs, yet ample room for improvement remains. The dataset and evaluation code have been released.
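The per-sample entity lists described above suggest a natural entity-level metric alongside word error rate. The following is a minimal sketch of one such metric, named-entity recall against a transcript; the function name and the simplistic substring matching are illustrative assumptions, not the benchmark's released evaluation code, which likely normalizes and aligns text more carefully.

```python
def entity_recall(transcript: str, entities: list[str]) -> float:
    """Fraction of context entities that appear verbatim in the transcript.

    A simplified, case-insensitive stand-in for the entity metrics a
    contextual ASR benchmark might report.
    """
    if not entities:
        return 1.0
    text = transcript.lower()
    hits = sum(1 for e in entities if e.lower() in text)
    return hits / len(entities)


# Example: one of the two context entities is recovered.
score = entity_recall("the patient was given metformin daily",
                      ["Metformin", "insulin"])
```

A metric like this rewards a model for exploiting the provided context even when its overall word error rate is unchanged.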


The Plaud NotePin Is an AI Notetaker That Will Transcribe Your Meetings--and Your Entire Life

WIRED

If you want to coast through meetings, keep track of everyone you meet, or just remember the name of that obscure dog food your veterinarian told you to feed your pooch, there's a necklace for that. Plaud is an AI company that makes the creatively named Plaud Note--a slim ChatGPT-enabled audio recorder that can be stuck on the back of your phone or slipped into a shirt pocket to record, transcribe, and summarize your conversations. The company's newest offering is called the Plaud NotePin (the naming scheme doesn't get any better here), and it takes basically all the same features of the Note and packs them into a wearable device about the size of a lipstick tube. The NotePin can be worn as a necklace, a wristwatch, or a pin, or clipped onto something like a lapel. It costs $169 and lets you record up to 300 minutes of audio per month.


Instruction-Following Speech Recognition

Lai, Cheng-I Jeff, Lu, Zhiyun, Cao, Liangliang, Pang, Ruoming

arXiv.org Artificial Intelligence

Conventional end-to-end Automatic Speech Recognition (ASR) models primarily focus on exact transcription tasks, lacking flexibility for nuanced user interactions. With the advent of Large Language Models (LLMs) in speech processing, more organic, text-prompt-based interactions have become possible. However, the mechanisms behind these models' speech understanding and "reasoning" capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. This enables a multitude of speech recognition tasks -- ranging from transcript manipulation to summarization -- without relying on predefined command sets. Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring LLMs or pre-trained speech modules. It also offers selective transcription options based on instructions like "transcribe first half and then turn off listening," providing an additional layer of privacy and safety compared to existing LLMs. Our findings highlight the significant potential of instruction-following training to advance speech foundation models.


Melody transcription via generative pre-training

Donahue, Chris, Thickstun, John, Liang, Percy

arXiv.org Artificial Intelligence

Despite the central role that melody plays in music perception, it remains an open challenge in music information retrieval to reliably detect the notes of the melody present in an arbitrary music recording. A key challenge in melody transcription is building methods which can handle broad audio containing any number of instrument ensembles and musical styles - existing strategies work well for some melody instruments or styles but not all. To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio, thereby improving performance on melody transcription by 20% relative to conventional spectrogram features. Another obstacle in melody transcription is a lack of training data - we derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music. The combination of generative pre-training and a new dataset for this task results in 77% stronger performance on melody transcription relative to the strongest available baseline. By pairing our new melody transcription approach with solutions for beat detection, key estimation, and chord recognition, we build Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio. Audio examples can be found at https://chrisdonahue.com/sheetsage and code at https://github.com/chrisdonahue/sheetsage .


Analyze User Speech Quickly With Parallel Serverless Architecture

#artificialintelligence

It's crucial to quickly analyze what your users are saying as they interact with your business. This can range from dissatisfied customer calls to profanity-laced user chats in your app. But it can also be about experience improvement, advertising, and sales opportunities. Quick action on your part is needed once audio of interest has been flagged. Of course, you could consider doing classification in real time for every user utterance.
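The fan-out pattern the article alludes to can be sketched locally with a thread pool standing in for parallel serverless invocations. This is a toy illustration, not an actual serverless deployment: the profanity lexicon and function names are made up, and a real pipeline would call a classifier or an NLP service per utterance.

```python
from concurrent.futures import ThreadPoolExecutor

PROFANITY = {"darn", "heck"}  # toy lexicon, for illustration only


def flag_utterance(utterance: str) -> bool:
    # Stand-in for a real per-utterance classifier
    # (sentiment, profanity, or intent detection).
    return any(word in PROFANITY for word in utterance.lower().split())


def analyze_batch(utterances: list[str], workers: int = 8) -> list[str]:
    # Fan each utterance out to its own worker, mirroring how a
    # serverless design would invoke one function per utterance.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(flag_utterance, utterances))
    return [u for u, flagged in zip(utterances, flags) if flagged]
```

Because `pool.map` preserves input order, flagged utterances come back in the order they were spoken, which keeps downstream review straightforward.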


Microsoft 365 saves you time and effort with transcription and voice commands in Word - Microsoft 365 Blog

#artificialintelligence

Now more than ever, we're all very busy--juggling family, work, friends, and whatever else life throws our way. New enhancements in Office leverage the Azure Cognitive Services AI platform so you can harness the power of your voice to spend less time and energy creating your best work and focus on what matters most. Whether you're a reporter conducting interviews, a researcher recording focus group sessions, or an online entrepreneur recording informal discussions, you want to be able to focus on the people you're talking to without worrying about taking notes and without having to spend hours transcribing your conversations after the fact. If that sounds like you, Transcribe in Word is here to help. Now you can record your conversations directly in Word for the web and transcribe them automatically.


Bringing Live Transcribe's Speech Engine to Everyone

#artificialintelligence

Earlier this year, Google launched Live Transcribe, an Android application that provides real-time automated captions for people who are deaf or hard of hearing. Through many months of user testing, we've learned that robustly delivering good captions for long-form conversations isn't so easy, and we want to make it easier for developers to build upon what we've learned. Live Transcribe's speech recognition is provided by Google's state-of-the-art Cloud Speech API, which under most conditions delivers pretty impressive transcript accuracy. However, relying on the cloud introduces several complications--most notably robustness to ever-changing network connections, data costs, and latency. Today, we are sharing our transcription engine with the world so that developers everywhere can build applications with robust transcription. Those who have worked with our Cloud Speech API know that sending infinitely long streams of audio is currently unsupported.
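The limitation mentioned above, that the streaming API does not accept arbitrarily long audio, is commonly worked around by splitting audio into bounded chunks before sending. A minimal sketch of that chunking, assuming raw 16 kHz mono 16-bit PCM and an illustrative 240-second cap (not Cloud Speech's documented limit):

```python
def chunk_pcm(pcm: bytes, sample_rate: int = 16000,
              sample_width: int = 2, chunk_seconds: int = 240):
    """Split raw mono PCM into chunks short enough for a streaming
    recognizer with a per-request duration cap.

    The 240 s default is an assumed limit for illustration; check the
    target API's documentation for its actual streaming constraints.
    """
    chunk_bytes = sample_rate * sample_width * chunk_seconds
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Each yielded chunk can then be sent as its own streaming request; a production system would also restart requests at word boundaries or silences rather than at arbitrary byte offsets.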


Google open-sources Live Transcribe's speech engine

#artificialintelligence

The company hopes doing so will let any developer deliver captions for long-form conversations. The source code is available now on GitHub. Google released Live Transcribe in February. The tool uses machine learning algorithms to turn audio into real-time captions. Unlike Android's upcoming Live Caption feature, Live Transcribe is a full-screen experience, uses your smartphone's microphone (or an external microphone), and relies on the Google Cloud Speech API.


Developer Attempts to Transcribe a Podcast with Microsoft's Speech API. Hilarity Ensues.

#artificialintelligence

Over the last few years, the wave of machine learning and artificial intelligence APIs has been cresting as more and more businesses see the potential for differentiation and more and more API providers look to service that need. IBM clearly recognized the potential of these APIs when it acquired AlchemyAPI back in 2015. Alchemy specialized in machine-learning driven APIs like sentiment analysis and image/language processing. Those APIs are now a part of IBM's Watson portfolio. Now, a few years later, everyone is getting into the game.


Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Cloud AI

#artificialintelligence

For most businesses, machine learning seems close to rocket science, appearing expensive and talent-demanding. And, if you're aiming at building another Netflix recommendation system, it really is. But the trend of making everything-as-a-service has affected this sophisticated sphere, too. You can jump-start an ML initiative without much investment, which would be the right move if you are new to data science and just want to grab the low-hanging fruit. One of machine learning's most inspiring stories is that of a Japanese farmer who decided to sort cucumbers automatically to help his parents with this painstaking operation.