Whisper Into This AI-Powered Smart Ring to Organize Your Thoughts

WIRED

A new company called Sandbar has unveiled a smart wearable called Stream Ring, which uses a microphone to record your softly spoken thoughts. Everyone has an inner monologue. When you're commuting on the train, riding a bike, or in the shower, chances are you're thinking about the day ahead, tasks you need to do, or maybe just mulling over a conversation you had the night before. Much of this stays in our brains, soon to be forgotten or pushed away when the train comes to the station. But what if you could have it all subtly recorded in one place, ready for you to digest later on?


Quantization for OpenAI's Whisper Models: A Comparative Analysis

Andreyev, Allison

arXiv.org Artificial Intelligence

Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open source LibriSpeech dataset, this paper evaluates the word error rate (WER) along with latency analysis of whispercpp using three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
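The word error rate reported above is the standard ASR metric: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal self-contained implementation, using dynamic programming and no external libraries:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

In practice one would run this over each transcript pair produced by the quantized and unquantized models and compare the averages.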


Deepfake Detection of Singing Voices With Whisper Encodings

Sharma, Falguni, Gupta, Priyanka

arXiv.org Artificial Intelligence

The deepfake generation of singing vocals is a concerning issue for artists in the music industry. In this work, we propose a singing voice deepfake detection (SVDD) system, which uses noise-variant encodings of OpenAI's Whisper model. As counter-intuitive as it may sound, even though the Whisper model is known to be noise-robust, its encodings are rich in non-speech information and are noise-variant. This leads us to evaluate Whisper encodings as feature representations for the SVDD task. Therefore, in this work, the SVDD task is performed on vocals and mixtures, and the performance is evaluated in %EER over varying Whisper model sizes and two classifiers (CNN and ResNet34), under different testing conditions.
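The core idea of using frozen encoder outputs as feature representations for a downstream detector can be illustrated with a minimal sketch: a logistic-regression probe trained on fixed embedding vectors. Everything here is an illustrative assumption, not the paper's method: the paper uses CNN and ResNet34 classifiers, and in practice `X` would hold pooled Whisper encoder outputs for each clip rather than synthetic vectors.

```python
import numpy as np

def train_probe(X, y, lr=0.1, epochs=200):
    """Logistic-regression probe on fixed (frozen) encoder features."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        grad_w = X.T @ (p - y) / len(y)          # gradient of BCE loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Binary decision: 1 = deepfake, 0 = bona fide (by convention here)."""
    return (X @ w + b > 0).astype(int)
```

A probe like this is a common first check of whether embeddings carry a signal (here, bona fide vs. deepfake) before training heavier classifiers.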


OpenAI's Whisper invents parts of transcriptions -- a lot

Engadget

Imagine going to the doctor, telling them exactly how you're feeling and then a transcription later adds false information and alters your story. That could be the case in medical centers that use Whisper, OpenAI's transcription tool. Over a dozen developers, software engineers and academic researchers have found evidence that Whisper creates hallucinations -- invented text -- that includes made up medications, racial commentary and violent remarks, ABC News reports. Yet, in the last month, open-source AI platform HuggingFace saw 4.2 million downloads of Whisper's latest version. The tool is also built into Oracle and Microsoft's cloud computing platforms, along with some versions of ChatGPT.


Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Li, Jinpeng, Pu, Yu, Sun, Qi, Zhang, Wei-Qiang

arXiv.org Artificial Intelligence

Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost data to improve the performance of Whisper on under-represented languages. In this study, we utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh. We implemented end of transcript (EOT) judgment modification and hallucination penalty to improve the performance of speech recognition. Further, we employed the decoding average token log probability as a criterion to select samples from unlabeled speech data and used pseudo-labeled data to fine-tune the model to further improve its performance. Ultimately, we achieved more than 10% absolute WER reduction in multiple experiments, and the whole process has the potential to be generalized to other under-represented languages.
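The sample-selection step described above (keep only hypotheses whose decoding average token log probability clears a threshold) can be sketched as a simple filter. The dict layout and the threshold value are assumptions for illustration; the paper tunes its own criterion.

```python
def avg_logprob(token_logprobs):
    """Average per-token log probability of a decoded hypothesis."""
    return sum(token_logprobs) / len(token_logprobs)

def select_pseudo_labels(samples, threshold=-0.5):
    """Keep decoded samples confident enough to use as pseudo labels.

    Each sample is assumed to be a dict with a "token_logprobs" list
    collected during decoding; threshold is a tunable cutoff.
    """
    return [s for s in samples if avg_logprob(s["token_logprobs"]) > threshold]
```

The surviving samples would then be paired with their decoded text and used to fine-tune the model, as the abstract describes.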


Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Hu, Yuchen, Chen, Chen, Yang, Chao-Han Huck, Qin, Chengwei, Chen, Pin-Yu, Chng, Eng Siong, Zhang, Chao

arXiv.org Artificial Intelligence

We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents. STAR is developed for prevalent speech foundation models based on Transformer-related architectures with auto-regressive decoding (e.g., Whisper, Canary). Specifically, we propose a novel indicator that empirically integrates step-wise information during decoding to assess the token-level quality of pseudo labels without ground truth, thereby guiding model updates for effective unsupervised adaptation. Experimental results show that STAR achieves an average 13.5% relative reduction in word error rate across 14 target domains, and it sometimes even approaches the upper-bound performance of supervised adaptation. Surprisingly, we also observe that STAR protects the adapted model from the common catastrophic forgetting problem without recalling source-domain data. Furthermore, STAR exhibits high data efficiency, requiring less than one hour of unlabeled data, and generalizes seamlessly to alternative large speech models and speech translation tasks. Our code will be open-sourced to the research community.
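One plausible way to realize a token-level quality indicator for pseudo labels is to weight each token's loss by the decoder's confidence in it. This is a hedged sketch of that general idea, not STAR's actual indicator (the paper defines its own step-wise formula); the threshold `tau` and the use of max softmax probability as confidence are illustrative assumptions.

```python
import numpy as np

def token_confidence_weights(probs, tau=0.5):
    """Per-token weights from decoder confidence.

    probs: (T, V) softmax distributions over the vocabulary at each step.
    Tokens whose max probability falls below tau get zero weight.
    """
    conf = probs.max(axis=-1)  # (T,) per-token confidence
    return np.where(conf >= tau, conf, 0.0)

def weighted_nll(probs, pseudo_labels, weights):
    """Confidence-weighted negative log-likelihood over pseudo-label tokens."""
    tok_nll = -np.log(probs[np.arange(len(pseudo_labels)), pseudo_labels] + 1e-9)
    return float((weights * tok_nll).sum() / (weights.sum() + 1e-9))
```

During adaptation, this weighted loss would replace the usual uniformly weighted cross-entropy, so unreliable pseudo-label tokens contribute little or nothing to model updates.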


Interesting ChatGPT Apps

#artificialintelligence

With the success of transformer-based pretrained language models in various NLP tasks, dialogue-oriented pretrained language models have been developed. ChatGPT is an extraordinary dialogue-oriented (chatbot) model released by OpenAI in November 2022. Internet users have explored how ChatGPT can be used for various tasks like question answering, code generation, code debugging, blog post writing, learning new concepts, etc. Now you are going to explore some of the interesting ChatGPT apps. In general, to interact with ChatGPT you pass commands, i.e., your queries or instructions, as text.


The 5 most important recent developments in AI

#artificialintelligence

From solving maths and science problems to translating with astonishing accuracy between hundreds of languages – not to mention generating images and videos based on a natural language prompt – AI is making strides pretty much across the board. In this article, I'll briefly discuss some of the most recent (and most exciting!) developments. So, without further ado, let's dive in! Released on 1 August 2022, Minerva is a language model capable of not only solving maths and science problems submitted in the form of natural language, but also of providing its reasoning behind the answer. So far, Google has built three versions of the model, getting bigger with each iteration.


Using Whisper (speech-to-text) and Tortoise (text-to-speech)

#artificialintelligence

I’ll demonstrate how to extract an audio clip from YouTube, implement speech recognition using OpenAI’s Whisper, and perform speech generation using Tortoise to clone a custom voice.


Focus on Whisper, OpenAI's automatic speech recognition system - Actu IA

#artificialintelligence

OpenAI recently released Whisper, a 1.6 billion parameter AI model capable of transcribing and translating speech audio from 97 different languages, showing robust performance on a wide range of automated speech recognition (ASR) tasks. The model, trained on 680,000 hours of audio data collected from the web, was soon published as open source on GitHub. Whisper uses a transformer encoder-decoder architecture: the input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and then passed through an encoder. Unlike most state-of-the-art ASR models, it has not been fine-tuned on a specific dataset, but instead has been trained using weak supervision on a large-scale noisy dataset collected from the internet. Although it did not beat models specialized for LibriSpeech, in zero-shot evaluations on diverse datasets Whisper proved more robust, making 50% fewer errors than those models.
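The preprocessing step described above (a 30-second audio chunk converted to a log-Mel spectrogram) can be sketched with NumPy. The parameter values (16 kHz sampling rate, 400-sample window, 160-sample hop, 80 mel bins) match Whisper's published defaults, but this simplified sketch omits Whisper's exact padding and normalization.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters evenly spaced on the mel scale."""
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Frame -> Hann window -> |FFT|^2 -> mel filterbank -> log10."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log10(np.maximum(mel, 1e-10))
```

The resulting (frames, 80) array is what the encoder consumes; one second of 16 kHz audio yields roughly 100 frames at a 10 ms hop.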