Collaborating Authors

whisper


Quantization for OpenAI's Whisper Models: A Comparative Analysis

Andreyev, Allison

arXiv.org Artificial Intelligence

Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. It then quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open-source LibriSpeech dataset, this paper evaluates word error rate (WER) and latency for whisper.cpp under three quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45% while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and the possibilities for edge-device deployment. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
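The WER metric used in the evaluation above is simply the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal sketch (not the paper's actual evaluation code; the function name is illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words -> WER of 1/6.
score = wer("the cat sat on the mat", "the cat sat on a mat")
```

In practice, transcripts are normalized (lowercasing, punctuation removal) before scoring, since Whisper's raw output includes casing and punctuation.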


Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Li, Jinpeng, Pu, Yu, Sun, Qi, Zhang, Wei-Qiang

arXiv.org Artificial Intelligence

Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to use low-cost data to improve Whisper's performance on under-represented languages. In this study, we used easily accessible unpaired speech and text data and combined Whisper with the GPT language model for Kazakh. We implemented end-of-transcript (EOT) judgment modification and a hallucination penalty to improve speech recognition performance. Further, we employed the average token log probability during decoding as a criterion to select samples from unlabeled speech data, and used the resulting pseudo-labeled data to fine-tune the model, further improving its performance. Ultimately, we achieved more than 10% absolute WER reduction across multiple experiments, and the whole process has the potential to generalize to other under-represented languages.
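The selection criterion described above, keeping only utterances whose hypotheses the model decoded with high average token log probability, can be sketched as follows. This is a hedged illustration, not the authors' code: the data structure and the `threshold` value are assumptions.

```python
def avg_token_logprob(token_logprobs: list) -> float:
    """Mean per-token log probability of one decoded hypothesis."""
    return sum(token_logprobs) / len(token_logprobs)

def select_pseudo_labeled(decodings: dict, threshold: float = -0.5) -> dict:
    """Keep unlabeled utterances the model decoded confidently.

    `decodings` maps utterance id -> (hypothesis text, per-token log-probs);
    both the structure and the threshold are illustrative assumptions.
    """
    selected = {}
    for utt_id, (text, logprobs) in decodings.items():
        if avg_token_logprob(logprobs) >= threshold:
            selected[utt_id] = text  # retained as a pseudo-label for fine-tuning
    return selected
```

A confident hypothesis (small negative average log-prob) passes the filter, while a low-confidence one is discarded rather than fed back into fine-tuning.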


Indigenous groups fear culture distortion as AI learns their languages

The Japan Times

When U.S. tech firm OpenAI rolled out Whisper, a speech recognition tool offering audio transcription and translation into English for dozens of languages including Maori, it rang alarm bells for many Indigenous New Zealanders. Whisper, launched in September by the company behind the ChatGPT chatbot, was trained on 680,000 hours of audio from the web, including 1,381 hours of the Maori language. Indigenous tech and culture experts say that while such technologies can help preserve and revive their languages, harvesting their data without consent risks abuse, distortion of Indigenous culture, and the deprivation of minority rights.


Interesting ChatGPT Apps

#artificialintelligence

With the success of transformer-based pretrained language models in various NLP tasks, dialogue-oriented pretrained language models have been developed. ChatGPT is an extraordinary dialogue-oriented (chatbot) model released by OpenAI in November 2022. Internet users have explored how ChatGPT can be used for tasks such as question answering, code generation, code debugging, blog post writing, and learning new concepts. Now you are going to explore some interesting ChatGPT apps. In general, to interact with ChatGPT you pass commands, i.e., your queries or instructions, as text.


The 5 most important recent developments in AI

#artificialintelligence

From solving maths and science problems to translating with astonishing accuracy between hundreds of languages – not to mention generating images and videos based on a natural language prompt – AI is making strides pretty much across the board. In this article, I'll briefly discuss some of the most recent (and most exciting!) developments. So, without further ado, let's dive in! Released on 1 August 2022, Minerva is a language model capable not only of solving maths and science problems submitted in natural language, but also of explaining the reasoning behind its answers. So far, Google has built three versions of the model, each bigger than the last.


Using Whisper (speech-to-text) and Tortoise (text-to-speech)

#artificialintelligence

I’ll demonstrate how to extract an audio clip from YouTube, implement speech recognition using OpenAI’s Whisper, and perform speech generation using Tortoise to clone a custom voice.


Focus on Whisper, OpenAI's automatic speech recognition system - Actu IA

#artificialintelligence

OpenAI recently released Whisper, a 1.6-billion-parameter AI model capable of transcribing and translating speech audio in 97 different languages, showing robust performance on a wide range of automatic speech recognition (ASR) tasks. The model, trained on 680,000 hours of audio data collected from the web, was soon published as open source on GitHub. Whisper uses a transformer encoder-decoder architecture: the input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and then passed through the encoder. Unlike most state-of-the-art ASR models, it has not been fine-tuned on a specific dataset; instead, it was trained with weak supervision on a large-scale noisy dataset collected from the Internet. Although it did not beat models specialized for LibriSpeech, in zero-shot evaluations on diverse datasets Whisper proved more robust, making 50% fewer errors than those models.
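The 30-second chunking step described above can be sketched in plain NumPy. This is a simplified illustration, not Whisper's implementation: Whisper operates on 16 kHz mono audio and ships its own padding helper, and the log-Mel spectrogram step is omitted here.

```python
import numpy as np

SAMPLE_RATE = 16_000               # Whisper expects 16 kHz mono audio
CHUNK_SAMPLES = 30 * SAMPLE_RATE   # 30-second windows

def chunk_audio(audio: np.ndarray) -> list:
    """Split a waveform into 30 s chunks, zero-padding the final chunk."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            # Pad the trailing partial chunk with silence to a full 30 s.
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks
```

Each fixed-length chunk would then be converted to a log-Mel spectrogram before being passed to the encoder.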


Testing OpenAI's whisper with a Scottish accent

#artificialintelligence

OpenAI's recent release of Whisper boasts human-level robustness and accuracy in speech recognition. I'm not Scottish (although I was born pretty close), but I immediately wanted to test it with a Scottish accent and compare it to "human-level". Having bought an unexciting new iPhone, at least I could put its A16 Bionic chip with 16-core Neural Engine through its paces for my experiment. Once the boring tech stuff was out of the way, I shared the test app on TestFlight with a few colleagues, yielding much amusement with its borderline magical results. Here's a little clip from the start of Trainspotting, which is particularly challenging for machines to understand; a Scottish accent over the top of Iggy Pop isn't something you'd train for.


How will OpenAI's Whisper model impact AI applications?

#artificialintelligence

Last week, OpenAI released Whisper, an open-source deep learning model for speech recognition. Developers and researchers who have experimented with Whisper are impressed with what the model can do. However, what is perhaps equally important is what Whisper's release tells us about the shifting culture in artificial intelligence (AI) research and the kind of applications we can expect in the future.


OpenAI can hear you Whisper

#artificialintelligence

Speech recognition remains a challenge in artificial intelligence, but OpenAI's latest move takes us one step closer to solving it. The software is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data from the web. Other organizations such as Google, Meta, and Amazon have all designed ASR systems that lie at the core of many products, and OpenAI's new system could now outperform every one of them. What makes this software different is its robustness to background noise, accents, and technical terminology.