

SpeechT5 Model


Edge-Based Speech Transcription and Synthesis for Kinyarwanda and Swahili Languages

Mbonimpa, Pacome Simon, Tuyizere, Diane, Biyabani, Azizuddin Ahmed, Tonguz, Ozan K.

arXiv.org Artificial Intelligence

Abstract--This paper presents a novel framework for speech transcription and synthesis, leveraging edge-cloud parallelism to enhance processing speed and accessibility for Kinyarwanda and Swahili speakers. It addresses the scarcity of powerful language processing tools for these widely spoken languages in East African countries with limited technological infrastructure. The framework utilizes the Whisper and SpeechT5 pre-trained models to enable speech-to-text (STT) and text-to-speech (TTS) translation. The architecture uses a cascading mechanism that distributes the model inference workload between the edge device and the cloud, thereby reducing latency and resource usage, benefiting both ends. On the edge device, our approach achieves a memory usage compression of 9.5% for the SpeechT5 model and 14% for the Whisper model, with a maximum memory usage of 149 MB. Experimental results indicate that on a 1.7 GHz CPU edge device with a 1 MB/s network bandwidth, the system can process a 270-character text in less than a minute for both speech-to-text and text-to-speech transcription. Using real-world survey data from Kenya, it is shown that the proposed cascaded edge-cloud architecture could serve as an excellent platform for STT and TTS transcription with good accuracy and response time.

I. INTRODUCTION

In today's digital age, the need for accurate and efficient speech transcription and synthesis models has been increasing rapidly. These models play an important role in a variety of applications, such as learning new languages, accessibility tools for people with reading or hearing difficulties, and automated voice assistants [1]. Kinyarwanda and Swahili are two of the local languages spoken in East Africa. Swahili is the most widely spoken language in Eastern Africa, with the number of speakers estimated at anywhere from 60 million to over 150 million [2].
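The cascading edge-cloud dispatch described in the abstract can be illustrated with a small sketch. This is a hypothetical reconstruction, not the authors' released code: the tier choice, the throughput figures, and the fixed cloud overhead below are all assumed placeholder values (only the 1 MB/s link speed comes from the abstract). The idea is simply that a request is served by the compressed on-device model when its estimated edge latency beats the network transfer plus cloud inference time, and offloaded otherwise.

```python
def choose_tier(n_chars: int,
                edge_chars_per_sec: float = 5.0,          # assumed edge TTS throughput
                bandwidth_bytes_per_sec: float = 1_000_000,  # 1 MB/s link (from the abstract)
                audio_bytes_per_char: float = 2_000,      # assumed synthesized-audio payload size
                cloud_chars_per_sec: float = 50.0,        # assumed cloud throughput
                cloud_overhead_sec: float = 3.0) -> str:  # assumed fixed round-trip/queueing cost
    """Pick the lower-latency tier for a TTS request of n_chars characters."""
    edge_latency = n_chars / edge_chars_per_sec
    transfer = (n_chars * audio_bytes_per_char) / bandwidth_bytes_per_sec
    cloud_latency = cloud_overhead_sec + transfer + n_chars / cloud_chars_per_sec
    return "edge" if edge_latency <= cloud_latency else "cloud"
```

Under these assumed numbers, a short prompt stays on the device (the fixed cloud overhead dominates), while a 270-character text, which would take roughly a minute on the edge CPU, is offloaded to the cloud.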


Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model

Lehečka, Jan, Hanzlíček, Zdeněk, Matoušek, Jindřich, Tihelka, Daniel

arXiv.org Artificial Intelligence

In this paper, we experimented with the SpeechT5 model pre-trained on large-scale datasets. We pre-trained the foundation model from scratch and fine-tuned it on a large-scale, robust multi-speaker text-to-speech (TTS) task. We tested the model's capabilities in zero- and few-shot scenarios. Based on two listening tests, we evaluated the quality of the synthetic audio and how closely the synthetic voices resemble the real ones. Our results showed that the SpeechT5 model can generate a synthetic voice for any speaker using only one minute of the target speaker's data. We successfully demonstrated the high quality and similarity of our synthetic voices on publicly known Czech politicians and celebrities.