AITopics | speech synthesis system

Collaborating Authors

speech synthesis system

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW

Mehta, Prateek, Patil, Anasuya

arXiv.org Artificial IntelligenceJun-19-2025

Abstract: Knowledge extraction just by listening to sounds is known a s a distinctive property. Visually impaired people are dependent solely on Braille books & audio recordings provided by NGOs. Owing to many constraints in above two approaches blind people can't access the book of their choice. As the speech form is a more effective means of communication than text as blind and visually impaired persons can easily respond to sounds. This paper aims to develop an accurate, reliable, cost effective, and user - friendly optical character recognition (OCR) based speech synthesis system.

artificial intelligence, optical character recognition, speech synthesis, (11 more...)

arXiv.org Artificial Intelligence

2506.15029

Country: Asia > India (0.29)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.60)

Technology: Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)

Add feedback

MunTTS: A Text-to-Speech System for Mundari

Gumma, Varun, Hada, Rishav, Yadavalli, Aditya, Gogoi, Pamir, Mondal, Ishani, Seshadri, Vivek, Bali, Kalika

arXiv.org Artificial IntelligenceJan-28-2024

We present MunTTS, an end-to-end text-to-speech (TTS) system specifically for Mundari, a low-resource Indian language of the Austo-Asiatic family. Our work addresses the gap in linguistic technology for underrepresented languages by collecting and processing data to build a speech synthesis system. We begin our study by gathering a substantial dataset of Mundari text and speech and train end-to-end speech models. We also delve into the methods used for training our models, ensuring they are efficient and effective despite the data constraints. We evaluate our system with native speakers and objective metrics, demonstrating its potential as a tool for preserving and promoting the Mundari language in the digital age.

international conference, mundari, speech, (17 more...)

arXiv.org Artificial Intelligence

2401.15579

Country:

Asia > Indonesia > Bali (0.05)
Asia > India > Jharkhand (0.04)
North America > United States > Maryland (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

Add feedback

Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

Yoshimura, Takenori, Takaki, Shinji, Nakamura, Kazuhiro, Oura, Keiichiro, Hono, Yukiya, Hashimoto, Kei, Nankaku, Yoshihiko, Tokuda, Keiichi

arXiv.org Artificial IntelligenceNov-21-2022

This paper integrates a classic mel-cepstral synthesis filter into a modern neural speech synthesis system towards end-to-end controllable speech synthesis. Since the mel-cepstral synthesis filter is explicitly embedded in neural waveform models in the proposed system, both voice characteristics and the pitch of synthesized speech are highly controlled via a frequency warping parameter and fundamental frequency, respectively. We implement the mel-cepstral synthesis filter as a differentiable and GPU-friendly module to enable the acoustic and waveform models in the proposed system to be simultaneously optimized in an end-to-end manner. Experiments show that the proposed system improves speech quality from a baseline system maintaining controllability. The core PyTorch modules used in the experiments will be publicly available on GitHub.

artificial intelligence, machine learning, synthesis filter, (17 more...)

arXiv.org Artificial Intelligence

2211.11222

Country:

Asia > Japan > Honshū > Chūbu > Aichi Prefecture > Nagoya (0.04)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System

#artificialintelligenceFeb-26-2022, 05:10:43 GMT

In this paper, we study the controllability of an Expressive TTS system trained on a dataset for a continuous control. The dataset is the Blizzard 2013 dataset based on audiobooks read by a female speaker containing a great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that a reference utterance.

representation, speech synthesis, synthesis, (12 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.32)

Add feedback

FMFCC-A: A Challenging Mandarin Dataset for Synthetic Speech Detection

Zhang, Zhenyu, Gu, Yewei, Yi, Xiaowei, Zhao, Xianfeng

arXiv.org Artificial IntelligenceOct-18-2021

As increasing development of text-to-speech (TTS) and voice conversion (VC) technologies, the detection of synthetic speech has been suffered dramatically. In order to promote the development of synthetic speech detection model against Mandarin TTS and VC technologies, we have constructed a challenging Mandarin dataset and organized the accompanying audio track of the first fake media forensic challenge of China Society of Image and Graphics (FMFCC-A). The FMFCC-A dataset is by far the largest publicly-available Mandarin dataset for synthetic speech detection, which contains 40,000 synthesized Mandarin utterances that generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterances collected from 58 speakers. The FMFCC-A dataset is divided into the training, development and evaluation sets, which are used for the research of detection of synthesized Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations. In addition to describing the construction of the FMFCC-A dataset, we provide a detailed analysis of two baseline methods and the top-performing submissions from the FMFCC-A, which illustrates the usefulness and challenge of FMFCC-A dataset. We hope that the FMFCC-A dataset can fill the gap of lack of Mandarin datasets for synthetic speech detection.

dataset, fmfcc-a dataset, speech synthesis system, (11 more...)

arXiv.org Artificial Intelligence

2110.09441

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis through Audio Analysis

Tits, Noé, Wang, Fengna, Haddad, Kevin El, Pagel, Vincent, Dutoit, Thierry

arXiv.org Artificial IntelligenceMar-27-2019

The field of Text-to-Speech has experienced huge improvements last years benefiting from deep learning techniques. Producing realistic speech becomes possible now. As a consequence, the research on the control of the expressiveness, allowing to generate speech in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impressive results. However the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will enable the possibility to build controllable speech synthesis systems with an understandable behaviour.

artificial intelligence, audio feature, machine learning, (14 more...)

arXiv.org Artificial Intelligence

1903.1157

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Scaling and bias codes for modeling speaker-adaptive DNN-based speech synthesis systems

Luong, Hieu-Thi, Yamagishi, Junichi

arXiv.org Machine LearningJul-30-2018

ABSTRACT Most neural-network based speaker-adaptive acoustic models for speech synthesis can be categorized into either layer-based or input-code approaches. Although both approaches have their own pros and cons, most existing works on speaker adaptation focus on improving one or the other. In this paper, after we first systematically overview the common principles of neural-network based speaker-adaptive models, we show that these approaches can be represented in a unified framework and can be generalized further. More specifically, we introduce the use of scaling and bias codes as generalized means for speaker-adaptive transformation. By utilizing these codes, we can create a more efficient factorized speaker-adaptive model and capture advantages of both approaches while reducing their disadvantages. The experiments show that the proposed method can improve the performance of speaker adaptation compared with speaker adaptation based on the conventional input code. Index Terms -- speech synthesis, speaker adaptation, neural network, factorization, speaker code 1. INTRODUCTION Recent speaker-dependent speech synthesis systems can generate high-quality reading speech indistinguishable from natural human speech when their training data is recorded in a quality-controlled condition and have sufficient amount of data [1].

artificial intelligence, machine learning, speech recognition, (17 more...)

arXiv.org Machine Learning

1807.11632

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Adigwe, Adaeze, Tits, Noé, Haddad, Kevin El, Ostadabbas, Sarah, Dutoit, Thierry

arXiv.org Artificial IntelligenceJun-25-2018

In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system converting neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test show the efficiency of the data which is promising for future work.

artificial intelligence, database, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1806.09514

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > California > Santa Clara County > Sunnyvale (0.04)
(3 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback