

Sampling-Frequency-Independent Universal Sound Separation

arXiv.org Artificial Intelligence

This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). USS aims to separate arbitrary sources of different types and can be a key technique for realizing a source separator that is universally usable as a preprocessor for any downstream task. Such a universal source separator requires two essential properties: universality with respect to source types and universality with respect to recording conditions. The former property has been studied in the USS literature, which has greatly increased the number of source types that a single neural network can handle. However, the latter property (e.g., the SF) has received less attention despite its necessity. Since the SF varies widely depending on the downstream task, a universal source separator must handle a wide variety of SFs. In this paper, to encompass both properties, we propose an SF-independent (SFI) extension of a computationally efficient USS network, SuDoRM-RF. The proposed network uses our previously proposed SFI convolutional layers, which can handle various SFs by generating convolutional kernels in accordance with the input SF. Experiments show that signal resampling can degrade USS performance and that the proposed method works more consistently than signal-resampling-based methods across various SFs.
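As a rough illustration of the SFI idea, the sketch below (PyTorch) generates a 1-D convolutional kernel by sampling a learnable continuous-time prototype filter at whatever SF the input arrives with. The sinusoidal parameterization, the module name SFIConv1d, and all sizes are illustrative assumptions, not the paper's exact layer design.

# Minimal sketch of a sampling-frequency-independent (SFI) convolution:
# the kernel is obtained by sampling a continuous-time prototype filter,
# so its tap count adapts to the input SF while its support in seconds
# stays fixed. The sinusoidal prototype below is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFIConv1d(nn.Module):
    def __init__(self, out_channels, n_components=8, support_sec=0.005):
        super().__init__()
        self.support_sec = support_sec  # kernel support in seconds
        # Continuous-time parameters of the prototype filters:
        # per-channel frequencies (Hz), phases, and amplitudes.
        self.freq = nn.Parameter(torch.rand(out_channels, n_components) * 4000.0)
        self.phase = nn.Parameter(torch.zeros(out_channels, n_components))
        self.amp = nn.Parameter(0.1 * torch.randn(out_channels, n_components))

    def kernel(self, sf):
        # Sample the prototype at interval 1/sf; the number of taps
        # scales with the SF so the analog support is unchanged.
        n_taps = max(1, int(self.support_sec * sf))
        t = torch.arange(n_taps) / sf  # time axis in seconds
        arg = 2.0 * torch.pi * self.freq.unsqueeze(-1) * t + self.phase.unsqueeze(-1)
        w = (self.amp.unsqueeze(-1) * torch.cos(arg)).sum(dim=1)
        return w.unsqueeze(1)  # (out_channels, 1, n_taps) for conv1d

    def forward(self, x, sf):
        # x: (batch, 1, time) waveform sampled at sf [Hz]
        return F.conv1d(x, self.kernel(sf), padding="same")

The same trained instance could then process, say, 8 kHz and 48 kHz inputs via forward(x, 8000.0) and forward(x, 48000.0) without resampling, which is the property the experiments above evaluate.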


Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

arXiv.org Artificial Intelligence

Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various types of degradation, we use (i) a speech representation extracted by w2v-BERT as the input feature and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
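To illustrate how the two representations could be combined, here is a hedged PyTorch sketch in which a cleaning network maps noisy w2v-BERT speech features toward clean ones while cross-attending to PnG-BERT text embeddings. All module names, dimensions, and the cross-attention wiring are assumptions for illustration, not the paper's architecture.

# Hedged sketch: restore speech features under linguistic conditioning.
# FeatureCleaner and its sizes are illustrative, not Miipher's design.
import torch
import torch.nn as nn

class FeatureCleaner(nn.Module):
    def __init__(self, d_speech=1024, d_text=768, d_model=512):
        super().__init__()
        self.speech_proj = nn.Linear(d_speech, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Cross-attention lets each speech frame consult the transcript,
        # which is what makes the text a conditioning feature.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        self.out = nn.Linear(d_model, d_speech)

    def forward(self, speech_feat, text_emb):
        # speech_feat: (batch, frames, d_speech) degraded w2v-BERT features
        # text_emb:    (batch, tokens, d_text)   PnG-BERT embeddings
        q = self.speech_proj(speech_feat)
        kv = self.text_proj(text_emb)
        attended, _ = self.cross_attn(q, kv, kv)
        # Predict cleaned features; a vocoder would render the waveform.
        return self.out(q + attended)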


WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration

arXiv.org Artificial Intelligence

Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. DDPMs and GANs are characterized by an iterative denoising framework and by adversarial training, respectively. This study proposes a fast, high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal and trains a deep neural network (DNN) to minimize an adversarial loss calculated from the intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant difference in naturalness between natural human speech and speech synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than that of WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/
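The following sketch conveys the flavor of such a fixed-point iteration: one shared denoiser is applied a few times and every intermediate estimate is kept, since the training loss is computed at all iterations. The denoiser callable, the unit-power gain step, and the iteration count are simplified placeholders, not the paper's exact components.

# Hedged sketch of a WaveFit-style fixed-point iteration.
import torch

def wavefit_iterations(denoiser, y0, T=5):
    # y0: (batch, time) initial waveform estimate (e.g., shaped noise)
    y = y0
    intermediates = []
    for _ in range(T):
        y = y - denoiser(y)  # fixed-point update: remove estimated noise
        # Keep the iterate at unit power (simplified gain adjustment;
        # the paper's gain step is more elaborate).
        y = y * torch.rsqrt(y.pow(2).mean(dim=-1, keepdim=True) + 1e-8)
        intermediates.append(y)
    # During training, the adversarial (and any spectral) losses would
    # be summed over every element of `intermediates`.
    return intermediates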


Gamma Boltzmann Machine for Simultaneously Modeling Linear- and Log-amplitude Spectra

arXiv.org Machine Learning

In audio applications, one of the most important representations of audio signals is the amplitude spectrogram. It is utilized in many machine-learning-based information processing methods, including those using restricted Boltzmann machines (RBMs). However, the ordinary Gaussian-Bernoulli RBM (the most popular RBM variant) cannot directly handle amplitude spectra because the Gaussian distribution is a symmetric model that allows negative values, which never appear in amplitudes. In this paper, after proposing a general gamma Boltzmann machine, we propose a practical model called the gamma-Bernoulli RBM that simultaneously handles both linear- and log-amplitude spectrograms. Its conditional distribution over the observable data is given by the gamma distribution, so the proposed RBM can naturally handle positive-valued data such as amplitude spectra. It can also treat amplitudes on a logarithmic scale, which is important for audio signals from a perceptual point of view. The advantage of the proposed model over the ordinary Gaussian-Bernoulli RBM was confirmed in terms of PESQ and MSE in an experiment on representing the amplitude spectrograms of speech signals.
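As a hedged illustration of why the model couples linear and log amplitudes, the NumPy sketch below alternates Gibbs sampling between Bernoulli hidden units, driven by both v and log(v), and gamma-distributed visible units. The sign conventions and positivity handling are simplified relative to the paper's energy function.

# Hedged sketch of block Gibbs sampling in a gamma-Bernoulli RBM.
# W_lin couples the linear amplitudes, W_log the log amplitudes.
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(v, W_lin, W_log, c):
    # v: (n_visible,) positive amplitudes; hidden units are Bernoulli.
    # Both v and log(v) enter the logits, coupling the two views.
    logits = v @ W_lin + np.log(v) @ W_log + c
    p = 1.0 / (1.0 + np.exp(-logits))
    return (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W_lin, W_log, a, b):
    # Conditional gamma: the log-interaction term sets the shape and the
    # linear-interaction term sets the rate (clipped to stay positive,
    # a simplification of the paper's parameter constraints).
    shape = np.maximum(b + h @ W_log.T, 1e-3) + 1.0
    rate = np.maximum(a + h @ W_lin.T, 1e-3)
    return rng.gamma(shape, 1.0 / rate)  # NumPy takes scale = 1 / rate

Alternating these two steps yields the Gibbs chain that contrastive-divergence-style training of such an RBM would rely on.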