Goto

Collaborating Authors

 Takamichi, Shinnosuke


A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization

arXiv.org Artificial Intelligence

The marmoset, a highly vocal primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanisms, in comparison with human infant language development. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work used a CNN to build a joint model for call segmentation, classification, and caller identification of marmoset vocalizations. However, CNNs have limitations in modeling long-range acoustic patterns. The Transformer architecture, which has been shown to outperform CNNs, uses self-attention to process information in parallel over long distances and to capture the global structure of marmoset vocalizations. We therefore propose using the Transformer to jointly segment and classify marmoset calls and identify the caller of each vocalization.
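As an illustration of the joint formulation described above, the following is a minimal sketch (not the authors' implementation) of a Transformer encoder over log-mel frames with three frame-level heads for segmentation, call-type classification, and caller identification; all dimensions, class counts, and module names are assumptions.

```python
# Minimal sketch (not the authors' code): a Transformer encoder over log-mel
# frames with three frame-level heads for joint segmentation, call-type
# classification, and caller identification. Dimensions and class counts are
# illustrative assumptions.
import torch
import torch.nn as nn

class JointMarmosetTransformer(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_layers=6, n_heads=4,
                 n_call_types=11, n_callers=10):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)  # frame features -> model dimension
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.seg_head = nn.Linear(d_model, 2)              # call vs. no-call per frame
        self.type_head = nn.Linear(d_model, n_call_types)  # call type per frame
        self.caller_head = nn.Linear(d_model, n_callers)   # caller identity per frame

    def forward(self, mel):  # mel: (batch, frames, n_mels)
        h = self.encoder(self.proj(mel))  # self-attention over all frames
        return self.seg_head(h), self.type_head(h), self.caller_head(h)

# Joint training would sum the cross-entropy losses of the three heads.
model = JointMarmosetTransformer()
seg, call_type, caller = model(torch.randn(2, 300, 80))
```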


SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

arXiv.org Artificial Intelligence

This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems that are also trained on the same real-world data. Robust recognition systems require training data recorded in varied acoustic environments with different levels of noise. However, existing datasets typically include clean, high-quality recordings (bona fide data) because of the requirements for TTS training; studio-quality or well-recorded read speech is usually necessary to train TTS models. Existing SDD datasets also have limited usefulness for training SASV models due to insufficient speaker diversity. SpoofCeleb leverages a fully automated pipeline that processes the VoxCeleb1 dataset, transforming it into a form suitable for TTS training, on which we subsequently train 23 contemporary TTS systems. The resulting dataset comprises over 2.5 million utterances from 1,251 unique speakers, collected under natural, real-world conditions. It includes carefully partitioned training, validation, and evaluation sets with well-controlled experimental protocols. We provide baseline results for both SDD and SASV tasks. All data, protocols, and baselines are publicly available at https://jungjee.github.io/spoofceleb.
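To make the SDD framing concrete, the following is a minimal sketch of assembling a binary bona fide/spoof training list; the directory layout and file naming are assumptions, not the released SpoofCeleb structure.

```python
# Illustrative sketch only: an SDD training list pairing bona fide utterances
# (derived from VoxCeleb1) with TTS-generated spoofed utterances under binary
# labels. The directory layout is an assumption, not the released structure.
from pathlib import Path

def build_sdd_trials(bonafide_dir: str, spoof_dir: str):
    trials = [(str(p), 1) for p in Path(bonafide_dir).rglob("*.wav")]  # 1 = bona fide
    trials += [(str(p), 0) for p in Path(spoof_dir).rglob("*.wav")]    # 0 = spoofed (TTS)
    return trials

trials = build_sdd_trials("spoofceleb/train/bonafide", "spoofceleb/train/spoof")
```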


Textless Dependency Parsing by Labeled Sequence Prediction

arXiv.org Artificial Intelligence

Traditional spoken language processing involves cascading an automatic speech recognition (ASR) system into text processing models. In contrast, "textless" methods process speech representations without ASR systems, enabling the direct use of acoustic speech features. Although their effectiveness in capturing acoustic features has been shown, it is unclear whether they capture lexical knowledge. This paper proposes a textless method for dependency parsing and examines its effectiveness and limitations. Our proposed method predicts a dependency tree from a speech signal without transcription, representing the tree as a labeled sequence. While the cascading method outperforms the textless method in overall parsing accuracy, the latter excels in instances with important acoustic features. Our findings highlight the importance of fusing word-level representations and sentence-level prosody for enhanced parsing performance. The code and models are publicly available at https://github.com/mynlp/SpeechParser.
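To illustrate the "tree as a labeled sequence" representation, the following sketch serializes a dependency tree into per-word labels of relative head offset plus relation; this particular serialization is an assumption used only to convey the idea, not the paper's exact scheme.

```python
# Illustrative sketch (not the paper's exact scheme): one common way to cast a
# dependency tree as a labeled sequence is to emit, per word, its relative head
# offset plus its relation label.
def tree_to_labeled_sequence(heads, deprels):
    """heads[i] is the 1-based head of word i+1 (0 = root); deprels[i] is its relation."""
    labels = []
    for i, (head, rel) in enumerate(zip(heads, deprels), start=1):
        offset = 0 if head == 0 else head - i  # relative position of the head
        labels.append(f"{offset:+d}:{rel}" if head else f"root:{rel}")
    return labels

# "She saw stars" with "saw" as the root:
print(tree_to_labeled_sequence([2, 0, 2], ["nsubj", "root", "obj"]))
# ['+1:nsubj', 'root:root', '-1:obj']
```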


YODAS: Youtube-Oriented Dataset for Audio and Speech

arXiv.org Artificial Intelligence

In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training, while the unlabeled subsets are apt for self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology used for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top 15 languages.


RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

arXiv.org Artificial Intelligence

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E uses the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Comprehensive objective and subjective evaluations demonstrate that, compared to the powerful baseline VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from 68% to 4%.
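As an illustration of duration-guided attention, the following sketch builds a mask from predicted per-phoneme durations so that each speech-token step attends only to its aligned phoneme plus a small window around it; the window size and masking style are assumptions rather than RALL-E's exact implementation.

```python
# Minimal sketch (assumptions, not RALL-E's exact implementation): turn predicted
# per-phoneme durations into a cross-attention mask so that each speech-token step
# can only attend to the phoneme it is aligned to plus a small window around it.
import torch

def duration_guided_mask(durations, window=1):
    """durations: (n_phonemes,) predicted length of each phoneme in speech tokens.
    Returns a (total_steps, n_phonemes) boolean mask; True = attention allowed."""
    phone_ids = torch.repeat_interleave(
        torch.arange(len(durations)), durations)        # aligned phoneme per step
    steps = phone_ids.unsqueeze(1)                      # (T, 1)
    phones = torch.arange(len(durations)).unsqueeze(0)  # (1, N)
    return (phones - steps).abs() <= window             # focus window around the alignment

mask = duration_guided_mask(torch.tensor([2, 3, 1]), window=1)
print(mask.shape)  # torch.Size([6, 3])
# The mask (or its log) can then be added to the attention logits before the softmax.
```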


Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

arXiv.org Artificial Intelligence

We present a comprehensive empirical study of personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like, spontaneous speech synthesis is required. We therefore focus on personalized spontaneous speech synthesis that can clone both an individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which are known in psychology and linguistics to play an important role in speech generation and communication. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained on a multi-speaker corpus. The results clarify the position-word entanglement of filled pauses, i.e., the necessity of precisely predicting positions for naturalness and of precisely predicting words for individuality in the evaluation of synthesized speech.
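To make the external filled pause predictor concrete, the following is an illustrative word-level tagger that predicts, after each word, whether to insert a filled pause and which filler word to use; the filler inventory, vocabulary, and model sizes are assumptions, not the paper's model.

```python
# Illustrative sketch (not the paper's model): a word-level tagger that predicts,
# after each word, whether to insert a filled pause and which filler word to use.
# Vocabulary, filler inventory ("none", "eto", "ano"), and sizes are assumptions.
import torch
import torch.nn as nn

class FilledPausePredictor(nn.Module):
    def __init__(self, vocab_size=8000, emb=128, hidden=256,
                 fillers=("none", "eto", "ano")):
        super().__init__()
        self.fillers = fillers
        self.embed = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        # Joint label: position ("insert here or not") and word (which filler).
        self.head = nn.Linear(2 * hidden, len(fillers))

    def forward(self, word_ids):  # (batch, n_words)
        h, _ = self.rnn(self.embed(word_ids))
        return self.head(h)       # (batch, n_words, n_filler_labels)

model = FilledPausePredictor()
logits = model(torch.randint(0, 8000, (1, 12)))
print(logits.argmax(-1))  # predicted filler (or "none") after each word
```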


Do learned speech symbols follow Zipf's law?

arXiv.org Artificial Intelligence

In this study, we investigate whether speech symbols learned through deep learning follow Zipf's law, akin to natural language symbols. Zipf's law is an empirical law describing the frequency distribution of words and forms a foundation for statistical analysis in natural language processing. Natural language symbols, which were invented by humans to symbolize speech content, are known to comply with this law. Recent breakthroughs in spoken language processing, on the other hand, have given rise to learned speech symbols: data-driven symbolizations of speech content. Our objective is to ascertain whether these data-driven speech symbols follow Zipf's law, just as natural language symbols do. Through our investigation, we aim to open new avenues for the statistical analysis of spoken language processing.
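The kind of check described above can be sketched as follows: count how often each learned symbol occurs, sort by rank, and fit the slope of log frequency versus log rank, which Zipf's law predicts to be roughly -1. The random unit sequence below merely stands in for real learned speech symbols.

```python
# Minimal sketch of a Zipf check on discrete speech units: rank-frequency
# statistics and a log-log slope fit. The unit sequence is a random placeholder
# for symbols produced by an actual speech tokenizer.
import numpy as np
from collections import Counter

units = np.random.randint(0, 100, size=10_000)  # placeholder for learned symbols
freqs = np.array(sorted(Counter(units.tolist()).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"fitted Zipf exponent: {-slope:.2f}")  # ~1 would indicate Zipfian behavior
```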


How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

arXiv.org Artificial Intelligence

We examine the speech modeling potential of generative spoken language modeling (GSLM), which uses learned symbols derived from data, rather than phonemes, for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents findings on GSLM's encoding and decoding effectiveness at the spoken-language and speech levels. Through speech resynthesis experiments, we reveal that resynthesis errors occur at levels ranging from phonology to syntactics, and that GSLM frequently resynthesizes natural but content-altered speech.
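One way to quantify "natural but content-altered" resynthesis, sketched below under the assumption that transcripts of the original and resynthesized speech are available, is to compute a word error rate between them with a plain word-level edit distance; this is an illustration, not the paper's evaluation protocol.

```python
# Illustrative sketch (not the paper's protocol): compare the transcript of an
# original utterance with the transcript of its GSLM resynthesis using word
# error rate, computed with a plain word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref), 1)

# e.g., a substitution introduced by unit-based resynthesis
print(word_error_rate("she sells sea shells", "she sells see shells"))  # 0.25
```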


Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

arXiv.org Artificial Intelligence

While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource languages for which only textual resources are available, making TTS accessible to thousands of languages. Inspired by the strong cross-lingual transferability of multilingual language models, our framework first performs masked language model pretraining with multilingual text-only data. We then train this model on paired data in a supervised manner while freezing a language-aware embedding layer. This allows inference even for languages that are not included in the paired data but are present in the text-only data. Evaluation results demonstrate highly intelligible zero-shot TTS with a character error rate of less than 12% for an unseen language.
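The freezing step can be sketched as follows: after multilingual masked language model pretraining, the language-aware embedding layer is kept fixed while the rest of the encoder is trained on paired data; the module names and sizes here are assumptions.

```python
# Minimal sketch (assumed module names and sizes): after multilingual masked-LM
# pretraining, the language-aware embedding layer is frozen while the rest of the
# model is trained on paired text-speech data, so languages seen only as text
# keep usable embeddings at inference time.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, n_tokens=256, n_langs=100, d_model=256):
        super().__init__()
        self.token_embed = nn.Embedding(n_tokens, d_model)
        self.lang_embed = nn.Embedding(n_langs, d_model)  # language-aware embedding layer
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)

    def forward(self, tokens, lang_id):
        x = self.token_embed(tokens) + self.lang_embed(lang_id).unsqueeze(1)
        return self.body(x)

encoder = TextEncoder()
out = encoder(torch.randint(0, 256, (2, 16)), torch.tensor([3, 7]))  # (batch, seq), lang ids
# ... masked-LM pretraining on multilingual text-only data would happen here ...

for p in encoder.lang_embed.parameters():  # freeze the language embeddings
    p.requires_grad = False
optimizer = torch.optim.Adam(
    [p for p in encoder.parameters() if p.requires_grad], lr=1e-4)
# Supervised TTS training on paired data then updates everything except lang_embed.
```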


CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

arXiv.org Artificial Intelligence

We present CALLS, a Japanese speech corpus that treats phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teacher, acting as an operator in simulated phone calls. We describe the corpus construction methodology and analyze the recorded speech. We also conduct EDSS experiments using the CALLS and STUDIES corpora to investigate the effect of domain differences. The results show that mixing the two corpora during training causes biased improvements in the quality of synthetic speech due to the different degrees of expressiveness. The project page of the corpus is http://sython.org/Corpus/STUDIES-2.