Peng, Junyi
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Tian, Jinchuan, Shi, Jiatong, Chen, William, Arora, Siddhant, Masuyama, Yoshiki, Maekaku, Takashi, Wu, Yihan, Peng, Junyi, Bharadwaj, Shikhar, Zhao, Yiwen, Cornell, Samuele, Peng, Yifan, Yue, Xiang, Yang, Chao-Han Huck, Neubig, Graham, Watanabe, Shinji
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks and evaluated across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.
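To make the task-template idea concrete, the following is a minimal, hypothetical sketch of how a speech task might be declared as a universal sequential modeling problem. The `TaskTemplate` class and the stream names (`codec_speech`, `text_bpe`, `spk_prompt`) are illustrative assumptions for this page, not the toolkit's actual API; see the repository linked above for the real task definitions and recipes.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a task-template declaration in the spirit of
# ESPnet-SpeechLM's "tasks as sequential modeling" framing. Class and
# field names are illustrative assumptions, not the toolkit's API.
@dataclass
class TaskTemplate:
    name: str                                        # task identifier, e.g. "asr" or "tts"
    conditions: list = field(default_factory=list)   # input token streams
    targets: list = field(default_factory=list)      # output token streams

# ASR: condition on discrete speech tokens, predict text tokens.
asr = TaskTemplate(name="asr",
                   conditions=["codec_speech"],
                   targets=["text_bpe"])

# TTS: condition on text and a speaker prompt, predict speech tokens.
tts = TaskTemplate(name="tts",
                   conditions=["text_bpe", "spk_prompt"],
                   targets=["codec_speech"])

def to_sequence(template: TaskTemplate) -> list:
    """Flatten a template into a single token-stream layout:
    <task> <conditions...> <targets...>."""
    return ([f"<{template.name}>"]
            + [f"<{c}>" for c in template.conditions]
            + [f"<{t}>" for t in template.targets])

print(to_sequence(asr))  # ['<asr>', '<codec_speech>', '<text_bpe>']
```

Framed this way, adding a new task reduces to declaring which streams condition the model and which it must predict, while the same preprocessing, pre-training, and inference machinery is reused across tasks.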
Investigation of Speaker Representation for Target-Speaker Speech Processing
Ashihara, Takanori, Moriya, Takafumi, Horiguchi, Shota, Peng, Junyi, Ochiai, Tsubasa, Delcroix, Marc, Matsuura, Kohei, Sato, Hiroshi
Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker's speech even when it is corrupted by interfering speakers. While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network that embeds target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. This paper therefore addresses a fundamental question: what is the preferred speaker embedding for TS tasks? To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector. To further understand the properties of the ideal speaker embedding, we optimize it with a gradient-based approach to improve performance on the TS tasks. Our analysis reveals that speaker verification performance is only weakly related to TS task performance, that one-hot vectors outperform enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
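As a rough illustration of the gradient-based embedding refinement described above, here is a minimal PyTorch sketch, assuming a frozen, differentiable TS model and a generic regression-style task loss; `optimize_embedding` and the toy model are hypothetical stand-ins, and the paper's actual losses and architectures differ per task (TS-ASR, TSE, p-VAD).

```python
import torch

# Minimal sketch: refine a target-speaker embedding by gradient descent,
# assuming a frozen, differentiable TS system `ts_model` that maps
# (mixture, speaker_embedding) -> prediction. All names here are
# illustrative assumptions, not the paper's implementation.
def optimize_embedding(ts_model, mixture, reference, init_emb,
                       steps=100, lr=1e-2):
    emb = init_emb.clone().detach().requires_grad_(True)  # start from one-hot/enrollment
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = ts_model(mixture, emb)                     # TS model stays frozen
        loss = torch.nn.functional.mse_loss(pred, reference)  # stand-in task loss
        loss.backward()                                   # gradients flow only into emb
        opt.step()
    return emb.detach()

# Toy usage with a stand-in linear "TS model" so the sketch runs end-to-end.
dummy = torch.nn.Linear(16, 8)
for p in dummy.parameters():
    p.requires_grad_(False)                               # freeze the TS model
mix, ref = torch.randn(8), torch.randn(8)
one_hot = torch.zeros(16)
one_hot[3] = 1.0                                          # ideal one-hot speaker cue
refined = optimize_embedding(lambda x, e: dummy(e) + 0.0 * x, mix, ref, one_hot)
```

The key point the sketch captures is that only the embedding is updated: comparing the refined embedding against the original one-hot or enrollment-derived vectors is what lets the paper probe how far each starting point is from an optimum for a given mixture.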