Goto

Collaborating Authors

 Shen, Feiyu


Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

arXiv.org Artificial Intelligence

Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.


On the Effectiveness of Acoustic BPE in Decoder-Only TTS

arXiv.org Artificial Intelligence

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But the gain in TTS has not been fully investigated, and the proper choice of acoustic BPE remains unclear. In this work, we conduct a comprehensive study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens.

To address this issue, one possible way is to further compress the discrete speech sequence. A promising approach is the acoustic byte-pair encoding (BPE) technique proposed in [15]. It is similar to the traditional BPE algorithm [16] in natural language processing: it treats the discrete indexes of speech as literal characters and iteratively merges consecutive tokens based on their frequency in the training corpus. Such compression coherently reduces sequence length as the vocabulary size increases. For discrete speech tokens, a group of multiple tokens usually occurs together to represent a specific phoneme or syllable, and organizing them into a unique modeling unit provides a higher level of abstraction.
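The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it treats a list of integer speech-token indices as the "text", repeatedly finds the most frequent adjacent pair, and replaces it with a fresh token id, so the sequence shrinks as the vocabulary grows. All function names here are illustrative.

```python
from collections import Counter

def most_frequent_pair(seq):
    """Return the most common adjacent token pair in the sequence."""
    return Counter(zip(seq, seq[1:])).most_common(1)[0][0]

def merge_pair(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def acoustic_bpe(seq, base_vocab, num_merges):
    """Iteratively merge the most frequent pair, assigning each merge a new id.

    `base_vocab` is the size of the original token inventory; merged units
    receive ids base_vocab, base_vocab + 1, ... Returns the compressed
    sequence and the learned merge table.
    """
    merges = []
    for step in range(num_merges):
        if len(seq) < 2:
            break
        pair = most_frequent_pair(seq)
        new_token = base_vocab + step
        merges.append((pair, new_token))
        seq = merge_pair(seq, pair, new_token)
    return seq, merges
```

For example, with a base vocabulary of 4 tokens, the sequence `[1, 2, 1, 2, 1, 2]` compresses to `[5, 4]` after two merges: the pair `(1, 2)` becomes token `4`, then `(4, 4)` becomes token `5`, halving the length while growing the vocabulary, which is the trade-off the study examines.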