JOOCI: a Framework for Learning Comprehensive Speech Representations
Hemant Yadav, Rajiv Ratn Shah, Sunayana Sitaram
–arXiv.org Artificial Intelligence
Information in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). Current state-of-the-art (SOTA) techniques model speech over fixed-length segments, usually 10-25 ms, using a single embedding. Given the orthogonal nature of the "other" and "content" information, attempting to optimize both within a single embedding yields suboptimal solutions: the model's capacity is split between the two objectives, limiting its ability to build complex hierarchical features effectively. In this work, we present an end-to-end speech representation learning framework designed to jointly optimize the "other" and "content" information (JOOCI) in speech. Our results show that JOOCI consistently outperforms other SOTA models of similar size (100 million parameters) and pre-training data (960 hours) by a significant margin when evaluated on a range of downstream speech tasks from the SUPERB benchmark, as shown in Table 1. Code and models are available at TBA.

Self-supervised learning (SSL) has played a significant role in learning high-level representations of text (Brown et al., 2020), vision (Alexey, 2020), and audio (Baevski et al., 2020; Mohamed et al., 2022; Défossez et al., 2022) data. In this work, we focus on learning high-level representations from raw speech.
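The abstract does not detail the architecture, but the central idea, giving "content" and "other" information separate embedding streams rather than one shared vector per frame, can be illustrated with a minimal PyTorch sketch. The module names, layer sizes, and front end below are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class DualPathEncoder(nn.Module):
    # Toy two-stream encoder: a shared convolutional front end feeds
    # two independent heads, so "content" and "other" information are
    # optimized in separate embeddings rather than competing for
    # capacity inside a single vector per frame. Sizes are hypothetical.
    def __init__(self, dim: int = 256):
        super().__init__()
        # Shared front end over the raw waveform.
        self.frontend = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5),
            nn.GELU(),
        )
        self.content_head = nn.Linear(dim, dim)  # "what is being said"
        self.other_head = nn.Linear(dim, dim)    # "how it is expressed"

    def forward(self, wav: torch.Tensor):
        # wav: (batch, samples) -> frame features (batch, frames, dim)
        feats = self.frontend(wav.unsqueeze(1)).transpose(1, 2)
        return self.content_head(feats), self.other_head(feats)

encoder = DualPathEncoder()
content_emb, other_emb = encoder(torch.randn(2, 16000))  # 1 s at 16 kHz
print(content_emb.shape, other_emb.shape)  # two frame-level streams

A joint objective would then apply a separate loss to each stream (for instance, a masked-prediction loss on the content embeddings and an utterance-level loss on the other embeddings), which is the kind of setup "jointly optimize" suggests here.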
Oct-16-2024