New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
Lu, Xugang, Shen, Peng, Tsao, Yu, Kawai, Hisashi
Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we offer a new insight that regards alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries through soft and partial matching between the acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.
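The unbalanced optimal transport idea in this abstract can be illustrated with generalized Sinkhorn scaling iterations. The sketch below is not the authors' implementation; the cost matrix, masses, and relaxation parameters are illustrative. A very large `tau_b` makes the token-side marginal effectively hard (every linguistic token keeps its mass), while a finite `tau_a` lets noisy or silent frames shed mass.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.05, tau_a=1.0, tau_b=1e6, n_iter=500):
    """Entropic unbalanced OT via Sinkhorn-like scaling iterations.

    C: (n, m) cost between acoustic frames and linguistic tokens.
    a, b: frame and token masses. A finite tau_a only softly enforces the
    acoustic marginal, so frames far from every token (noise, silence) may
    transport little mass; a very large tau_b keeps the token marginal
    (approximately) exact, grounding every token in some acoustic frame.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    ra = tau_a / (tau_a + eps)   # ra < 1: relaxed acoustic marginal
    rb = tau_b / (tau_b + eps)   # rb ~ 1: near-hard token marginal
    for _ in range(n_iter):
        u = (a / (K @ v)) ** ra
        v = (b / (K.T @ u)) ** rb
    return u[:, None] * K * v[None, :]   # transport plan (n, m)
```

In a toy setting with one frame far from both tokens, the resulting plan should preserve each token's mass while assigning the outlier frame almost nothing, mirroring the detection view described above.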
A Implementation Details
The hyperparameter details are described in Table 9. We conduct ASR and ASV evaluations to compare the above methods. Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by speaker. Table 11 shows that the adaptation quality improves with an increase in the number of samples. Phoneme predictor: we conduct an ablation study of the phoneme predictor. Following (Kim et al., 2021), we remove the bias parameter of the phoneme predictor, as it causes unstable training during mixed-precision training.
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Ersoy, Asım, Mousi, Basel, Chowdhury, Shammur, Alam, Firoj, Dalvi, Fahim, Durrani, Nadir
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties that can be associated with general intelligence. This raises an intriguing question: do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we make our scripts and other resources available to the community.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Qatar (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
Transfer the linguistic representations from TTS to accent conversion with non-parallel data
Chen, Xi, Pei, Jiakun, Xue, Liumeng, Zhang, Mingyang
Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and of different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and the incorporation of richer semantic features, which significantly enhance audio quality and intelligibility.
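The idea of aligning speech representations with TTS-derived linguistic representations without parallel data can be pictured as a soft attention from frames to tokens. This is a hypothetical minimal sketch (dot-product attention with a mean-squared pull-in loss), not the paper's actual model; `speech`, `text`, and `temp` are illustrative names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(speech, text, temp=0.1):
    """Soft-attention alignment of frames to TTS linguistic features.

    speech: (T, d) frame-level speech representations.
    text:   (N, d) token-level linguistic representations from a TTS front end.
    Each frame attends over all tokens; the loss pulls every frame toward its
    attended linguistic vector, so no frame-level transcripts or parallel
    utterances are required.
    """
    sim = speech @ text.T                 # (T, N) dot-product similarities
    attn = softmax(sim / temp, axis=-1)   # soft, differentiable alignment
    aligned = attn @ text                 # (T, d) expected linguistic vector
    return np.mean((speech - aligned) ** 2), attn
```

When a frame representation already matches one token's representation, the attention should peak on that token and the loss should approach zero.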
Generative linguistic representation for spoken language identification
Shen, Peng, Lu, Xugang, Kawai, Hisashi
Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the decoder-based network from the Whisper model to extract linguistic features through its generative mechanism for improving the classification accuracy in LID tasks. We devised two strategies - one based on the language embedding method and the other focusing
Ren et al. proposed a two-step training process, which first trains an acoustic model with a connectionist temporal classification (CTC), then a recurrent neural network classifies the language category using the intermediate features derived from the acoustic model as inputs [10]. Multi-task training methods have also been investigated, which enhance performance and bolster model robustness. This method utilizes the shared underlying feature extraction network and jointly trains objective functions for speech/phoneme recognition and language recognition [9, 11, 12]. Consideration has also been given to self-supervised phonotactic representations that use context information [13, 14].
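The "generative mechanism" mentioned above can be made concrete with a toy sketch: Whisper's decoder predicts a language token at its first decoding step, so scoring only the candidate language tokens at that step turns the generative model into an LID classifier. The logits and token ids below are stand-ins, not the real Whisper vocabulary or API.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def generative_lid(first_step_logits, lang_token_ids):
    """Generative language ID in the style of Whisper's decoder.

    first_step_logits: decoder logits over the vocabulary at the first
    decoding step (illustrative stand-in).
    lang_token_ids: mapping from language name to the vocabulary id of its
    special language token (also illustrative).
    Returns the argmax language and the per-language log-probabilities.
    """
    logp = log_softmax(first_step_logits)
    scores = {lang: float(logp[tid]) for lang, tid in lang_token_ids.items()}
    return max(scores, key=scores.get), scores
```

In practice the scores could also be renormalized over the language tokens only, which is how restricting a generative decoder to a closed label set yields a proper classifier.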
HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
Lee, Sang-Hoon, Choi, Ha-Yeong, Oh, Hyung-Seok, Lee, Seong-Whan
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
Linguistic representations for fewer-shot relation extraction across domains
Gururaja, Sireesh, Dutt, Ritam, Liao, Tinglong, Rose, Carolyn
Recent work has demonstrated the positive impact of incorporating linguistic representations as additional context and scaffolding on the in-domain performance of several NLP tasks. We extend this work by exploring the impact of linguistic representations on cross-domain performance in a few-shot transfer setting. An important question is whether linguistic representations enhance generalizability by providing features that function as cross-domain pivots. We focus on the task of relation extraction on three datasets of procedural text in two domains, cooking and materials science. Our approach augments a popular transformer-based architecture by alternately incorporating syntactic and semantic graphs constructed by freely available off-the-shelf tools. We examine their utility for enhancing generalization, and investigate whether earlier findings, e.g. that semantic representations can be more helpful than syntactic ones, extend to relation extraction in multiple domains. We find that while the inclusion of these graphs results in significantly higher performance in few-shot transfer, both types of graph exhibit roughly equivalent utility.
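One simple way to picture "incorporating syntactic and semantic graphs" alongside a transformer encoder is a single GCN-style propagation over the parser's adjacency matrix, concatenated with the original token features. This is an illustrative sketch under assumed shapes, not the paper's architecture; `H` and `A` are hypothetical inputs.

```python
import numpy as np

def graph_augment(H, A):
    """One GCN-style layer mixing token features along parser edges.

    H: (n, d) contextual token embeddings from the base encoder.
    A: (n, n) symmetric adjacency of the syntactic (or semantic) graph
    produced by an off-the-shelf parser.
    Returns each token's original features concatenated with its
    graph-smoothed neighborhood, a simple stand-in for graph scaffolding.
    """
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)  # row-normalize
    G = np.maximum(D_inv * (A_hat @ H), 0.0)        # propagate + ReLU
    return np.concatenate([H, G], axis=1)           # (n, 2d) scaffolded features
```

Swapping the adjacency between a dependency parse and a semantic graph is then a one-argument change, which is roughly the kind of alternation the study performs.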
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
Emergent Linguistic Structures in Neural Networks are Fragile
La Malfa, Emanuele, Wicker, Matthew, Kwiatkowska, Marta
Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structures. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce measures of robustness of neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks used to extract meaningful information about a single facet of a language model, such as syntax reconstruction and root identification. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures by analysing their performance and robustness with respect to syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models and logs available to the community as a contribution to the debate about the capabilities of LLMs.
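The probing-plus-robustness recipe above can be sketched end to end: fit a linear probe on frozen representations, then measure the accuracy drop under a perturbation. This is a minimal stand-in, assuming synthetic data; the real framework perturbs the input sentences (syntax-preserving edits) rather than adding noise to the embeddings as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_probe(X, y, n_classes, reg=1e-3):
    """Ridge-regression probe on frozen representations (one-hot targets)."""
    Y = np.eye(n_classes)[y]
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def probe_accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())

def robustness_gap(W, X, y, noise=0.5):
    """Accuracy drop under a representation-level perturbation, standing in
    for a syntax-preserving edit of the underlying sentence."""
    X_pert = X + noise * rng.standard_normal(X.shape)
    return probe_accuracy(W, X, y) - probe_accuracy(W, X_pert, y)
```

A probe that scores well on clean representations but shows a large gap under perturbation is exactly the "high accuracy, fragile structure" pattern the paper warns about.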
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)