New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
Lu, Xugang, Shen, Peng, Tsao, Yu, Kawai, Hisashi
Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we offer a new insight that regards alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries through soft and partial matching between the acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model with experiments on a CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling the degree of matching and hence improving ASR performance.
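The unbalanced optimal transport idea in this abstract can be illustrated with generalized Sinkhorn scaling iterations. The sketch below is not the authors' implementation; the cost matrix, masses, and relaxation parameters are illustrative. A very large `tau_b` makes the token-side marginal effectively hard (every linguistic token keeps its mass), while a finite `tau_a` lets noisy or silent frames shed mass.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, eps=0.05, tau_a=1.0, tau_b=1e6, n_iter=500):
    """Entropic unbalanced OT via Sinkhorn-like scaling iterations.

    C: (n, m) cost between acoustic frames and linguistic tokens.
    a, b: frame and token masses. A finite tau_a only softly enforces the
    acoustic marginal, so frames far from every token (noise, silence) may
    transport little mass; a very large tau_b keeps the token marginal
    (approximately) exact, grounding every token in some acoustic frame.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    ra = tau_a / (tau_a + eps)   # ra < 1: relaxed acoustic marginal
    rb = tau_b / (tau_b + eps)   # rb ~ 1: near-hard token marginal
    for _ in range(n_iter):
        u = (a / (K @ v)) ** ra
        v = (b / (K.T @ u)) ** rb
    return u[:, None] * K * v[None, :]   # transport plan (n, m)
```

In a toy setting with one frame far from both tokens, the resulting plan should preserve each token's mass while assigning the outlier frame almost nothing, mirroring the detection view described above.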
A Implementation Details
The hyperparameter details are described in Table 9. We conduct ASR and ASV evaluations to compare the above methods. Similar to the previous analysis of XLSR-53 (Choi et al., 2021), the representations from the 1st layer of XLS-R are already clustered by speaker. Table 11 shows that the adaptation quality improves with an increase in the number of samples. Phoneme predictor: we conduct an ablation study of the phoneme predictor. Following (Kim et al., 2021), we remove the bias parameter of the phoneme predictor, as it causes unstable training during mixed-precision training.
From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models
Ersoy, Asım, Mousi, Basel, Chowdhury, Shammur, Alam, Firoj, Dalvi, Fahim, Durrani, Nadir
The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts, showcasing properties that can be associated with general intelligence. This raises an intriguing question: do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities, do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models, both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility, we make our scripts and other resources available to the community.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Middle East > Qatar (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
Transfer the linguistic representations from TTS to accent conversion with non-parallel data
Chen, Xi, Pei, Jiakun, Xue, Liumeng, Zhang, Mingyang
Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and of different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and the incorporation of richer semantic features, which significantly enhance audio quality and intelligibility.
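The idea of aligning speech representations with TTS-derived linguistic representations without parallel data can be pictured as a soft attention from frames to tokens. This is a hypothetical minimal sketch (dot-product attention with a mean-squared pull-in loss), not the paper's actual model; `speech`, `text`, and `temp` are illustrative names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def alignment_loss(speech, text, temp=0.1):
    """Soft-attention alignment of frames to TTS linguistic features.

    speech: (T, d) frame-level speech representations.
    text:   (N, d) token-level linguistic representations from a TTS front end.
    Each frame attends over all tokens; the loss pulls every frame toward its
    attended linguistic vector, so no frame-level transcripts or parallel
    utterances are required.
    """
    sim = speech @ text.T                 # (T, N) dot-product similarities
    attn = softmax(sim / temp, axis=-1)   # soft, differentiable alignment
    aligned = attn @ text                 # (T, d) expected linguistic vector
    return np.mean((speech - aligned) ** 2), attn
```

When a frame representation already matches one token's representation, the attention should peak on that token and the loss should approach zero.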
Generative linguistic representation for spoken language identification
Shen, Peng, Lu, Xugang, Kawai, Hisashi
Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the decoder-based network from the Whisper model to extract linguistic features through its generative mechanism for improving the classification accuracy in LID tasks. We devised two strategies - one based on the language embedding method and the other focusing
Ren et al. proposed a two-step training process, which first trains an acoustic model with a connectionist temporal classification (CTC), then a recurrent neural network classifies the language category using the intermediate features derived from the acoustic model as inputs [10]. Multi-task training methods have also been investigated, which enhance performance and bolster model robustness. This method utilizes the shared underlying feature extraction network and jointly trains objective functions for speech/phoneme recognition and language recognition [9, 11, 12]. Consideration has also been given to self-supervised phonotactic representations that use context information [13, 14].
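The "generative mechanism" mentioned above can be made concrete with a toy sketch: Whisper's decoder predicts a language token at its first decoding step, so scoring only the candidate language tokens at that step turns the generative model into an LID classifier. The logits and token ids below are stand-ins, not the real Whisper vocabulary or API.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def generative_lid(first_step_logits, lang_token_ids):
    """Generative language ID in the style of Whisper's decoder.

    first_step_logits: decoder logits over the vocabulary at the first
    decoding step (illustrative stand-in).
    lang_token_ids: mapping from language name to the vocabulary id of its
    special language token (also illustrative).
    Returns the argmax language and the per-language log-probabilities.
    """
    logp = log_softmax(first_step_logits)
    scores = {lang: float(logp[tid]) for lang, tid in lang_token_ids.items()}
    return max(scores, key=scores.get), scores
```

In practice the scores could also be renormalized over the language tokens only, which is how restricting a generative decoder to a closed label set yields a proper classifier.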
HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
Lee, Sang-Hoon, Choi, Ha-Yeong, Oh, Hyung-Seok, Lee, Seong-Whan
Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at \url{https://hiervst.github.io/}.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
Linguistic representations for fewer-shot relation extraction across domains
Gururaja, Sireesh, Dutt, Ritam, Liao, Tinglong, Rose, Carolyn
Recent work has demonstrated the positive impact of incorporating linguistic representations as additional context and scaffolding on the in-domain performance of several NLP tasks. We extend this work by exploring the impact of linguistic representations on cross-domain performance in a few-shot transfer setting. An important question is whether linguistic representations enhance generalizability by providing features that function as cross-domain pivots. We focus on the task of relation extraction on three datasets of procedural text in two domains, cooking and materials science. Our approach augments a popular transformer-based architecture by alternately incorporating syntactic and semantic graphs constructed by freely available off-the-shelf tools. We examine their utility for enhancing generalization, and investigate whether earlier findings, e.g. that semantic representations can be more helpful than syntactic ones, extend to relation extraction in multiple domains. We find that while the inclusion of these graphs results in significantly higher performance in few-shot transfer, both types of graph exhibit roughly equivalent utility.
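One simple way to picture "incorporating syntactic and semantic graphs" alongside a transformer encoder is a single GCN-style propagation over the parser's adjacency matrix, concatenated with the original token features. This is an illustrative sketch under assumed shapes, not the paper's architecture; `H` and `A` are hypothetical inputs.

```python
import numpy as np

def graph_augment(H, A):
    """One GCN-style layer mixing token features along parser edges.

    H: (n, d) contextual token embeddings from the base encoder.
    A: (n, n) symmetric adjacency of the syntactic (or semantic) graph
    produced by an off-the-shelf parser.
    Returns each token's original features concatenated with its
    graph-smoothed neighborhood, a simple stand-in for graph scaffolding.
    """
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)  # row-normalize
    G = np.maximum(D_inv * (A_hat @ H), 0.0)        # propagate + ReLU
    return np.concatenate([H, G], axis=1)           # (n, 2d) scaffolded features
```

Swapping the adjacency between a dependency parse and a semantic graph is then a one-argument change, which is roughly the kind of alternation the study performs.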
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
Emergent Linguistic Structures in Neural Networks are Fragile
La Malfa, Emanuele, Wicker, Matthew, Kwiatkowska, Marta
Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structures. In this paper, focusing on the ability of language models to represent syntax, we propose a framework to assess the consistency and robustness of linguistic representations. To this end, we introduce measures of robustness of neural network models that leverage recent advances in extracting linguistic constructs from LLMs via probing tasks, i.e., simple tasks used to extract meaningful information about a single facet of a language model, such as syntax reconstruction and root identification. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures by analysing their performance and robustness with respect to syntax-preserving perturbations. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving perturbations. Our key observation is that emergent syntactic representations in neural networks are brittle. We make the code, trained models and logs available to the community as a contribution to the debate about the capabilities of LLMs.
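The probing-plus-robustness recipe above can be sketched end to end: fit a linear probe on frozen representations, then measure the accuracy drop under a perturbation. This is a minimal stand-in, assuming synthetic data; the real framework perturbs the input sentences (syntax-preserving edits) rather than adding noise to the embeddings as done here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_probe(X, y, n_classes, reg=1e-3):
    """Ridge-regression probe on frozen representations (one-hot targets)."""
    Y = np.eye(n_classes)[y]
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return W

def probe_accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())

def robustness_gap(W, X, y, noise=0.5):
    """Accuracy drop under a representation-level perturbation, standing in
    for a syntax-preserving edit of the underlying sentence."""
    X_pert = X + noise * rng.standard_normal(X.shape)
    return probe_accuracy(W, X, y) - probe_accuracy(W, X_pert, y)
```

A probe that scores well on clean representations but shows a large gap under perturbation is exactly the "high accuracy, fragile structure" pattern the paper warns about.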
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)