AITopics | asr encoder

Qieemo: Speech Is All You Need in the Emotion Recognition in Conversations

Chen, Jinming, Fang, Jingyi, Zheng, Yuanzhong, Wang, Yaoxuan, Fei, Haojun

arXiv.org Artificial Intelligence, Mar-5-2025

Emotion recognition plays a pivotal role in intelligent human-machine interaction systems. Multimodal approaches benefit from the fusion of diverse modalities, thereby improving the recognition accuracy. However, the lack of high-quality multimodal data and the challenge of achieving optimal alignment between different modalities significantly limit the potential for improvement in multimodal approaches. In this paper, the proposed Qieemo framework effectively utilizes the pretrained automatic speech recognition (ASR) model backbone which contains naturally frame aligned textual and emotional features, to achieve precise emotion classification solely based on the audio modality. Furthermore, we design the multimodal fusion (MMF) module and cross-modal attention (CMA) module in order to fuse the phonetic posteriorgram (PPG) and emotional features extracted by the ASR encoder for improving recognition accuracy. The experimental results on the IEMOCAP dataset demonstrate that Qieemo outperforms the benchmark unimodal, multimodal, and self-supervised models with absolute improvements of 3.0%, 1.2%, and 1.9% respectively.

artificial intelligence, emotion recognition, machine learning, (19 more...)

arXiv: 2503.22687

Country: Asia > China > Shanghai > Shanghai (0.05)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Kamahori, Keisuke, Kasai, Jungo, Kojima, Noriyuki, Kasikci, Baris

arXiv.org Artificial Intelligence, Feb-27-2025

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
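
The low-rank idea in this abstract can be illustrated with a small NumPy sketch. It is not taken from the LiteASR repository: the function name `low_rank_factorize`, the synthetic calibration data, and the choice to run PCA on the layer's output activations are assumptions made for illustration. A linear map `y = x @ W + b` is replaced by two thinner matrix multiplications whose shared inner dimension is the number of retained principal components.

```python
import numpy as np


def low_rank_factorize(W, b, X_calib, k):
    """Approximate y = x @ W + b by y ~ (x @ W1) @ W2 + b2.

    Hypothetical helper: PCA is run on the layer outputs produced by a
    small calibration set, and the top-k principal directions define the
    reduced dimension.
    """
    Y = X_calib @ W + b                      # calibration outputs, (n, d_out)
    mu = Y.mean(axis=0)                      # output mean, (d_out,)
    # Rows of Vt are the principal directions of the centered outputs.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    V_k = Vt[:k].T                           # (d_out, k)
    W1 = W @ V_k                             # project into k dimensions
    W2 = V_k.T                               # map back to d_out dimensions
    # Fold the PCA mean and the original bias into a single bias term.
    b2 = mu + (b - mu) @ V_k @ V_k.T
    return W1, W2, b2


# Synthetic data: inputs that live in an 8-dimensional subspace (assumption).
rng = np.random.default_rng(0)
d_in, d_out, rank, n = 64, 64, 8, 256
A = rng.standard_normal((rank, d_in))
X_calib = rng.standard_normal((n, rank)) @ A
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

W1, W2, b2 = low_rank_factorize(W, b, X_calib, k=rank)

X_new = rng.standard_normal((32, rank)) @ A  # held-out inputs, same subspace
exact = X_new @ W + b
approx = (X_new @ W1) @ W2 + b2
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print("parameters:", W.size, "->", W1.size + W2.size)
print("relative error on held-out inputs:", rel_err)
```

In this toy setting the activations are exactly low-rank, so the factorized layer reproduces the original almost perfectly. Real encoder activations are only approximately low-rank, and the rank `k` then trades accuracy against cost.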

arxiv preprint arxiv, encoder, ite asr, (13 more...)

arXiv: 2502.20583

Genre: Research Report > New Finding (0.34)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on the Edge

Qin, Ruiyang, Liu, Dancheng, Xu, Gelei, Yan, Zheyu, Xu, Chenhui, Hu, Yuting, Hu, X. Sharon, Xiong, Jinjun, Shi, Yiyu

arXiv.org Artificial Intelligence, Nov-26-2024

The combination of Large Language Models (LLMs) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment), ultimately enabling a more personalized and efficient adaptation on edge devices. However, due to the complex training requirements and substantial computational demands of existing approaches, cross-modal alignment between ASR audio and LLM can be challenging on edge devices. In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. Our framework enables efficient ASR-LLM alignment on resource-constrained devices like NVIDIA Jetson Orin (8GB RAM), achieving 50x training time speedup while improving the alignment quality by more than 50%. To the best of our knowledge, this is the first work to study efficient ASR-LLM alignment on resource-constrained edge devices.

alignment, bridgeformer, llm, (14 more...)

arXiv: 2411.13766

Country:

North America > United States (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine > Therapeutic Area > Neurology > Dementia (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Zhu, Pai, Bartel, Jacob W., Agarwal, Dhruuv, Partridge, Kurt, Park, Hyun Jin, Wang, Quan

arXiv.org Artificial Intelligence, Oct-21-2024

We propose GE2E-KWS -- a generalized end-to-end training and evaluation framework for customized keyword spotting. Specifically, enrollment utterances are separated and grouped by keywords from the training batch and their embedding centroids are compared to all other test utterance embeddings to compute the loss. This simulates runtime enrollment and verification stages, and improves convergence stability and training speed by optimizing matrix operations compared to SOTA triplet loss approaches. To benchmark different models reliably, we propose an evaluation process that mimics the production environment and compute metrics that directly measure keyword matching accuracy. Trained with GE2E loss, our 419KB quantized conformer model beats a 7.5GB ASR encoder by 23.6% relative AUC, and beats a same size triplet loss model by 60.7% AUC. Our KWS models are natively streamable with low memory footprints, and designed to continuously run on-device with no retraining needed for new keywords (zero-shot).
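
The centroid comparison described in this abstract can be sketched as a GE2E-style softmax loss. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `ge2e_style_loss`, the `(keywords, utterances, dims)` tensor layout, and the fixed scale `w` and offset `b` (which would normally be learned) are all hypothetical.

```python
import numpy as np


def ge2e_style_loss(enroll, test, w=10.0, b=-5.0):
    """Softmax loss between test embeddings and keyword centroids.

    enroll, test: arrays of shape (K keywords, M utterances, D dims).
    w, b: similarity scale and offset, fixed here for illustration.
    """
    # One centroid per keyword, built only from enrollment utterances.
    centroids = enroll.mean(axis=1)                                  # (K, D)
    centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)

    K, M, D = test.shape
    t = test.reshape(K * M, D)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)

    # Scaled cosine similarity of every test utterance to every centroid,
    # computed as a single matrix product.
    logits = w * (t @ centroids.T) + b                               # (K*M, K)
    labels = np.repeat(np.arange(K), M)

    # Numerically stable log-softmax over the keyword axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(K * M), labels].mean()


# Toy batch: 3 keywords, 4 utterances each, perfectly separated embeddings.
enroll = np.tile(np.eye(3)[:, None, :], (1, 4, 1))
test = enroll.copy()
print("matched keywords:   ", ge2e_style_loss(enroll, test))
print("mismatched keywords:", ge2e_style_loss(enroll, np.roll(test, 1, axis=0)))
```

The loss is small when each test utterance lies closest to its own keyword's centroid and large when the keywords are mismatched, which mirrors the enrollment and verification stages the abstract refers to.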

large language model, machine learning, utterance, (20 more...)

arXiv: 2410.16647

Country:

North America > United States > New York (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study

An, Keyu, Zhang, Shiliang, Yan, Zhijie

arXiv.org Artificial Intelligence, Sep-26-2024

Our underlying hypothesis posits that, despite being initially trained on text-based corpora, the transformers within pre-trained language models (LMs) possess a remarkable capacity to extract effective features from the input sequence. This inherent capability, we argue, is transferable to speech data, thereby augmenting the acoustic modeling ability of ASR. Through rigorous empirical analysis, our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated. Particularly, they serve as an advantageous starting point for initializing ASR encoders. Furthermore, we uncover that these transformers, when integrated into a well-established ASR encoder, can significantly boost performance, especially in scenarios where profound semantic comprehension is pivotal. This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.
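
For readers unfamiliar with the two metrics named in this abstract, WER and CER are conventionally the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words or characters. The following is a minimal sketch of that standard definition; the function names are illustrative, and ignoring spaces in CER is an assumption, since scoring conventions differ between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```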

asr encoder, encoder, transformer, (11 more...)

arXiv: 2409.17750

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Rhode Island (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
(3 more...)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

Shi, Xiaohan, He, Jiajun, Li, Xingfeng, Toda, Tomoki

arXiv.org Artificial Intelligence, Nov-14-2023

This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream ...

Typically, three common approaches are used to address the issue of noisy speech emotion recognition (NSER): the signal level, the feature level, and the model level, as outlined by Tiwari et al. [2]. For instance, Pandharipande et al. [3] used a voice activity detector to reduce noise at the signal level. Lachiri et al. [4] introduced a novel approach involving MFCC-shifted-delta-cepstral coefficients at the feature level. Tiwari et al. [2] devised a generative noise model at the model level. The previously mentioned studies have proven effective in mitigating the impact of common noise sources like white Gaussian noise on speech-related tasks. However, in real-world settings, a distinct category of noise sounds, such as high-heeled shoes and door knocking, presents a substantial challenge.

emotion recognition, recognition, representation, (13 more...)

arXiv: 2311.07093

Genre: Research Report > New Finding (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.95)

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

Yang, Yuhang, Xu, Haihua, Huang, Hao, Chng, Eng Siong, Li, Sheng

arXiv.org Artificial Intelligence, Nov-1-2022

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, as well as much more unpaired text data by multi-modal training, one needs to address two problems: 1) the synchronicity of feature sampling rates between speech and language (aka text data); 2) the homogeneity of the learned representations from two encoders. In this paper we propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method. The BiAM is to facilitate feature sampling rate exchange, realizing the quality of the transformed features for the one kind to be measured in another space, with diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder are more similar to corresponding speech ones, and therefore the shared ASR models are more amenable for unpaired text data pretraining. To validate the efficacy of the proposed method, we perform two categories of experiments with or without extra unpaired text data. Experimental results on Librispeech corpus show it can achieve up to 6.15% word error rate reduction (WERR) with only paired data learning, while 9.23% WERR when more unpaired text data is employed.

artificial intelligence, encoder, machine learning, (16 more...)

arXiv: 2211.00325

Country:

Asia > Singapore (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
