AITopics | Mengibar, Pedro Moreno

Collaborating Authors

Mengibar, Pedro Moreno

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers

Stooke, Adam, Prabhavalkar, Rohit, Sim, Khe Chai, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceFeb-6-2025

Modern systems for automatic speech recognition, including the RNN-Transducer and Attention-based Encoder-Decoder (AED), are designed so that the encoder is not required to alter the time-position of information from the audio sequence into the embedding; alignment to the final text output is processed during decoding. We discover that the transformer-based encoder adopted in recent years is actually capable of performing the alignment internally during the forward pass, prior to decoding. This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder". To train it, we discard the dynamic programming of RNN-T in favor of the frame-wise cross-entropy loss of AED, while the decoder employs the lighter text-only recurrence of RNN-T without learned cross-attention -- it simply scans embedding frames in order from the beginning, producing one token each until predicting the end-of-message. We conduct experiments demonstrating performance remarkably close to the state of the art, including a special inference configuration enabling long-form recognition. In a representative comparison, we measure the total inference time for our model to be 2x faster than RNN-T and 16x faster than AED. Lastly, we find that the audio-text alignment is clearly visible in the self-attention weights of a certain layer, which could be said to perform "self-transduction".

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2502.05232

Country: Asia (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

Add feedback

TransformerFAM: Feedback attention is working memory

Hwang, Dongseong, Wang, Weiran, Huo, Zhuoyuan, Sim, Khe Chai, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceMay-7-2024

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2404.09173

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Hierarchical Recurrent Adapters for Efficient Multi-Task Adaptation of Large Speech Models

Munkhdalai, Tsendsuren, Chen, Youzheng, Sim, Khe Chai, Biadsy, Fadi, Sainath, Tara, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceMar-25-2024

Parameter efficient adaptation methods have become a key mechanism to train large pre-trained models for downstream tasks. However, their per-task parameter overhead is considered still high when the number of downstream tasks to adapt for is large. We introduce an adapter module that has a better efficiency in large scale multi-task adaptation scenario. Our adapter is hierarchical in terms of how the adapter parameters are allocated. The adapter consists of a single shared controller network and multiple task-level adapter heads to reduce the per-task parameter overhead without performance regression on downstream tasks. The adapter is also recurrent so the entire adapter parameters are reused across different layers of the pre-trained model. Our Hierarchical Recurrent Adapter (HRA) outperforms the previous adapter-based approaches as well as full model fine-tuning baseline in both single and multi-task adaptation settings when evaluated on automatic speech recognition tasks.

adapter head, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2403.19709

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech (0.70)

Add feedback

Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition

Ngai, Hillary, Agrawal, Rohan, Gaur, Neeraj, Huang, Ronny, Haghani, Parisa, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceOct-17-2023

Adapters are an efficient, composable alternative to full fine-tuning of pre-trained models and help scale the deployment of large ASR models to many tasks. In practice, a task ID is commonly prepended to the input during inference to route to single-task adapters for the specified task. However, one major limitation of this approach is that the task ID may not be known during inference, rendering it unsuitable for most multi-task settings. To address this, we propose three novel task-ID-free methods to combine single-task adapters in multi-task ASR and investigate two learning algorithms for training. We evaluate our methods on 10 test sets from 4 diverse ASR tasks and show that our methods are non-destructive and parameter-efficient. While only updating 17% of the model parameters, our methods can achieve an 8% mean WER improvement relative to full fine-tuning and are on-par with task-ID adapter routing.

audio-adapterfusion, non-destructive multi-task speech recognition, task-id-free approach

arXiv.org Artificial Intelligence

2310.13015

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.40)

Add feedback

Contextual Biasing with the Knuth-Morris-Pratt Matching Algorithm

Wang, Weiran, Wu, Zelin, Caseiro, Diamantino, Munkhdalai, Tsendsuren, Sim, Khe Chai, Rondon, Pat, Pundak, Golan, Song, Gan, Prabhavalkar, Rohit, Meng, Zhong, Zhao, Ding, Sainath, Tara, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceSep-29-2023

Contextual biasing refers to the problem of biasing the automatic speech recognition (ASR) systems towards rare entities that are relevant to the specific user or application scenarios. We propose algorithms for contextual biasing based on the Knuth-Morris-Pratt algorithm for pattern matching. During beam search, we boost the score of a token extension if it extends matching into a set of biasing phrases. Our method simulates the classical approaches often implemented in the weighted finite state transducer (WFST) framework, but avoids the FST language altogether, with careful considerations on memory footprint and efficiency on tensor processing units (TPUs) by vectorization. Without introducing additional model parameters, our method achieves significant word error rate (WER) reductions on biasing test sets by itself, and yields further performance gain when combined with a model-based biasing method.

knuth-morris-pratt matching algorithm, machine learning, speech recognition, (2 more...)

arXiv.org Artificial Intelligence

2310.00178

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.87)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.53)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.53)

Add feedback

Improving Speech Recognition for African American English With Audio Classification

Garg, Shefali, Huo, Zhouyuan, Sim, Khe Chai, Schwartz, Suzan, Chua, Mason, Aksënova, Alëna, Munkhdalai, Tsendsuren, King, Levi, Wright, Darryl, Mengesha, Zion, Hwang, Dongseong, Sainath, Tara, Beaufays, Françoise, Mengibar, Pedro Moreno

arXiv.org Artificial IntelligenceSep-16-2023

Automatic speech recognition (ASR) systems have been shown to have large quality disparities between the language varieties they are intended or expected to recognize. One way to mitigate this is to train or fine-tune models with more representative datasets. But this approach can be hindered by limited in-domain data for training and evaluation. We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain (long-form) African American English (AAE) data. We use CORAAL, YouTube and Mozilla Common Voice to train an audio classifier to approximately output whether an utterance is AAE or some other variety including Mainstream American English (MAE). By combining the classifier output with coarse geographic information, we can select a subset of utterances from a large corpus of untranscribed short-form queries for semi-supervised learning at scale. Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.

artificial intelligence, machine learning, utterance, (15 more...)

arXiv.org Artificial Intelligence

2309.09996

Country:

North America > United States (0.15)
Europe (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Add feedback