AITopics | Gállego, Gerard I.

Gállego, Gerard I.


Unveiling the Role of Pretraining in Direct Speech Translation

Alastruey, Belen, Gállego, Gerard I., Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, Sep-26-2024

Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.

information, machine learning, translation, (20 more...)

arXiv:2409.18044

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Pushing the Limits of Zero-shot End-to-End Speech Translation

Tsiamas, Ioannis, Gállego, Gerard I., Fonollosa, José A. R., Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, Feb-15-2024

Data scarcity and the modality gap between the speech and text modalities are two major obstacles for end-to-end Speech Translation (ST) systems, hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and optimizing distance metrics that bring the speech and text representations closer together. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Leveraging a novel CTC compression and Optimal Transport, we train a speech encoder using only ASR data, to align with the representation space of a massively multilingual MT model. The speech encoder seamlessly integrates with the MT model at inference, enabling direct translation from speech to text, across all languages supported by the MT model. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority over not only previous zero-shot models, but also supervised ones, achieving state-of-the-art results.
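The abstract does not spell out how its CTC compression works, so the following is only a minimal sketch of the commonly used baseline form of the idea, not the paper's novel variant: consecutive frames that share the same greedy CTC label are averaged into one vector, and runs labelled as blank are dropped. The function name `ctc_compress`, the blank index, and the toy data are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_compress(frames, ctc_labels, blank=BLANK):
    """Average consecutive frames that share a CTC label; drop blank runs.

    frames: list of equal-length feature vectors, one per time step.
    ctc_labels: greedy CTC prediction for each frame.
    """
    if len(frames) != len(ctc_labels):
        raise ValueError("frames and ctc_labels must have the same length")
    compressed = []
    position = 0
    for label, run in groupby(ctc_labels):
        run_length = len(list(run))
        segment = frames[position:position + run_length]
        position += run_length
        if label == blank:
            continue  # a blank run carries no token, so it is removed
        dim = len(segment[0])
        mean = [sum(vec[i] for vec in segment) / run_length for i in range(dim)]
        compressed.append(mean)
    return compressed


# Toy input: six frames, two non-blank runs (labels 5 and 7) and one blank.
frames = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0], [4.0, 4.0], [6.0, 8.0], [2.0, 0.0]]
labels = [5, 5, 0, 7, 7, 7]
short = ctc_compress(frames, labels)  # two vectors, one per predicted token
```

The point of the sketch is the length reduction: the output has one vector per predicted token rather than one per acoustic frame, which brings the speech sequence closer in length to a text sequence.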

artificial intelligence, large language model, natural language, (19 more...)

arXiv:2402.10422

Country:

Asia (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

SpeechAlign: a Framework for Speech Translation Alignment Evaluation

Alastruey, Belen, Sant, Aleix, Gállego, Gerard I., Dale, David, Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, Sep-20-2023

Speech-to-Speech and Speech-to-Text translation are currently dynamic areas of research. To contribute to these fields, we present SpeechAlign, a framework to evaluate the underexplored field of source-target alignment in speech models. Our framework has two core components. First, to tackle the absence of suitable evaluation datasets, we introduce the Speech Gold Alignment dataset, built upon an English-German text translation gold alignment dataset. Second, we introduce two novel metrics, Speech Alignment Error Rate (SAER) and Time-weighted Speech Alignment Error Rate (TW-SAER), to evaluate alignment quality in speech models. By publishing SpeechAlign we provide an accessible evaluation framework for model assessment, and we employ it to benchmark open-source Speech Translation models.
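The abstract names SAER and TW-SAER but does not define them. As background only, the sketch below computes the classic text Alignment Error Rate (AER) of Och and Ney, which the speech metrics presumably build on; that connection is an assumption, and this code is not the SAER definition. The function name and the toy alignment links are hypothetical.

```python
def alignment_error_rate(predicted, sure, possible):
    """Classic text AER: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted: links (source_index, target_index) proposed by a model (A).
    sure: unambiguous gold links (S).
    possible: ambiguous gold links (P); sure links also count as possible.
    """
    predicted, sure = set(predicted), set(sure)
    possible = set(possible) | sure
    if not predicted and not sure:
        return 0.0  # nothing to align and nothing predicted
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy example: one correct sure link, one possible link, one spurious link.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
model_links = {(0, 0), (2, 1), (2, 2)}
aer = alignment_error_rate(model_links, sure_links, possible_links)
```

Lower is better: a prediction that reproduces exactly the sure links scores 0, and spurious or missing links raise the score.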

artificial intelligence, speech recognition, speech translation alignment evaluation, (2 more...)

arXiv:2309.11585

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.60)

Speech Translation with Foundation Models and Optimal Transport: UPC at IWSLT23

Tsiamas, Ioannis, Gállego, Gerard I., Fonollosa, José A. R., Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, Jun-2-2023

In the past decade, the field of Speech Translation (ST) has seen significant advancements, mainly due to end-to-end models that directly translate speech, offering a more efficient method compared [...] Despite data availability challenges, recent progress has diminished the performance disparity between these approaches (Bentivogli et al., 2021; Potapczyk and Przybysz, 2020; Inaguma et al., [...] Gállego et al. (2021); Zhao et al. (2022) aimed to [...] Han et al. (2021) tackled the issue by projecting speech and text features [...] In our work, we tackle the issue of misaligned speech and text encoder representations by adopting the approach proposed by Le et al. (2023). [...] on English ASR, wav2vec 2.0 (Baevski et al., 2020), and an MT foundation model fine-tuned on multilingual MT (En-Xx), mBART50 (Tang et al., 2020), as described in Section 2.1.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv:2306.01327

Country:

Asia (1.00)
Europe (0.93)
North America > United States > Minnesota (0.14)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Explaining How Transformers Use Context to Build Predictions

Ferrando, Javier, Gállego, Gerard I., Tsiamas, Ioannis, Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, May-21-2023

Language Generation Models produce words based on the previous context. Although existing methods offer input attributions as explanations for a model's prediction, it is still unclear how prior words affect the model's decision throughout the layers. In this work, we leverage recent advances in explainability of the Transformer and present a procedure to analyze models for language generation. Using contrastive examples, we compare the alignment of our explanations with evidence of the linguistic phenomena, and show that our method consistently aligns better than gradient-based and perturbation-based baselines. Then, we investigate the role of MLPs inside the Transformer and show that they learn features that help the model predict words that are grammatically acceptable. Lastly, we apply our method to Neural Machine Translation models, and demonstrate that they generate human-like source-target alignments for building predictions.

artificial intelligence, machine translation, natural language, (17 more...)

arXiv:2305.12535

Country:

Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Sign Language Translation from Instructional Videos

Tarrés, Laia, Gállego, Gerard I., Duarte, Amanda, Torres, Jordi, Giró-i-Nieto, Xavier

arXiv.org Artificial Intelligence, Apr-14-2023

The advances in automatic sign language translation (SLT) to spoken languages have been mostly benchmarked with datasets of limited size and restricted domains. Our work advances the state of the art by providing the first baseline results on How2Sign, a large and broad dataset. We train a Transformer over I3D video features, using the reduced BLEU as a reference metric for validation, instead of the widely used BLEU score. We report a result of 8.03 on the BLEU score, and publish the first open-source implementation of its kind to promote further advances.

machine learning, natural language, translation, (18 more...)

arXiv:2304.06371

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.46)
Instructional Material > Course Syllabus & Notes (0.40)

Industry:

Education > Educational Technology (0.86)
Education > Curriculum > Subject-Specific Education (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Efficient Speech Translation with Dynamic Latent Perceivers

Tsiamas, Ioannis, Gállego, Gerard I., Fonollosa, José A. R., Costa-jussà, Marta R.

arXiv.org Artificial Intelligence, Mar-14-2023

Transformers have been the dominant architecture for Speech Translation in recent years, achieving significant improvements in translation quality. Since speech signals are longer than their textual counterparts, and due to the quadratic complexity of the Transformer, a down-sampling step is essential for its adoption in Speech Translation. Instead, in this research, we propose to ease the complexity by using a Perceiver encoder to map the speech inputs to a fixed-length latent representation. Furthermore, we introduce a novel way of training Perceivers, with Dynamic Latent Access (DLA), unlocking larger latent spaces without any additional computational overhead. Speech-to-Text Perceivers with DLA can match the performance of Transformer baselines across three language pairs in MuST-C. Finally, a DLA-trained model is easily adaptable to DLA at inference, and can be flexibly deployed with various computational budgets, without significant drops in translation quality.
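To illustrate why a Perceiver-style encoder yields a fixed-length representation, here is a minimal sketch under stated assumptions: a single attention head with no learned projections, where a small set of latent vectors queries every input frame. The `sample_latents` helper reflects one possible reading of Dynamic Latent Access (using a random subset of a larger latent pool at each step); that reading, like every name and size below, is an assumption and not taken from the paper.

```python
import math
import random


def cross_attend(latents, inputs):
    """Single-head dot-product attention: each latent queries all input frames.

    Returns one output vector per latent, regardless of how many inputs exist.
    """
    dim = len(latents[0])
    outputs = []
    for query in latents:
        scores = [sum(q * x for q, x in zip(query, frame)) / math.sqrt(dim)
                  for frame in inputs]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * frame[i] for w, frame in zip(weights, inputs)) / total
            for i in range(dim)
        ])
    return outputs


def sample_latents(all_latents, k, rng):
    """Assumed reading of DLA: pick a random subset of k latents per step."""
    return rng.sample(all_latents, k)


rng = random.Random(0)
dim = 4
all_latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
short_speech = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]
long_speech = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(500)]

active = sample_latents(all_latents, k=8, rng=rng)
encoded_short = cross_attend(active, short_speech)  # 8 vectors
encoded_long = cross_attend(active, long_speech)    # still 8 vectors
```

The cost of this step grows linearly with the input length and with the number of active latents, so choosing a smaller `k` at inference is one way such a model could trade quality for compute.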

artificial intelligence, machine translation, natural language, (18 more...)

arXiv:2210.16264

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
