Gaido, Marco
NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
Züfle, Maike, Papi, Sara, Savoldi, Beatrice, Gaido, Marco, Bentivogli, Luisa, Niehues, Jan
Scientific communication is receiving increasing attention in natural language processing, especially to help researchers access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.
Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison
Lam, Tsz Kin, Gaido, Marco, Papi, Sara, Bentivogli, Luisa, Haddow, Barry
Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech -- the most common form of communication. To integrate speech into LLMs, one promising approach is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with the speech encoder. However, DFP typically requires connecting a text decoder to a speech encoder. This raises questions about the importance of having a sophisticated speech encoder for DFP, and how its performance compares with a standard encoder-decoder (i.e. cross-attention) architecture. In order to perform a controlled architectural comparison, we train all models from scratch, rather than using large pretrained models, and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. We study the influence of a speech encoder in DFP. More importantly, we compare DFP and cross-attention under a variety of configurations, such as CTC compression, sequence-level knowledge distillation, generation speed and GPU memory footprint on monolingual, bilingual and multilingual models. Despite the prevalence of DFP over cross-attention, our overall results do not indicate a clear advantage of DFP.
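To make the contrast between the two integration strategies concrete, below is a minimal PyTorch sketch of dense feature prepending versus a cross-attention decoder. Module names, dimensions, and the omission of causal masking are illustrative simplifications, not the paper's implementation.

```python
# Minimal sketch (PyTorch) contrasting the two integration strategies discussed
# above. Names and dimensions are illustrative, not the paper's code.
import torch
import torch.nn as nn

class DensePrependModel(nn.Module):
    """Dense feature prepending (DFP): projected speech features are placed
    in front of the text embeddings and fed to a decoder-only stack."""
    def __init__(self, speech_dim=512, model_dim=768, vocab=32000):
        super().__init__()
        self.proj = nn.Linear(speech_dim, model_dim)   # speech -> embedding space
        self.embed = nn.Embedding(vocab, model_dim)
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.decoder_only = nn.TransformerEncoder(layer, num_layers=2)  # causal mask omitted for brevity
        self.lm_head = nn.Linear(model_dim, vocab)

    def forward(self, speech_feats, text_ids):
        prefix = self.proj(speech_feats)               # (B, T_s, D)
        tokens = self.embed(text_ids)                  # (B, T_t, D)
        hidden = self.decoder_only(torch.cat([prefix, tokens], dim=1))
        return self.lm_head(hidden[:, prefix.size(1):])  # predict only at text positions

class CrossAttentionModel(nn.Module):
    """Encoder-decoder alternative: the decoder attends to the speech features
    through cross-attention instead of consuming them as a prefix."""
    def __init__(self, speech_dim=512, model_dim=768, vocab=32000):
        super().__init__()
        self.proj = nn.Linear(speech_dim, model_dim)
        self.embed = nn.Embedding(vocab, model_dim)
        layer = nn.TransformerDecoderLayer(model_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(model_dim, vocab)

    def forward(self, speech_feats, text_ids):
        memory = self.proj(speech_feats)               # speech features as cross-attention memory
        hidden = self.decoder(self.embed(text_ids), memory)
        return self.lm_head(hidden)
```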
Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection
Lee, Beomseok, Gaido, Marco, Calapodescu, Ioan, Besacier, Laurent, Negri, Matteo
As in any data-intensive domain, collecting high-quality datasets is a fundamental and costly prerequisite for the development of speech-processing applications. Traditional methods heavily rely on human workforce, whose costs, as data collection scales, are hard to sustain. In the quest for scalable solutions to tackle this problem, crowdsourcing emerged as a viable option that also enables the coverage of diverse populations (Cefkin et al., 2014; Poesio et al., 2017). Due to the variable quality of crowd-sourced data, validation methods that discard low-quality contributions are essential to build reliable datasets (Negri et al., 2011; Sabou et al., 2014; Chittilappilly et al., 2016). This need is exacerbated in the collection of speech-text pairs, where [...]. To fill this gap, this paper explores the use of SFMs to automatize the validation of crowdsourced speech data. To this aim, we investigate the employment of off-the-shelf SFMs such as Whisper and SeamlessM4T (Radford et al., 2022; Communication et al., 2023), along with machine translation (MT) models and grapheme-to-phoneme conversion (G2P). Through experiments on French, German, and Korean data, we test the integration of SFMs and crowdsourcing to reduce validation costs while preserving final data quality. Our results show that leveraging SFMs yields a cost reduction by over 40%, while maintaining high data quality, significantly improving the efficiency and scalability of crowd-sourced speech data collection.
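As an illustration of how an off-the-shelf SFM could screen a crowdsourced recording, the sketch below transcribes the submitted audio and compares it against the prompt the contributor was asked to read. The model choice, the WER threshold, and the accept/reject rule are assumptions for illustration, not the validation pipeline used in the paper.

```python
# Illustrative sketch of SFM-based validation of a crowdsourced recording:
# transcribe the submitted audio with an off-the-shelf model and compare it
# against the prompt the contributor was asked to read. The 0.3 WER threshold
# and the model size are arbitrary choices, not the paper's setup.
from transformers import pipeline
from jiwer import wer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def validate_recording(audio_path: str, prompt_text: str, max_wer: float = 0.3) -> bool:
    hypothesis = asr(audio_path)["text"]
    error_rate = wer(prompt_text.lower(), hypothesis.lower())
    return error_rate <= max_wer  # accept only sufficiently faithful readings

# Example: validate_recording("contrib_001.wav", "the sentence the worker was shown")
```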
How to Connect Speech Foundation Models and Large Language Models? What Matters and What Does Not
Verdini, Francesco, Melucci, Pierfrancesco, Perna, Stefano, Cariaggi, Francesco, Gaido, Marco, Papi, Sara, Mazurek, Szymon, Kasztelnik, Marek, Bentivogli, Luisa, Bratières, Sébastien, Merialdo, Paolo, Scardapane, Simone
The remarkable performance achieved by Large Language Models (LLMs) has driven research efforts to leverage them for a wide range of tasks and input modalities. In speech-to-text (S2T) tasks, the emerging solution consists of projecting the output of the encoder of a Speech Foundation Model (SFM) into the LLM embedding space through an adapter module. However, no work has yet investigated how much the downstream-task performance depends on each component (SFM, adapter, LLM) nor whether the best design of the adapter depends on the chosen SFM and LLM. To fill this gap, we evaluate the combination of 5 adapter modules, 2 LLMs (Mistral and Llama), and 2 SFMs (Whisper and SeamlessM4T) on two widespread S2T tasks, namely Automatic Speech Recognition and Speech Translation. Our results demonstrate that the SFM plays a pivotal role in downstream performance, while the adapter choice has a moderate impact and depends on the SFM and LLM.
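The integration pattern described above can be sketched as follows: an adapter maps SFM encoder states into the LLM embedding space and shortens the sequence before the result is prepended to the prompt embeddings. The convolutional length reduction and all dimensions are illustrative assumptions, not one of the five adapters evaluated in the paper.

```python
# Minimal sketch of the S2T integration pattern described above: SFM encoder
# states are mapped into the LLM embedding space by an adapter and prepended
# to the prompt embeddings. Dimensions and the conv-based downsampling are
# assumptions for illustration only.
import torch
import torch.nn as nn

class LengthAdapter(nn.Module):
    def __init__(self, sfm_dim=1280, llm_dim=4096, stride=4):
        super().__init__()
        self.downsample = nn.Conv1d(sfm_dim, sfm_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.Linear(sfm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, sfm_states):                        # (B, T, sfm_dim)
        x = self.downsample(sfm_states.transpose(1, 2)).transpose(1, 2)  # shorten the sequence
        return self.proj(x)                               # (B, T/stride, llm_dim)

# Downstream use (conceptually): inputs_embeds = torch.cat(
#     [adapter(sfm_encoder(audio)), llm.embed_tokens(prompt_ids)], dim=1)
```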
SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation
Fucci, Dennis, Gaido, Marco, Savoldi, Beatrice, Negri, Matteo, Cettolo, Mauro, Bentivogli, Luisa
Spurred by the demand for interpretable models, research on eXplainable AI for language technologies has experienced significant growth, with feature attribution methods emerging as a cornerstone of this progress. While prior work in NLP explored such methods for classification tasks and textual applications, explainability intersecting generation and speech is lagging, with existing techniques failing to account for the autoregressive nature of state-of-the-art models and to provide fine-grained, phonetically meaningful explanations. We address this gap by introducing Spectrogram Perturbation for Explainable Speech-to-text Generation (SPES), a feature attribution technique applicable to sequence generation tasks with autoregressive models. SPES provides explanations for each predicted token based on both the input spectrogram and the previously generated tokens. Extensive evaluation on speech recognition and translation demonstrates that SPES generates explanations that are faithful and plausible to humans.
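A didactic approximation of the underlying idea, occlusion-based attribution over spectrogram patches for a single predicted token, is sketched below. The `model` interface, patch sizes, and zero-masking are assumptions for illustration; the actual SPES procedure differs.

```python
# Simplified occlusion-style sketch of spectrogram perturbation for attribution:
# mask time-frequency patches and measure how much the probability of a given
# predicted token drops. Didactic approximation, not the SPES algorithm.
import torch

def occlusion_saliency(model, spectrogram, prev_tokens, target_token, patch_t=10, patch_f=16):
    """spectrogram: (T, F); model(spec, prev_tokens) is assumed to return
    log-probabilities over the vocabulary for the next token."""
    with torch.no_grad():
        base = model(spectrogram, prev_tokens)[target_token]
        saliency = torch.zeros_like(spectrogram)
        for t in range(0, spectrogram.size(0), patch_t):
            for f in range(0, spectrogram.size(1), patch_f):
                perturbed = spectrogram.clone()
                perturbed[t:t + patch_t, f:f + patch_f] = 0.0   # occlude one patch
                drop = base - model(perturbed, prev_tokens)[target_token]
                saliency[t:t + patch_t, f:f + patch_f] = drop   # larger drop = more important region
    return saliency
```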
MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Gaido, Marco, Papi, Sara, Bentivogli, Luisa, Brutti, Alessio, Cettolo, Mauro, Gretter, Roberto, Matassoni, Marco, Nabih, Mohamed, Negri, Matteo
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection
Papi, Sara, Gaido, Marco, Negri, Matteo, Bentivogli, Luisa
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream. Unlike simultaneous ST (SimulST), which deals with pre-segmented speech, StreamST faces the challenges of handling continuous and unbounded audio streams. This requires additional decisions about what to retain of the previous history, which is impractical to keep entirely due to latency and computational constraints. Despite the real-world demand for real-time ST, research on streaming translation remains limited, with existing works solely focusing on SimulST. To fill this gap, we introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric designed to be comparable with existing metrics for SimulST. Extensive experiments across all 8 languages of MuST-C v1.0 show the effectiveness of StreamAtt compared to a naive streaming baseline and the related state-of-the-art SimulST policy, providing a first step in StreamST research.
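A toy illustration of attention-based audio history selection is given below: the decoder's cross-attention is used to locate the earliest audio frame still needed as context, and everything before it is dropped. The trimming rule is a deliberate simplification, not the exact StreamAtt policy.

```python
# Toy illustration of attention-based audio history pruning in a streaming
# setting: use the decoder's cross-attention to decide which audio frames can
# be safely discarded. Simplified rule for clarity, not the StreamAtt policy.
import torch

def trim_audio_history(cross_attention, kept_token_start):
    """cross_attention: (n_emitted_tokens, n_audio_frames), averaged over heads/layers.
    kept_token_start: index of the first emitted token still kept as textual history."""
    # Frame most attended by the oldest token we still need as context.
    anchor_frame = int(cross_attention[kept_token_start].argmax())
    return anchor_frame  # audio frames before this index are dropped at the next step

# Example: attn = torch.rand(12, 400); start = trim_audio_history(attn, kept_token_start=5)
```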
Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Gaido, Marco, Papi, Sara, Negri, Matteo, Bentivogli, Luisa
The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
Gaido, Marco, Papi, Sara, Negri, Matteo, Cettolo, Mauro, Bentivogli, Luisa
Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, employed diversely for the three subtasks. In response to the acknowledged limitations associated with this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles, entirely eliminating any dependence on intermediate transcripts also for timestamp prediction. Experimental results, backed by manual evaluation, showcase our solution's new state-of-the-art performance across multiple language pairs and diverse conditions.
How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena
Gaido, Marco, Papi, Sara, Negri, Matteo, Bentivogli, Luisa
The attention mechanism, a cornerstone of state-of-the-art neural models, faces computational hurdles in processing long sequences due to its quadratic complexity. Consequently, research efforts in the last few years focused on finding more efficient alternatives. Among them, Hyena (Poli et al., 2023) stands out for achieving competitive results in both language modeling and image classification, while offering sub-quadratic memory and computational complexity. Building on these promising results, we propose ConfHyena, a Conformer whose encoder self-attentions are replaced with an adaptation of Hyena for speech processing, where the long input sequences cause high computational costs. Through experiments in automatic speech recognition (for English) and translation (from English into 8 target languages), we show that our best ConfHyena model significantly reduces the training time by 27%, at the cost of minimal quality degradation (~1%), which, in most cases, is not statistically significant.
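As a rough sketch of the kind of attention replacement involved, the block below mixes the time dimension with an FFT-based long convolution plus elementwise gating, giving sub-quadratic cost in sequence length. It is a simplified stand-in for illustration, not the Hyena operator or the ConfHyena encoder layer.

```python
# Simplified gated long-convolution block: a sub-quadratic time-mixing layer in
# the spirit of the attention replacement discussed above. Not the Hyena
# operator and not the ConfHyena layer; an illustrative stand-in only.
import torch
import torch.nn as nn

def fft_long_conv(u, k):
    """Causal long convolution via FFT. u: (B, T, D); k: (T, D) learned filter."""
    T = u.size(1)
    u_f = torch.fft.rfft(u, n=2 * T, dim=1)
    k_f = torch.fft.rfft(k, n=2 * T, dim=0)
    return torch.fft.irfft(u_f * k_f, n=2 * T, dim=1)[:, :T]

class GatedLongConvBlock(nn.Module):
    def __init__(self, dim, max_len=4000):
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)
        self.filter = nn.Parameter(torch.randn(max_len, dim) * 0.02)  # assumes max_len >= sequence length
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # (B, T, dim)
        v, k, g = self.in_proj(x).chunk(3, dim=-1)
        y = fft_long_conv(k * v, self.filter[: x.size(1)])  # O(T log T) time mixing
        return self.out_proj(torch.sigmoid(g) * y)          # data-controlled gating
```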