Speech Foundation Models and Crowdsourcing for Efficient, High-Quality Data Collection

Lee, Beomseok, Gaido, Marco, Calapodescu, Ioan, Besacier, Laurent, Negri, Matteo

Dec-16-2024–arXiv.org Artificial Intelligence

As in any data-intensive domain, collecting high-quality datasets is a fundamental and costly prerequisite for the development of speech-processing applications. Traditional methods heavily rely on human workforce, whose costs, as data collection scales, are hard to sustain. In the quest for scalable solutions to tackle this problem, crowdsourcing emerged as a viable option that also enables the coverage of diverse populations (Cefkin et al., 2014; Poesio et al., 2017). Due to the variable quality of crowd-sourced data, validation methods that discard low-quality contributions are essential to build reliable datasets (Negri et al., 2011; Sabou et al., 2014; Chittilappilly et al., 2016). This need is exacerbated in the collection of speech-text pairs, where [...]

To fill this gap, this paper explores the use of SFMs to automatize the validation of crowdsourced speech data. To this aim, we investigate the employment of off-the-shelf SFMs such as Whisper and SeamlessM4T (Radford et al., 2022; Communication et al., 2023), along with machine translation (MT) models and grapheme-to-phoneme conversion (G2P). Through experiments on French, German, and Korean data, we test the integration of SFMs and crowdsourcing to reduce validation costs while preserving final data quality. Our results show that leveraging SFMs yields a cost reduction by over 40%, while maintaining high data quality, significantly improving the efficiency and scalability of crowd-sourced speech data collection.
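To make the idea of automatic validation concrete, here is a minimal sketch of one way an SFM transcript could be used to triage a crowdsourced recording. It is an illustration under stated assumptions, not the procedure used in the paper: the transcript is assumed to come from some off-the-shelf SFM, word error rate (WER) stands in for whatever distance the authors actually use, and the function names and both thresholds are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def auto_validate(prompt_text: str, asr_transcript: str,
                  accept_below: float = 0.1,
                  reject_above: float = 0.5) -> str:
    """Triage one recording; both thresholds are hypothetical values.

    prompt_text is the sentence the contributor was asked to read, and
    asr_transcript is assumed to be the output of an SFM on the recording.
    """
    wer = word_error_rate(prompt_text, asr_transcript)
    if wer <= accept_below:
        return "accept"
    if wer >= reject_above:
        return "reject"
    return "human_review"  # only uncertain cases reach paid validators
```

In a scheme like this, any saving comes from the share of recordings that are accepted or rejected automatically and therefore never reach a human validator. How large that share can be without hurting data quality is an empirical question that this sketch does not answer.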

artificial intelligence, data quality, social media

Country:
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - California > Los Angeles County
    - Los Angeles (0.04)
- Europe
  - United Kingdom > Scotland (0.04)
  - France (0.04)
  - Netherlands > South Holland
    - Dordrecht (0.04)
  - Italy > Trentino-Alto Adige/Südtirol
    - Trentino Province > Trento (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology
  - Artificial Intelligence > Speech (1.00)
  - Communications > Social Media
    - Crowdsourcing (1.00)