HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Xiao, Cihan; Xinyuan, Henry Li; Yang, Jinyi; Gao, Dongji; Wiesner, Matthew; Duh, Kevin; Khudanpur, Sanjeev

Jun-19-2023–arXiv.org Artificial Intelligence

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Because the corpus is large, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results offer insights into speech recognition and translation research for languages in which non-verbatim or "noisy" transcription is common due to various factors, including vernacular and dialectal speech.

artificial intelligence, natural language, translation

Genre:
- Research Report (0.64)

Industry:
- Media (0.34)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Natural Language > Machine Translation (1.00)
