Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Miao, Haoran, Cheng, Gaofeng, Gao, Changfeng, Zhang, Pengyuan, Yan, Yonghong

Jan-15-2020–arXiv.org Machine Learning

Recently, the Transformer has achieved success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
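The idea of splitting speech into isolated chunks can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_chunks` and `chunk_attention_mask`, the `chunk_size` parameter, and the assumption that chunks are non-overlapping with no extra context frames are all hypothetical choices made for illustration. The sketch only shows how restricting self-attention to frames of the same chunk lets an encoder process audio as it arrives.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(frames: Sequence[T], chunk_size: int) -> List[Sequence[T]]:
    """Split a frame sequence into consecutive, non-overlapping chunks.

    Assumption: chunks do not overlap; the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def chunk_attention_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Build a self-attention mask for isolated chunks.

    mask[q][k] is True when query frame q may attend to key frame k,
    which here means both frames fall inside the same chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        [q // chunk_size == k // chunk_size for k in range(num_frames)]
        for q in range(num_frames)
    ]


if __name__ == "__main__":
    # Seven placeholder frame indices, grouped three at a time (hypothetical sizes).
    print(split_into_chunks(list(range(7)), 3))
    for row in chunk_attention_mask(4, 2):
        print(row)
```

Because each chunk attends only to itself in this sketch, a chunk can be encoded as soon as its last frame is available, which is what makes chunk-wise processing attractive for online recognition.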

architecture, international conference, speech recognition
