Bridging the Granularity Gap for Acoustic Modeling

Xu, Chen; Zhang, Yuhao; Jiao, Chengbo; Liu, Xiaoqian; Hu, Chi; Zeng, Xin; Xiao, Tong; Ma, Anxiang; Wang, Huizhen; Zhu, JingBo

May-26-2023–arXiv.org Artificial Intelligence

While the Transformer has become the de facto standard for speech, modeling fine-grained frame-level features remains an open challenge for capturing long-distance dependencies and distributing attention weights. We propose Progressive Down-Sampling (PDS), which gradually compresses the acoustic features into coarser-grained units containing more complete semantic information, similar to text-level representations. In addition, we develop a representation fusion method to alleviate the information loss that inevitably occurs under high compression. In this way, we compress the acoustic features to 1/32 of their initial length while achieving better or comparable performance on the speech recognition task. As a bonus, the method yields inference speedups ranging from 1.20× to 1.47×. By reducing the modeling burden, we also achieve competitive results when training on the more challenging speech translation task.
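To make the 1/32 compression concrete, the following is a minimal sketch of stage-wise length reduction, not the authors' implementation. It assumes five stages with stride 2 each (2^5 = 32) and uses plain mean pooling as a stand-in for whatever learned down-sampling the paper uses; the names `downsample` and `progressive_downsample` are hypothetical, and the representation fusion step is not modeled.

```python
from typing import List, Sequence

Frame = List[float]


def downsample(frames: List[Frame], stride: int = 2) -> List[Frame]:
    """Average each group of `stride` adjacent frames.

    Assumption: mean pooling stands in for a learned, strided layer.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def progressive_downsample(
    frames: List[Frame], strides: Sequence[int] = (2, 2, 2, 2, 2)
) -> List[List[Frame]]:
    """Apply the stages in order and return the output of every stage.

    Assumption: the stride schedule is illustrative. A real encoder would
    also run attention layers between stages; they are omitted here.
    """
    stages = []
    for stride in strides:
        frames = downsample(frames, stride)
        stages.append(frames)
    return stages


# 320 toy frames with 4-dimensional features.
frames = [[float(t)] * 4 for t in range(320)]
stages = progressive_downsample(frames)
lengths = [len(stage) for stage in stages]
print(lengths)
```

Each stage halves the sequence, so the final stage is 32 times shorter than the input. Since self-attention cost grows with the square of sequence length, later stages operate on far fewer positions than a frame-level model would.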

artificial intelligence, machine learning, natural language

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia
  - Queensland > Brisbane (0.04)
- North America
  - United States
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.14)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
  - Canada > Alberta
    - Census Division No. 6 > Calgary Metropolitan Region > Calgary (0.04)
- Europe
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Austria > Styria
    - Graz (0.05)
- Asia
  - India (0.04)
  - Singapore (0.04)
  - South Korea > Seoul
    - Seoul (0.04)
  - China
    - Shanghai > Shanghai (0.05)
    - Liaoning Province > Shenyang (0.04)
    - Gansu Province > Lanzhou (0.04)
    - Beijing > Beijing (0.04)

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Speech > Speech Recognition (0.89)
  - Machine Learning > Neural Networks
    - Deep Learning (0.46)
