The Conformer Encoder May Reverse the Time Dimension

Schmitt, Robin, Zeyer, Albert, Zeineldeen, Mohammad, Schlüter, Ralf, Ney, Hermann

Oct-1-2024–arXiv.org Machine Learning

We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then allows only the reversed information to pass through. We propose several methods and ideas for avoiding this flipping. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities with respect to the encoder input frames.
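
The two diagnostics mentioned in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the encoder here is a hypothetical stand-in that simply reverses its input, the "label scores" are attention-weighted sums rather than real log probabilities, and the gradients are taken by finite differences instead of backpropagation. All function names and numbers are assumptions made for illustration. The sketch shows how cross-attention peaks can decrease over decoding steps while the gradient-based alignment to the input frames still increases, which is the signature of an encoder that has flipped the time axis.

```python
import math


def toy_encoder(frames):
    """Hypothetical encoder that has flipped time: output t holds input T-1-t."""
    return [math.tanh(x) for x in reversed(frames)]


def label_scores(frames, attention):
    """Toy stand-in for label log probabilities.

    Each label score is the cross-attention-weighted sum of encoder outputs.
    """
    enc = toy_encoder(frames)
    return [sum(w * h for w, h in zip(row, enc)) for row in attention]


def input_gradients(frames, attention, eps=1e-5):
    """Central finite differences of each label score w.r.t. each input frame.

    Returns a [labels][frames] table; a real system would use autograd instead.
    """
    grads = [[0.0] * len(frames) for _ in attention]
    for t in range(len(frames)):
        plus = list(frames)
        minus = list(frames)
        plus[t] += eps
        minus[t] -= eps
        scores_plus = label_scores(plus, attention)
        scores_minus = label_scores(minus, attention)
        for s in range(len(attention)):
            grads[s][t] = (scores_plus[s] - scores_minus[s]) / (2 * eps)
    return grads


def peak_positions(table):
    """Position with the largest absolute value in each row."""
    return [max(range(len(row)), key=lambda t: abs(row[t])) for row in table]


def is_strictly_decreasing(positions):
    return all(a > b for a, b in zip(positions, positions[1:]))


def is_strictly_increasing(positions):
    return all(a < b for a, b in zip(positions, positions[1:]))


# Made-up example: 4 input frames with scalar features, 4 output labels.
frames = [0.3, -0.2, 0.5, 0.1]

# Made-up cross-attention weights over ENCODER OUTPUT positions, one row per label.
# The peak moves from the last position to the first as decoding proceeds.
attention = [
    [0.05, 0.05, 0.10, 0.80],
    [0.05, 0.10, 0.80, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.80, 0.10, 0.05, 0.05],
]

attention_peaks = peak_positions(attention)
gradient_peaks = peak_positions(input_gradients(frames, attention))

print(attention_peaks)  # [3, 2, 1, 0]
print(gradient_peaks)   # [0, 1, 2, 3]

# Decreasing attention peaks plus increasing input-gradient peaks suggest that
# the encoder output is time-reversed relative to its input.
encoder_looks_reversed = (
    is_strictly_decreasing(attention_peaks)
    and is_strictly_increasing(gradient_peaks)
)
print(encoder_looks_reversed)
```

In a real model the per-frame gradient would be a vector, so some reduction over the feature dimension (for example a norm) would be needed before taking the peak; that choice is likewise an assumption here.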

computational linguistic, sequence, speech recognition

Country:
- Oceania > Australia (0.04)
- North America > United States
  - Hawaii > Honolulu County
    - Honolulu (0.04)
  - California > Los Angeles County
    - Los Angeles (0.14)
- Europe
  - Greece (0.04)
  - Belgium (0.04)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy
    - Tuscany > Florence (0.04)
    - Marche > Ancona Province
      - Ancona (0.04)
  - Germany
    - Berlin (0.04)
    - North Rhine-Westphalia > Cologne Region
      - Aachen (0.04)
- Asia
  - South Korea > Seoul
    - Seoul (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)
  - India > Telangana
    - Hyderabad (0.04)

Genre:
- Research Report (0.85)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Speech (0.97)
  - Machine Learning > Neural Networks
    - Deep Learning (0.69)