Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency
