Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency