End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders