MATER: Multi-level Acoustic and Textual Emotion Representation for Interpretable Speech Emotion Recognition