Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better