SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency