How to Build a Pre-trained Multimodal model for Simultaneously Chatting and Decision-making?