Why Only Text: Empowering Vision-and-Language Navigation with Multi-modal Prompts