A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots