A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
