Weakly-supervised Latent Models for Task-specific Visual-Language Control