DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning