Modality Selection and Skill Segmentation via Cross-Modality Attention