Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views