Can multimodal representation learning by alignment preserve modality-specific information?
