CROMA: Remote Sensing Representations with Contrastive Radar-Optical Masked Autoencoders

Neural Information Processing Systems 

Remote sensing, a vital and rapidly growing application area, offers vast yet sparsely labeled, spatially aligned multimodal data; this makes self-supervised learning algorithms invaluable. We present CROMA: a framework that combines contrastive and reconstruction self-supervised objectives to learn rich unimodal and multimodal representations. Our method separately encodes masked-out multispectral optical and synthetic aperture radar samples, aligned in space and time, and performs cross-modal contrastive learning. Another encoder fuses these sensors, producing joint multimodal encodings that are used to predict the masked patches via a lightweight decoder. We show that these objectives are complementary when leveraged on spatially aligned multimodal data. We also introduce X- and 2D-ALiBi, which spatially bias our cross- and self-attention matrices, respectively. These strategies improve representations and allow our models to effectively extrapolate to images up to $17.6\times$ larger at test time.
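To make the training recipe concrete, below is a minimal PyTorch sketch of how the two objectives could be combined. All names (`radar_enc`, `optical_enc`, `fusion_enc`, `decoder`) are hypothetical placeholders, and details such as mean-pooling, the InfoNCE temperature, sharing one mask across modalities, and MSE reconstruction are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def infonce(z_radar, z_optical, temperature=0.07):
    # Symmetric cross-modal InfoNCE over pooled unimodal embeddings.
    # Radar/optical pairs from the same location and time are positives;
    # all other pairings in the batch serve as negatives.
    z_r = F.normalize(z_radar, dim=-1)
    z_o = F.normalize(z_optical, dim=-1)
    logits = z_r @ z_o.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_r.size(0), device=z_r.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def croma_step(radar_enc, optical_enc, fusion_enc, decoder,
               radar_patches, optical_patches, keep_idx, masked_targets):
    # Each unimodal encoder sees only the unmasked ("kept") patches of its
    # modality; for simplicity this sketch assumes one shared mask.
    h_r = radar_enc(radar_patches[:, keep_idx])
    h_o = optical_enc(optical_patches[:, keep_idx])
    # Contrastive objective on pooled unimodal representations.
    loss_con = infonce(h_r.mean(dim=1), h_o.mean(dim=1))
    # A fusion encoder produces joint multimodal encodings ...
    h_joint = fusion_enc(torch.cat([h_r, h_o], dim=1))
    # ... from which a lightweight decoder predicts the masked patches.
    loss_rec = F.mse_loss(decoder(h_joint), masked_targets)
    return loss_con + loss_rec
```

The design intuition is that the contrastive term aligns the two sensors' representation spaces while the reconstruction term forces the fused encoding to retain dense, patch-level detail; the abstract's claim is that these pressures are complementary on spatially aligned multimodal data.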
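Similarly, here is a sketch of a 2D-ALiBi-style bias, assuming it penalizes pre-softmax attention logits in proportion to the Euclidean distance between patch-grid coordinates, with per-head slopes following the geometric schedule of the original ALiBi paper; both the distance metric and the slope schedule are assumptions here.

```python
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric per-head slopes, 2^(-8i/num_heads) for i = 1..num_heads,
    # as in the original ALiBi schedule (assumed, exact for power-of-2 heads).
    start = 2 ** (-8 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_2d_bias(grid_size: int, num_heads: int) -> torch.Tensor:
    # Coordinates of every patch on a grid_size x grid_size grid.
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    ), dim=-1).reshape(-1, 2).float()           # (N, 2), N = grid_size**2
    dist = torch.cdist(coords, coords, p=2)     # pairwise Euclidean distances
    slopes = alibi_slopes(num_heads)            # (H,)
    # Bias added to pre-softmax attention logits: -slope * distance.
    return -slopes.view(-1, 1, 1) * dist        # (H, N, N)
```

Because the bias depends only on relative patch distances rather than learned position embeddings, it can simply be recomputed for a larger grid at test time, which is consistent with the abstract's claim of extrapolating to much larger images.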