satmae
01c561df365429f33fcd7a7faa44c985-Supplemental-Conference.pdf
A.1 Datasets
fMoW RGB: Functional Map of the World (fMoW) [17] is a dataset of high-resolution satellite image time series from across the world, with a classification task over 62 architecture categories such as airport, shipyard, and zoo. The license is provided in footnote 2. fMoW provides sequences of co-located images taken at different timestamps. The sequences vary in length, and around 60% of the samples have length greater than 2. Readers can refer to the fMoW paper [17] for statistics on the distribution of sequence lengths. We construct a temporal version of fMoW by randomly associating each image with two images of the same location but with different timestamps, where possible. For a given spatial location loc, we define $T_{loc}$ as the number of temporally distinct snapshots present in the dataset. We crop surface reflectance images from the Sentinel-2 (ESA) satellite (courtesy of the U.S. Geological Survey), consisting of 90-day composites (to reduce the impact of cloud coverage) of images at the same locations as the fMoW images. At each fMoW datapoint location, we collect a time series of Sentinel-2 images using the provided geo-coordinate bounding boxes.
SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
Unsupervised pre-training methods for large vision models have been shown to enhance performance on downstream supervised tasks. Developing similar techniques for satellite imagery presents significant opportunities, as unlabelled data is plentiful and the inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies. In this paper, we present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding and independently mask image patches across time. In addition, we demonstrate that encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial. Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to $\uparrow$ 7%), and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to $\uparrow$ 14%) and semantic segmentation.
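The idea of masking image patches independently across time can be illustrated with a short sketch. This is not the authors' implementation: the function name, the array layout, and the mask ratio of 0.75 are assumptions chosen for illustration, since the abstract does not state them. The sketch only shows that each timestep draws its own random set of visible patches.

```python
import numpy as np

def independent_temporal_mask(num_frames, num_patches, mask_ratio=0.75, seed=0):
    """Return a boolean array of shape (num_frames, num_patches); True means masked.

    Hypothetical helper: mask_ratio=0.75 is an assumed value, not taken from the text.
    """
    rng = np.random.default_rng(seed)
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    mask = np.ones((num_frames, num_patches), dtype=bool)
    for t in range(num_frames):
        # Each timestep samples its own visible patches, independent of the others.
        keep = rng.permutation(num_patches)[:num_keep]
        mask[t, keep] = False
    return mask

mask = independent_temporal_mask(num_frames=3, num_patches=196)
visible_per_frame = (~mask).sum(axis=1)
```

Every frame keeps the same number of visible patches, but their positions generally differ from frame to frame.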
KidSat: satellite imagery to map childhood poverty dataset and benchmark
Makkunda Sharma, Fan Yang, Duy-Nhat Vo, Esra Suel, Swapnil Mishra, Samir Bhatt, Oliver Fiala, William Rudgard, Seth Flaxman
Satellite imagery has emerged as an important tool to analyse demographic, health, and development indicators. While various deep learning models have been built for these tasks, each is specific to a particular problem, with few standard benchmarks available. We propose a new dataset pairing satellite imagery and high-quality survey data on child poverty to benchmark satellite feature representations. Our dataset consists of 33,608 images, each 10 km $\times$ 10 km, from 19 countries in Eastern and Southern Africa in the time period 1997-2022. As defined by UNICEF, multidimensional child poverty covers six dimensions and it can be calculated from the face-to-face Demographic and Health Surveys (DHS) Program. As part of the benchmark, we test spatial as well as temporal generalization, by testing on unseen locations, and on data after the training years. Using our dataset we benchmark multiple models, from low-level satellite imagery models such as MOSAIKS, to deep learning foundation models, which include both generic vision models such as Self-Distillation with no Labels (DINOv2) models and specific satellite imagery models such as SatMAE. We provide open source code for building the satellite dataset, obtaining ground truth data from DHS and running various models assessed in our work.
A Appendix
A.1 Datasets
fMoW RGB: Functional Map of the World (fMoW) [17] is a dataset of high-resolution satellite image time series from across the world, with a classification task over 62 architecture categories such as airport, shipyard, and zoo. The image sequences in fMoW vary in length, and around 60% of the samples have length greater than 2. Readers can refer to the fMoW paper [17] for statistics on the distribution of sequence lengths. We construct a temporal version of fMoW by randomly associating each image with two images of the same location but with different timestamps, where possible. We crop surface reflectance images from the Sentinel-2 (ESA) satellite (courtesy of the U.S. Geological Survey), consisting of 90-day composites (to reduce the impact of cloud coverage) of images at the same locations as the fMoW images. At each fMoW datapoint location, we collect a time series of Sentinel-2 images using the provided geo-coordinate bounding boxes. We discard locations where all fMoW images predate the Sentinel-2 time range.
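The construction of the temporal version of fMoW described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record schema (image_id, location_id, timestamp) and the function name are hypothetical, and the behaviour when fewer than two other timestamps exist (returning a shorter group) is one possible reading of "where possible".

```python
import random
from collections import defaultdict

def build_temporal_groups(records, seed=0):
    """Associate each image with up to two images of the same location
    taken at different timestamps.

    records: list of (image_id, location_id, timestamp) tuples (assumed schema).
    Returns a dict mapping image_id to a list [image_id, companion, companion].
    """
    rng = random.Random(seed)
    by_location = defaultdict(list)
    for image_id, location_id, timestamp in records:
        by_location[location_id].append((image_id, timestamp))

    groups = {}
    for image_id, location_id, timestamp in records:
        # Candidates share the location but have a different timestamp.
        candidates = [other_id for other_id, other_ts in by_location[location_id]
                      if other_ts != timestamp]
        # Assumption: if fewer than two candidates exist, keep what is available.
        k = min(2, len(candidates))
        groups[image_id] = [image_id] + rng.sample(candidates, k)
    return groups

records = [
    ("a", "loc1", 1), ("b", "loc1", 2), ("c", "loc1", 3),
    ("d", "loc2", 1),
    ("e", "loc3", 1), ("f", "loc3", 2),
]
groups = build_temporal_groups(records)
```

In this toy input, image "d" has no companions because its location has a single snapshot, while "e" gets only one companion.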
SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery
Unsupervised pre-training methods for large vision models have been shown to enhance performance on downstream supervised tasks. Developing similar techniques for satellite imagery presents significant opportunities, as unlabelled data is plentiful and the inherent temporal and multi-spectral structure provides avenues to further improve existing pre-training strategies. In this paper, we present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on Masked Autoencoder (MAE). To leverage temporal information, we include a temporal embedding and independently mask image patches across time. In addition, we demonstrate that encoding multi-spectral data as groups of bands with distinct spectral positional encodings is beneficial. Our approach yields strong improvements over previous state-of-the-art techniques, both in terms of supervised learning performance on benchmark datasets (up to $\uparrow$ 7%), and transfer learning performance on downstream remote sensing tasks, including land cover classification (up to $\uparrow$ 14%) and semantic segmentation.