PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

Manuel Weber, Carly Beneke

arXiv.org Artificial Intelligence 

ABSTRACT

We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualizing the attention scores, and we demonstrate the model's applicability to downstream tasks.

1 INTRODUCTION

Foundation models (FM) for earth observation (EO) have gained traction following the success of large language models (LLM) and their demonstration of scaling laws (Kaplan et al., 2020). The premise is that training larger models on vast datasets enhances performance. This idea has been central to computer vision, where datasets like ImageNet (Deng et al., 2009) have enabled pre-training in both supervised and unsupervised settings, leading to breakthroughs in model design and training.
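The abstract describes fusing an arbitrary number of input bands into a single representation via attention. A minimal sketch of one way such a fusion can work is a softmax-weighted combination of per-band patch embeddings; everything here (the function name `fuse_bands`, the single learned query vector, the per-band keys) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

def fuse_bands(band_tokens, query, keys):
    """Fuse per-band patch embeddings into one token via attention.

    band_tokens: (num_bands, dim) embedding of one patch, per band
    query:       (dim,) hypothetical learned query vector
    keys:        (num_bands, dim) hypothetical learned per-band keys
    Returns the fused (dim,) token and the per-band attention weights.
    """
    # Scaled dot-product scores between the query and each band's key
    scores = keys @ query / np.sqrt(query.shape[0])      # (num_bands,)
    # Numerically stable softmax over bands
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Weighted sum collapses all bands into a single representation
    fused = weights @ band_tokens                        # (dim,)
    return fused, weights

rng = np.random.default_rng(0)
num_bands, dim = 5, 16                                   # e.g. 5 sensor bands
band_tokens = rng.normal(size=(num_bands, dim))
query = rng.normal(size=dim)
keys = rng.normal(size=(num_bands, dim))

fused, weights = fuse_bands(band_tokens, query, keys)
```

Because the number of bands only appears as the length of the softmax axis, the same mechanism handles any number of mixed inputs, and the resulting `weights` can be visualized directly, which is consistent with the interpretability claim in the abstract.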
