PyViT-FUSE: A Foundation Model for Multi-Sensor Earth Observation Data

Manuel Weber, Carly Beneke

arXiv.org Artificial Intelligence 

ABSTRACT

We propose PyViT-FUSE, a foundation model for earth observation data explicitly designed to handle multi-modal imagery by learning to fuse an arbitrary number of mixed-resolution input bands into a single representation through an attention mechanism. The learned patch tokens are further processed by a stack of vision transformers with a novel pyramidal structure. We train the model on a globally sampled dataset in a self-supervised manner, leveraging core concepts of the SwAV algorithm. We show the interpretability of the fusion mechanism by visualizing the attention scores, and we demonstrate the model's applicability to downstream tasks.

1 INTRODUCTION

Foundation models (FM) for earth observation (EO) have gained traction following the success of large language models (LLM) and their demonstration of scaling laws (Kaplan et al., 2020). The premise is that training larger models on vast datasets enhances performance. This idea has been central to computer vision, where datasets like ImageNet (Deng et al., 2009) have enabled pre-training in both supervised and unsupervised settings, leading to breakthroughs in model design and training.
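The abstract describes fusing an arbitrary number of input bands into a single representation via attention. A minimal sketch of one way such a fusion can work is a softmax-weighted combination of per-band patch embeddings; everything here (the function name `fuse_bands`, the single learned query vector, the per-band keys) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

def fuse_bands(band_tokens, query, keys):
    """Fuse per-band patch embeddings into one token via attention.

    band_tokens: (num_bands, dim) embedding of one patch, per band
    query:       (dim,) hypothetical learned query vector
    keys:        (num_bands, dim) hypothetical learned per-band keys
    Returns the fused (dim,) token and the per-band attention weights.
    """
    # Scaled dot-product scores between the query and each band's key
    scores = keys @ query / np.sqrt(query.shape[0])      # (num_bands,)
    # Numerically stable softmax over bands
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    # Weighted sum collapses all bands into a single representation
    fused = weights @ band_tokens                        # (dim,)
    return fused, weights

rng = np.random.default_rng(0)
num_bands, dim = 5, 16                                   # e.g. 5 sensor bands
band_tokens = rng.normal(size=(num_bands, dim))
query = rng.normal(size=dim)
keys = rng.normal(size=(num_bands, dim))

fused, weights = fuse_bands(band_tokens, query, keys)
```

Because the number of bands only appears as the length of the softmax axis, the same mechanism handles any number of mixed inputs, and the resulting `weights` can be visualized directly, which is consistent with the interpretability claim in the abstract.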
