FeatUp: A Model-Agnostic Framework for Features at Any Resolution
Stephanie Fu, Mark Hamilton, Laura Brandt, Axel Feldman, Zhoutong Zhang, William T. Freeman
arXiv.org Artificial Intelligence
High-resolution features can be learned either as a per-image implicit network or as a general-purpose upsampling operation; the latter is a drop-in module that improves downstream dense prediction tasks. Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction, because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.

Despite their immense success, deep features often sacrifice spatial resolution for semantic quality. For example, ResNet-50 (He et al., 2015) produces 7×7 deep feature maps from a 224×224 input. Even Vision Transformers (ViTs) (Dosovitskiy et al., 2020) incur a significant resolution reduction, making it challenging to perform dense prediction tasks such as segmentation or depth estimation using these features alone.

Figure: FeatUp learns to upsample features through a consistency loss on low-resolution "views" of a model's features that arise from slight transformations of the input image.
To mitigate these issues, we propose FeatUp: a novel framework to improve the resolution of any vision model's features without changing their original "meaning" or orientation. Our primary insight, inspired by 3D reconstruction frameworks like NeRF (Mildenhall et al., 2020), is that multi-view consistency of low-resolution signals can supervise the construction of high-resolution signals. More specifically, we learn high-resolution information by aggregating low-resolution views of a model's outputs across multiple "jittered" (i.e., slightly transformed) versions of the input image. Our work explores two architectures for upsampling: a single guided upsampling feedforward network that generalizes across images, and an implicit representation overfit to a single image.
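The multi-view consistency idea can be illustrated with a toy NumPy sketch. This is not the FeatUp implementation: it assumes integer-pixel translations as the "jitters" and average pooling as a stand-in for the model's resolution loss, and all function names are illustrative. The candidate high-resolution features are jittered and downsampled, then compared against the low-resolution feature "views" one would observe from the jittered inputs.

```python
import numpy as np

def jitter(feat, dx, dy):
    """Shift a (H, W, C) array by (dx, dy) pixels with zero padding -- a toy 'view' transform."""
    out = np.zeros_like(feat)
    h, w = feat.shape[:2]
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        feat[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def downsample(feat, factor):
    """Average-pool by `factor` -- a stand-in for the backbone's spatial pooling."""
    h, w, c = feat.shape
    return feat.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def multiview_consistency_loss(hi_res_feat, low_res_views, jitters, factor):
    """MSE between downsampled jittered hi-res features and the observed low-res views."""
    loss = 0.0
    for (dx, dy), observed in zip(jitters, low_res_views):
        predicted = downsample(jitter(hi_res_feat, dx, dy), factor)
        loss += np.mean((predicted - observed) ** 2)
    return loss / len(jitters)
```

In FeatUp this loss supervises an upsampler (or a per-image implicit network) by gradient descent; a candidate high-resolution feature map that explains every jittered low-resolution view achieves zero loss, while inconsistent candidates are penalized.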
Apr-1-2024