DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions

Neural Information Processing Systems 

To answer this question, we begin by revisiting the forward procedure of ViTs. A sequence of positional embeddings (PEs) [51] is added to the patch embeddings to preserve position information. Intuitively, simply discarding these PEs and asking the model to reconstruct the position of each patch naturally yields a location-aware pretext task.
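To make the idea concrete, below is a minimal sketch of such a position-reconstruction pretext task, not the authors' implementation: positional embeddings are withheld for a random subset of patches, and the model classifies each affected patch's original grid position. The class `DropPosPretextSketch`, the `drop_ratio` value, and the assumption that the encoder maps `(B, N, D)` tokens to `(B, N, D)` features are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropPosPretextSketch(nn.Module):
    """Hypothetical sketch of a drop-and-reconstruct-positions pretext task."""

    def __init__(self, encoder: nn.Module, num_patches: int, embed_dim: int,
                 drop_ratio: float = 0.75):
        super().__init__()
        self.encoder = encoder  # any ViT-style encoder: (B, N, D) -> (B, N, D)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        self.head = nn.Linear(embed_dim, num_patches)  # classify over N positions
        self.drop_ratio = drop_ratio

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, N, D) patch embeddings *before* PEs are added.
        B, N, _ = patch_embeds.shape
        # Randomly decide which patches keep their positional embedding.
        keep = torch.rand(B, N, device=patch_embeds.device) > self.drop_ratio
        pos = self.pos_embed.expand(B, -1, -1)
        x = patch_embeds + torch.where(keep.unsqueeze(-1), pos,
                                       torch.zeros_like(pos))
        feats = self.encoder(x)     # (B, N, D)
        logits = self.head(feats)   # (B, N, num_patches)
        # Supervise only the patches whose PE was dropped: the target for
        # each patch is simply its own index in the sequence.
        target = torch.arange(N, device=x.device).expand(B, N)
        return F.cross_entropy(logits[~keep], target[~keep])
```

In this toy formulation, reconstruction is a per-patch N-way classification over positions, and the loss is computed only on patches whose PE was dropped, mirroring how masked-prediction objectives supervise only the corrupted tokens.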
