Mother's Day is coming up, and online retailers are gearing up for its arrival with some tech deals. Currently, the Microsoft Store is selling the Asus Transformer Mini T102HA-C4-GR for $250. This two-in-one convertible PC has a detachable keyboard, allowing you to use it as either a Windows 10 tablet or a laptop. When we looked at this 2-in-1 in December, it delivered a pretty good experience overall, with great battery life but lower-end performance. That's in part due to the 1.44GHz Intel "Cherry Trail" Atom x5-Z8350 processor, which goes easier on power consumption but is one of Intel's slower CPUs.
The paper discusses the theoretical background for why CNNs are efficient at modeling visual domains. Looking at the overall pipeline, we can identify four distinct building blocks in the diagram above. First, the input RGB image is split into patches by the patch partition layer. Each patch is 4 x 4 x 3 (3 for the RGB channels) and is considered a "token". Each patch is then passed through a linear embedding layer, which projects it to a C-dimensional token, as in ViT.
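The patch partition and linear embedding steps can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the paper's implementation: the function names, the 224 x 224 input size, the value C = 96, and the random weights are all assumptions made for the example.

```python
import numpy as np

def patch_partition(image, patch_size=4):
    """Split an (H, W, 3) image into non-overlapping patches, one flat token per patch."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must divide evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def linear_embedding(tokens, weight, bias):
    """Project each flattened patch to a C-dimensional token."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))      # assumed input size
C = 96                                 # assumed embedding dimension
tokens = patch_partition(image)        # each token has 4 * 4 * 3 = 48 values
weight = rng.standard_normal((tokens.shape[1], C)) * 0.02  # stand-in for learned weights
bias = np.zeros(C)
embedded = linear_embedding(tokens, weight, bias)
print(tokens.shape, embedded.shape)    # (3136, 48) (3136, 96)
```

With a 224 x 224 input, the 4 x 4 patches form a 56 x 56 grid, so the sketch yields 3136 tokens of 48 raw values each before the projection to C dimensions.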
The transformer is a type of neural network based mainly on the self-attention mechanism. Transformers are widely used in the field of natural language processing (NLP), e.g., the famous BERT and GPT-3 models. Inspired by the breakthrough of transformers in NLP, researchers have recently applied them to computer vision (CV) tasks such as image recognition, object detection, and image processing. For example, DETR treats object detection as a direct set prediction problem and solves it using a transformer encoder-decoder architecture. Compared to the mainstream CNN models, these transformer-based models have also shown promising performance on visual tasks.
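Since self-attention is the core operation referred to here, a minimal sketch of single-head scaled dot-product self-attention may be useful. It is a generic illustration, not the code of BERT, GPT-3, or DETR; the function names, shapes, and random weights are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence of tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise token similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4            # assumed toy sizes
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))
w_v = rng.standard_normal((d_model, d_head))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)                  # (5, 4) (5, 5)
```

Every output token is a weighted average of the value vectors of all tokens, which is what lets the model relate any two positions in the sequence in a single layer.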
We present a deep learning framework for accurate visual correspondences and demonstrate its effectiveness for both geometric and semantic matching, spanning from rigid motions to intra-class shape or appearance variations. In contrast to previous CNN-based approaches that optimize a surrogate patch similarity objective, we use deep metric learning to directly learn a feature space that preserves either geometric or semantic similarity. Our fully convolutional architecture, along with a novel correspondence contrastive loss, allows faster training by effective reuse of computations, accurate gradient computation through the use of thousands of examples per image pair, and faster testing with $O(n)$ feedforward passes for $n$ keypoints, instead of $O(n^2)$ for typical patch similarity methods. We propose a convolutional spatial transformer to mimic patch normalization in traditional features like SIFT, which is shown to dramatically boost accuracy for semantic correspondences across intra-class shape variations. Extensive experiments on the KITTI, PASCAL and CUB-2011 datasets demonstrate the significant advantages of our features over prior work that uses either hand-constructed or learned features.
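The abstract does not spell out the form of the correspondence contrastive loss. As a rough illustration of the idea, the sketch below applies a generic margin-based contrastive loss to pairs of feature vectors sampled at candidate correspondences. The function name, the margin value, the normalization by 2N, and the toy data are assumptions, not details taken from the abstract.

```python
import numpy as np

def contrastive_loss(feat_a, feat_b, is_match, margin=1.0):
    """Generic contrastive loss over N feature pairs.

    feat_a, feat_b: (N, D) features at paired locations in the two images.
    is_match: (N,) array with 1 for true correspondences and 0 for negatives.
    """
    dist = np.linalg.norm(feat_a - feat_b, axis=1)
    pull = is_match * dist ** 2                                   # matches: pull together
    push = (1 - is_match) * np.maximum(0.0, margin - dist) ** 2   # negatives: push past margin
    return (pull + push).sum() / (2 * len(dist))

# Toy data: one positive pair and one negative pair, both at distance 0.5.
feat_a = np.zeros((2, 2))
feat_b = np.array([[0.3, 0.4], [0.3, 0.4]])
is_match = np.array([1, 0])
loss = contrastive_loss(feat_a, feat_b, is_match)
print(loss)
```

Because the features for all locations come from a single fully convolutional pass per image, a loss of this kind can be evaluated on thousands of pairs per image pair without re-running the network for each patch.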