Using Transformer models has never been simpler! That is what Simple Transformers author Thilina Rajapakse says, and I agree with him; so should you. You might have seen lengthy code with hundreds of lines to implement transformer models such as BERT, RoBERTa, etc. Once you understand how to use Simple Transformers, you will see how easy and simple it is to work with transformer models. The Simple Transformers library is built on top of the Hugging Face Transformers library. Hugging Face Transformers provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet, T5, etc.) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with more than a thousand pre-trained models covering around 100 languages.
Imagine going to your local hardware store and seeing a new kind of hammer on the shelf. You've heard about this hammer: It pounds faster and more accurately than others, and in the last few years it's rendered many other hammers obsolete, at least for most uses. With a few tweaks -- an attachment here, a twist there -- the tool changes into a saw that can cut at least as fast and as accurately as any other option out there. In fact, some experts at the frontiers of tool development say this hammer might just herald the convergence of all tools into a single device. A similar story is playing out among the tools of artificial intelligence.
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is arguably the main source of this success, but it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Under this mechanism, each token attends its closest surrounding tokens at fine granularity and tokens far away at coarse granularity, and can thus capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M parameters and a larger size of 89.8M parameters achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for six different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, setting a new SoTA on three of the most challenging computer vision tasks.
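As a rough illustration of why attending nearby tokens at fine granularity and distant tokens at coarse granularity saves work, the sketch below counts how many keys a single query attends under full self-attention versus a focal-style scheme. This is a hedged back-of-the-envelope sketch, not the paper's implementation; the window size and pooling factor are illustrative assumptions, not Focal Transformer's actual settings.

```python
# Hedged sketch: count attended positions per query under full
# self-attention vs. a focal-style scheme (fine local window plus
# coarse pooled global regions). Sizes below are illustrative only.

def full_attention_keys(h, w):
    """Every token attends every other token: h*w keys per query,
    so the total number of attention pairs grows as (h*w)**2."""
    return h * w

def focal_attention_keys(h, w, fine_window=7, pool=4):
    """Fine-grained keys from a local window plus coarse-grained keys
    from the whole map summarized into pool x pool regions."""
    fine = fine_window * fine_window    # nearby tokens, full detail
    coarse = (h // pool) * (w // pool)  # far tokens, pooled summaries
    return fine + coarse

h = w = 56  # e.g., a 56x56 feature map
print(full_attention_keys(h, w))   # 3136 keys per query
print(focal_attention_keys(h, w))  # 49 fine + 196 coarse = 245 keys per query
```

Even in this toy setup, each query touches roughly an order of magnitude fewer keys, while still receiving a (coarse) signal from the entire feature map.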
GasHis-Transformer is a model for gastric histopathological image classification (GHIC), which automatically classifies microscopic images of the stomach into normal and abnormal cases for gastric cancer diagnosis, as shown in the figure. GasHis-Transformer is a multi-scale image classification model that combines the best features of the Vision Transformer (ViT) and CNN, where ViT is good at capturing global information and CNN is good at capturing local information. GasHis-Transformer consists of two important modules, the Global Information Module (GIM) and the Local Information Module (LIM), as shown in the figure below. GasHis-Transformer achieves high classification performance on the test data of the gastric histopathology dataset, with estimated precision, recall, F1-score, and accuracy of 98.0%, 100.0%, 96.0%, and 98.0%, respectively.
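To make the global-versus-local split concrete, here is a minimal conceptual sketch (emphatically not GasHis-Transformer's code): a GIM-style descriptor summarizes the whole feature map, while a LIM-style descriptor summarizes a small neighborhood, mirroring how ViT sees globally and a CNN sees locally. The function names and the 3x3 neighborhood are hypothetical choices for illustration.

```python
# Hedged conceptual sketch of a global (GIM-like) vs. local (LIM-like)
# view of a 2D feature map. Not the actual GasHis-Transformer modules.

def global_descriptor(feature_map):
    """GIM-style summary: mean over every position (ViT-like global view)."""
    flat = [v for row in feature_map for v in row]
    return sum(flat) / len(flat)

def local_descriptor(feature_map, r, c):
    """LIM-style summary: mean over a 3x3 neighborhood around (r, c),
    clipped at the borders (CNN-like local view)."""
    vals = []
    for i in range(max(0, r - 1), min(len(feature_map), r + 2)):
        for j in range(max(0, c - 1), min(len(feature_map[0]), c + 2)):
            vals.append(feature_map[i][j])
    return sum(vals) / len(vals)

fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(global_descriptor(fmap))       # 5.0: sees the whole map
print(local_descriptor(fmap, 0, 0))  # 3.0: mean of the corner's neighbors
```

A hybrid model would feed both kinds of descriptors into a classifier, so that a prediction can draw on both tissue-wide context and fine local texture.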
Even though transformers (Vaswani et al., Devlin et al.) have become the state of the art and are on par with humans for several natural language processing (NLP) tasks, their application in vision has been severely limited by their quadratic complexity with respect to sequence length. Even low-resolution images, when unrolled, become long 1D sequences of tens of thousands of pixels, and impose a large computational and memory burden on a GPU. A transformer, being a general architecture without an inductive prior, also requires a large number of training images to generalize as well as convolutional models. It also needs extra architectural changes, including the addition of positional embeddings, to capture the positional information of image pixels. This demand for large amounts of data and GPU resources makes transformers unsuitable for resource-constrained scenarios where data and GPU capabilities are limited, such as green or edge computing (Khan et al.). On the other hand, CNNs have inductive priors for handling 2D images, such as translational equivariance due to convolutional weight sharing and partial scale invariance due to pooling, which enable them to learn from smaller datasets with less computational expenditure. However, they fail to capture long-range dependencies compared to transformers and require deeper networks with several layers to increase their receptive fields. Combining the efficiency and inductive priors of CNNs with the long-range information-capturing ability of attention can create better architectures suitable for computer vision applications.
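The quadratic burden described above is easy to quantify. The sketch below is a back-of-the-envelope calculation, assuming one token per pixel and float32 attention scores: an H x W image unrolled into N = H*W tokens needs an N x N score matrix for a single attention map.

```python
# Hedged back-of-the-envelope sketch of self-attention's quadratic cost
# over raw pixels (one token per pixel, float32 scores assumed).

def attention_matrix_entries(height, width):
    """An unrolled image of N = height*width tokens needs an N x N
    matrix of pairwise attention scores."""
    n = height * width
    return n * n

# Even a "low-resolution" 224x224 image unrolls to ~50k tokens.
n_tokens = 224 * 224                          # 50176 tokens
entries = attention_matrix_entries(224, 224)  # 2517630976 scores
gib = entries * 4 / 2**30                     # 4 bytes per float32 score
print(n_tokens, entries, round(gib, 1))       # ~9.4 GiB for ONE attention map
```

One attention map alone would consume most of a typical GPU's memory, before counting activations, multiple heads, or multiple layers, which is why pixel-level full attention is impractical and why patching, pooling, or sparse/focal attention schemes are needed.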