LiT-4-RSVQA: Lightweight Transformer-based Visual Question Answering in Remote Sensing
Hackel, Leonard, Clasen, Kai Norman, Ravanbakhsh, Mahdyar, Demir, Begüm
Visual question answering (VQA) methods in remote sensing (RS) aim to answer natural language questions about an RS image. Most existing methods require a large amount of computational resources, which limits their application in operational RS scenarios. To address this issue, in this paper we present LiT-4-RSVQA, an effective lightweight transformer-based architecture for efficient and accurate VQA in RS. Our architecture consists of: i) a lightweight text encoder module; ii) a lightweight image encoder module; iii) a fusion module; and iv) a classification module. The experimental results obtained on a VQA benchmark dataset demonstrate that our proposed LiT-4-RSVQA architecture provides accurate VQA results while significantly reducing the computational requirements on the executing hardware. Our code is publicly available at https://git.tu-berlin.de/rsim/lit4rsvqa.
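The four-module pipeline above can be sketched in plain NumPy. Everything here is a hypothetical stand-in, not the paper's implementation: the embedding size, patch geometry, answer-class count, and the internals of each module (mean pooling, a linear projection, concatenate-and-project fusion) are illustrative assumptions chosen only to show how the text branch, image branch, fusion, and classifier compose.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_ANSWERS = 64, 9  # hypothetical embedding size and answer-class count

# i) lightweight text encoder (stand-in: random embedding table + mean pooling)
emb_table = rng.standard_normal((1000, DIM))
def encode_text(token_ids):
    return emb_table[token_ids].mean(axis=0)

# ii) lightweight image encoder (stand-in: linear projection of flattened patches)
W_img = rng.standard_normal((16 * 16 * 3, DIM)) * 0.01
def encode_image(patches):                 # patches: (n_patches, 16*16*3)
    return (patches @ W_img).mean(axis=0)

# iii) fusion module (stand-in: concatenate both modalities, then project)
W_fuse = rng.standard_normal((2 * DIM, DIM)) * 0.1
def fuse(text_vec, image_vec):
    return np.tanh(np.concatenate([text_vec, image_vec]) @ W_fuse)

# iv) classification module over a fixed answer vocabulary
W_cls = rng.standard_normal((DIM, N_ANSWERS)) * 0.1
def classify(h):
    logits = h @ W_cls
    p = np.exp(logits - logits.max())
    return p / p.sum()

question = rng.integers(0, 1000, size=7)          # toy question token ids
image = rng.standard_normal((196, 16 * 16 * 3))   # 14x14 grid of 16x16 RGB patches
probs = classify(fuse(encode_text(question), encode_image(image)))
answer_id = int(probs.argmax())                   # index into the answer vocabulary
```

Treating VQA as classification over a fixed answer vocabulary (module iv) is a common design in RS VQA benchmarks, which is why the sketch ends in a softmax rather than free-form text generation.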
Detecting Severity of Diabetic Retinopathy from Fundus Images using Ensembled Transformers
Adak, Chandranath, Karkera, Tejas, Chattopadhyay, Soumi, Saqib, Muhammad
Diabetic Retinopathy (DR) is a leading cause of vision loss among people with diabetes globally. The severity of DR is typically assessed manually by ophthalmologists from fundus photographs of the retina. This paper deals with an automated understanding of the severity stages of DR. In the literature, researchers have approached this automation using traditional machine learning algorithms and convolutional architectures. However, past works have rarely focused on the essential regions of the retinal image to improve model performance. In this paper, we adopt transformer-based learning models to capture the crucial features of retinal images and understand DR severity better. We ensemble four image transformers, namely ViT (Vision Transformer), BEiT (Bidirectional Encoder representation for image Transformer), CaiT (Class-Attention in Image Transformers), and DeiT (Data-efficient image Transformers), to infer the degree of DR severity from fundus photographs. For experiments, we used the publicly available APTOS-2019 blindness detection dataset, where the performances of the transformer-based models were quite encouraging.
- Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.95)
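The ensembling step above can be illustrated with a minimal NumPy sketch. The abstract does not specify the combination rule, so soft voting (averaging per-model class probabilities) is an assumption here, as is representing each fine-tuned model by a vector of random stand-in logits; APTOS-2019 grades DR severity on a 0-4 scale, which the class count reflects.

```python
import numpy as np

rng = np.random.default_rng(1)
N_CLASSES = 5  # APTOS-2019 severity grades: 0 (no DR) through 4 (proliferative DR)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stand-ins for the four fine-tuned models' logits on one fundus image.
logits = {m: rng.standard_normal(N_CLASSES)
          for m in ["ViT", "BEiT", "CaiT", "DeiT"]}

# Soft-voting ensemble: average the per-model class probabilities,
# then take the most probable severity grade.
probs = np.mean([softmax(z) for z in logits.values()], axis=0)
severity = int(probs.argmax())
```

Averaging probabilities rather than raw logits keeps each model's contribution on a common scale even when the models' logit magnitudes differ.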
TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models
Li, Minghao, Lv, Tengchao, Chen, Jingye, Cui, Lei, Lu, Yijuan, Florencio, Dinei, Zhang, Cha, Li, Zhoujun, Wei, Furu
Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on a CNN for image understanding and an RNN for character-level text generation. In addition, another language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at \url{https://aka.ms/trocr}.
- Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
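The encoder-decoder shape of the approach above (an image Transformer encoding patches, a text Transformer generating wordpieces autoregressively) can be sketched as a greedy decoding loop. This is a toy illustration, not TrOCR itself: the encoder and decoder step are random-weight stand-ins, and the vocabulary size, hidden size, and special-token ids are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB, DIM = 100, 32   # toy wordpiece vocabulary and hidden size (assumptions)
BOS, EOS = 0, 1        # assumed begin/end-of-sequence token ids

# Stand-in image encoder: one hidden vector per image patch ("memory").
W_enc = rng.standard_normal((48, DIM)) * 0.1
def encode_image(patches):                 # patches: (n_patches, 48)
    return patches @ W_enc

# Stand-in decoder step: pool the encoder memory (crude cross-attention),
# shift by the embedding of the last generated wordpiece, score the vocabulary.
emb_table = rng.standard_normal((VOCAB, DIM))
W_out = rng.standard_normal((DIM, VOCAB)) * 0.1
def decoder_step(memory, prefix):
    h = memory.mean(axis=0) + emb_table[prefix[-1]]
    return h @ W_out                       # logits over wordpieces

def greedy_decode(patches, max_len=10):
    memory = encode_image(patches)
    out = [BOS]
    for _ in range(max_len):
        nxt = int(decoder_step(memory, out).argmax())
        out.append(nxt)
        if nxt == EOS:                     # stop once end-of-sequence is emitted
            break
    return out[1:]                         # drop the BOS prefix

patches = rng.standard_normal((49, 48))    # e.g. a 7x7 grid of flattened patches
out_ids = greedy_decode(patches)
```

The point of the end-to-end design in the abstract is visible even in this toy: the same decoding loop serves both recognition and language modeling, removing the separate post-processing language model that CNN+RNN pipelines need.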
Vision Transformers: Natural Language Processing (NLP) Increases Efficiency and Model Generality
There has been no shortage of developments vying for a share of your attention over the last year or so. However, if you regularly follow the state of machine learning research you may recall a loud contender for a share of your mind in OpenAI's GPT-3 and the accompanying business strategy developments from the group. GPT-3 is the latest and by far the largest model in OpenAI's general-purpose transformer lineage for natural language processing. Of course, GPT-3 and the GPTs may grab headlines, but they belong to a much larger superfamily of transformer models, including a plethora of variants based on the Bidirectional Encoder Representations from Transformers (BERT) family originally created by Google, as well as other smaller families of models from Facebook and Microsoft. For an expansive but still not exhaustive overview of major NLP transformers, the leading resource is probably the Apache 2.0-licensed Hugging Face library.
Simple, Distributed, and Accelerated Probabilistic Programming
Tran, Dustin, Hoffman, Matthew W., Moore, Dave, Suter, Christopher, Vasudevan, Srinivas, Radul, Alexey
Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational auto-encoder (VAE) with 2nd-generation tensor processing units (TPUv2s); a data-parallel autoregressive model (Image Transformer) with TPUv2s; and multi-GPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3. Papers published at the Neural Information Processing Systems Conference.
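NUTS, mentioned above, is an adaptive extension of Hamiltonian Monte Carlo (HMC). As a minimal plain-NumPy sketch of the core move NUTS builds on (not the multi-GPU TensorFlow implementation the abstract describes, and without the U-turn criterion that lets NUTS choose trajectory lengths automatically), here is HMC with a leapfrog integrator sampling a standard normal; the step size and trajectory length are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Target: standard normal, log p(x) = -x^2/2 up to a constant.
grad_logp = lambda x: -x

def leapfrog(x, p, step=0.1, n_steps=20):
    # Simulate one Hamiltonian trajectory with the leapfrog integrator:
    # half momentum step, alternating full position/momentum steps,
    # closing half momentum step.
    p = p + 0.5 * step * grad_logp(x)
    for _ in range(n_steps - 1):
        x = x + step * p
        p = p + step * grad_logp(x)
    x = x + step * p
    p = p + 0.5 * step * grad_logp(x)
    return x, p

def hmc_sample(n=2000):
    x, samples = 0.0, []
    for _ in range(n):
        p0 = rng.standard_normal()             # resample momentum
        x_new, p_new = leapfrog(x, p0)
        h0 = 0.5 * p0**2 + 0.5 * x**2          # initial Hamiltonian
        h1 = 0.5 * p_new**2 + 0.5 * x_new**2   # proposed Hamiltonian
        if rng.random() < np.exp(h0 - h1):     # Metropolis correction
            x = x_new
        samples.append(x)
    return np.array(samples)

s = hmc_sample()
```

Because every iteration is a fixed sequence of array operations and gradient evaluations, this kind of sampler vectorizes naturally across chains, which is what makes the GPU speedups over Stan and PyMC3 reported above possible.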