Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets

Hurt, J. Alex, Bajkowski, Trevor M., Scott, Grant J., Davis, Curt H.

arXiv.org Artificial Intelligence 

In 2012, AlexNet established deep convolutional neural networks (DCNNs) as the state of the art in computer vision (CV), and these networks soon led in visual tasks across many domains, including remote sensing. With the publication of visual transformers, we are witnessing the second modern leap in computational vision, and as such, it is imperative to understand how various transformer-based neural networks perform on satellite imagery. While transformers have shown high levels of performance in natural language processing and CV applications, they have yet to be evaluated on a large scale on modern remote sensing data. In this paper, we explore the use of transformer-based neural networks for object detection in high-resolution electro-optical satellite imagery, demonstrating state-of-the-art performance on a variety of publicly available benchmark datasets. We compare eleven distinct bounding-box detection and localization algorithms in this study, of which seven were published since 2020, and all eleven since 2015. The performance of five transformer-based architectures is compared with that of six convolutional networks on three state-of-the-art open-source high-resolution remote sensing imagery datasets ranging in size and complexity. Following the training and evaluation of thirty-three deep neural models, we then discuss and analyze model performance across various feature extraction methodologies and detection algorithms.
Machine learning and computer vision (CV) have seen the incredible rise of deep neural networks (DNNs), particularly convolutional neural networks, since the original AlexNet [1] paper in 2012. Combined with the processing power of GPUs and the increasing availability of robust pre-trained weights derived from massive image sets for techniques like transfer learning, the ability to effectively train deep models has in turn led to DNNs becoming the most widely used CV technique.
In recent years, however, the convolutional feature extractors that have long been the foundation of these DNNs have been outperformed on CV challenge benchmarks such as ImageNet [2] and COCO [3] by a newer feature extraction architecture, the transformer. Convolution-based deep neural networks have historically shown outstanding performance in CV applications; however, the recent publication of visual transformer neural networks, beginning with the original Vision Transformer (ViT) [4], has enabled a leap in computational vision capabilities. Following the publication of ViT, visual transformer architectures have been found capable of outperforming traditional convolutional networks in a variety of CV applications.
