Perceptrons
Vision Xformers: Efficient Attention for Image Classification
We propose three improvements to vision transformers (ViT) to reduce the number of trainable parameters without compromising classification accuracy. We address two shortcomings of the early ViT architectures -- quadratic bottleneck of the attention mechanism and the lack of an inductive bias in their architectures that rely on unrolling the two-dimensional image structure. Linear attention mechanisms overcome the bottleneck of quadratic complexity, which restricts application of transformer models in vision tasks. We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers, such as Performer, Linformer and Nystr\"omformer of linear complexity creating Vision X-formers (ViX). We show that all three versions of ViX may be more accurate than ViT for image classification while using far fewer parameters and computational resources. We also compare their performance with FNet and multi-layer perceptron (MLP) mixer. We further show that replacing the initial linear embedding layer by convolutional layers in ViX further increases their performance. Furthermore, our tests on recent vision transformer models, such as LeViT, Convolutional vision Transformer (CvT), Compact Convolutional Transformer (CCT) and Pooling-based Vision Transformer (PiT) show that replacing the attention with Nystr\"omformer or Performer saves GPU usage and memory without deteriorating the classification accuracy. We also show that replacing the standard learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE) give further improvements in accuracy. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
Applications of Artificial Neural Networks in Microorganism Image Analysis: A Comprehensive Review from Conventional Multilayer Perceptron to Popular Convolutional Neural Network and Potential Visual Transformer
Zhang, Jinghua, Li, Chen, Grzegorzek, Marcin
Microorganisms are widely distributed in the human daily living environment. They play an essential role in environmental pollution control, disease prevention and treatment, and food and drug production. The identification, counting, and detection are the basic steps for making full use of different microorganisms. However, the conventional analysis methods are expensive, laborious, and time-consuming. To overcome these limitations, artificial neural networks are applied for microorganism image analysis. We conduct this review to understand the development process of microorganism image analysis based on artificial neural networks. In this review, the background and motivation are introduced first. Then, the development of artificial neural networks and representative networks are introduced. After that, the papers related to microorganism image analysis based on classical and deep neural networks are reviewed from the perspectives of different tasks. In the end, the methodology analysis and potential direction are discussed.
Provably Efficient Lottery Ticket Discovery
Wolfe, Cameron R., Wang, Qihan, Kim, Junhyung Lyle, Kyrillidis, Anastasios
The lottery ticket hypothesis (LTH) [19] claims that randomly-initialized, dense neural networks contain (sparse) subnetworks that, when trained an equal amount in isolation, can match the dense network's performance. Although LTH is useful for discovering efficient network architectures, its three-step process--pre-training, pruning, and re-training--is computationally expensive, as the dense model must be fully pre-trained. Luckily, "early-bird" tickets can be discovered within neural networks that are minimally pre-trained [67], allowing for the creation of efficient, LTH-inspired training procedures. Yet, no theoretical foundation of this phenomenon exists. We derive an analytical bound for the number of pre-training iterations that must be performed for a winning ticket to be discovered, thus providing a theoretical understanding of when and why such early-bird tickets exist. By adopting a greedy forward selection pruning strategy [65], we directly connect the pruned network's performance to the loss of the dense network from which it was derived, revealing a threshold in the number of pre-training iterations beyond which high-performing subnetworks are guaranteed to exist. We demonstrate the validity of our theoretical results across a variety of architectures and datasets, including multi-layer perceptrons (MLPs) trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR10 and ImageNet.
Multi-modal Residual Perceptron Network for Audio-Video Emotion Recognition
Chang, Xin, Skarbek, Władysław
Audio-Video Emotion Recognition is now attacked with Deep Neural Network modeling tools. In published papers, as a rule, the authors show only cases of the superiority in multi-modality over audio-only or video-only modality. However, there are cases superiority in uni-modality can be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better performance in the existing late fusion and end-to-end multi-modal network training strategies. To take advantage and overcome the deficiencies in both solutions, we define a Multi-modal Residual Perceptron Network which performs end-to-end learning from multi-modal network branches, generalizing better multi-modal feature representation. For the proposed Multi-modal Residual Perceptron Network and the novel time augmentation for streaming digital movies, the state-of-art average recognition rate was improved to 91.4% for The Ryerson Audio-Visual Database of Emotional Speech and Song dataset and to 83.15% for Crowd-sourced Emotional multi-modal Actors dataset. Moreover, the Multi-modal Residual Perceptron Network concept shows its potential for multi-modal applications dealing with signal sources not only of optical and acoustical types.
Vision Transformers or Convolutional Neural Networks? Both!
Through the use of filters, these networks are able to generate simplified versions of the input image by creating feature maps that highlight the most relevant parts. These features are then used by a multi-layer perceptron to perform the desired classification. But recently this field has been incredibly revolutionized by the architecture of Vision Transformers (ViT), which through the mechanism of self-attention has proven to obtain excellent results on many tasks. In this article some basic aspects of Vision Transformers will be taken for granted, if you want to go deeper into the subject I suggest you read my previous overview of the architecture. Although Transformers have proven to be excellent replacements for CNNs, there is an important constraint that makes their application rather challenging, the need for large datasets.
From Common Sense Reasoning to Neural Network Models through Multiple Preferences: an overview
Giordano, Laura, Gliozzi, Valentina, Dupré, Daniele Theseider
In this paper we discuss the relationships between conditional and preferential logics and neural network models, based on a multi-preferential semantics. We propose a concept-wise multipreference semantics, recently introduced for defeasible description logics to take into account preferences with respect to different concepts, as a tool for providing a semantic interpretation to neural network models. This approach has been explored both for unsupervised neural network models (Self-Organising Maps) and for supervised ones (Multilayer Perceptrons), and we expect that the same approach might be extended to other neural network models. It allows for logical properties of the network to be checked (by model checking) over an interpretation capturing the input-output behavior of the network. For Multilayer Perceptrons, the deep network itself can be regarded as a conditional knowledge base, in which synaptic connections correspond to weighted conditionals. The paper describes the general approach, through the cases of Self-Organising Maps and Multilayer Perceptrons, and discusses some open issues and perspectives.
Total Nitrogen Estimation in Agricultural Soils via Aerial Multispectral Imaging and LIBS
Measuring soil health indicators is an important and challenging task that affects farmers' decisions on timing, placement, and quantity of fertilizers applied in the farms. Most existing methods to measure soil health indicators (SHIs) are in-lab wet chemistry or spectroscopy-based methods, which require significant human input and effort, time-consuming, costly, and are low-throughput in nature. To address this challenge, we develop an artificial intelligence (AI)-driven near real-time unmanned aerial vehicle (UAV)-based multispectral sensing (UMS) solution to estimate total nitrogen (TN) of the soil, an important macro-nutrient or SHI that directly affects the crop health. Accurate prediction of soil TN can significantly increase crop yield through informed decision making on the timing of seed planting, and fertilizer quantity and timing. We train two machine learning models including multi-layer perceptron and support vector machine to predict the soil nitrogen using a suite of data classes including multispectral characteristics of the soil and crops in red, near-infrared, and green spectral bands, computed vegetation indices, and environmental variables including air temperature and relative humidity.
A Leap among Entanglement and Neural Networks: A Quantum Survey
Massoli, Fabio Valerio, Vadicamo, Lucia, Amato, Giuseppe, Falchi, Fabrizio
Perhaps, the modern definition of AI - as the ensemble of computer systems empowered with the ability to learn from data through statistical techniques - can be dated back to 1959. Machine Learning (ML), a subclass of AI, is a discipline that aims at studying algorithms that are able to learn from experience and data to perform tasks without following explicit instructions. Often, these algorithms are based on a computational model that belongs to differentiable programming techniques, called Neural Networks (NNs). The success of such algorithms resides in their ability to learn to achieve a specific goal [95, 121], i.e., they learn to discover hidden patterns and relations among data to fulfill the task at hand [89, 120]. Mathematically, NNs are made of a sequence of transformations, called layers, composed of affine operators and elementwise nonlinearities. Then, the goal of learning is to modify the transformations' parameters to fulfill a task successfully. Whenever a model accounts for more than a couple of such layers, it is called a Deep Learning (DL) model or a Deep Neural Network (DNN). Thanks to their enormous representation power and the development of new technologies and training algorithms, DL models obtained astonishing results in the last two decades, achieving superhuman performance on certain tasks [177].
Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling
Hu, Qingyong, Yang, Bo, Xie, Linhai, Rosa, Stefano, Guo, Yulan, Wang, Zhihua, Trigoni, Niki, Markham, Andrew
Abstract--We study the problem of efficient semantic segmentation of large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Comparative experiments show that our RandLA-Net can process 1 million points in a single pass up to 200 faster than existing approaches. Moreover, extensive experiments on five large-scale point cloud datasets, including Semantic3D, SemanticKITTI, Toronto3D, NPM3D and S3DIS, demonstrate the state-of-the-art semantic segmentation performance of our RandLA-Net. A key challenge is that the raw point clouds acquired by depth sensors are typically irregularly sampled, unstructured and unordered. Recently, the pioneering work PointNet [4] has emerged as a promising approach for directly processing 3D point clouds. It learns per-point features using shared multilayer perceptrons (MLPs). This is computationally efficient but fails to capture wider context information for each point.