image recognition task
UNIT: Unifying Image and Text Recognition in One Vision Encoder
Currently, vision encoder models like Vision Transformers (ViTs) typically excel at image recognition tasks but cannot simultaneously support text recognition the way human vision does. To address this limitation, we propose UNIT, a novel training framework aimed at UNifying Image and Text recognition within a single model. Starting from a vision encoder pre-trained on image recognition tasks, UNIT introduces a lightweight language decoder for predicting text outputs and a lightweight vision decoder to prevent catastrophic forgetting of the original image encoding capabilities. The training process comprises two stages: intra-scale pretraining and inter-scale finetuning. During intra-scale pretraining, UNIT learns unified representations from multi-scale inputs, where images and documents appear at their commonly used resolutions, to establish fundamental recognition capability. In the inter-scale finetuning stage, the model is trained on scale-exchanged data, featuring images and documents at resolutions different from the most commonly used ones, to enhance its scale robustness. Notably, UNIT retains the original vision encoder architecture, incurring no extra cost at inference or deployment time. Experiments across multiple benchmarks confirm that our method significantly outperforms existing methods on document-related tasks (e.g., OCR and DocQA) while maintaining performance on natural images, demonstrating its ability to substantially enhance text recognition without compromising core image recognition capabilities.
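To make the described layout concrete, here is a minimal PyTorch sketch of a pre-trained encoder feeding a lightweight language decoder (text prediction) and a lightweight vision decoder (feature reconstruction to curb forgetting). All class names, dimensions, and head choices are our own assumptions, not the authors' implementation.

```python
import torch.nn as nn

class UNITSketch(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 768, vocab: int = 32000):
        super().__init__()
        self.encoder = encoder                        # pre-trained ViT, architecture unchanged
        self.lang_decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.token_head = nn.Linear(dim, vocab)       # lightweight text-token logits
        self.vision_decoder = nn.Linear(dim, dim)     # lightweight feature-reconstruction head

    def forward(self, pixels, text_embeds):
        feats = self.encoder(pixels)                  # (B, N, dim) patch features
        text = self.lang_decoder(text_embeds, feats)  # text queries cross-attend to features
        recon = self.vision_decoder(feats)            # regressed toward frozen-teacher features
        return self.token_head(text), recon
```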
Fast AutoAugment
Data augmentation is an essential technique for improving the generalization ability of deep learning models. Recently, AutoAugment \cite{cubuk2018autoaugment} has been proposed as an algorithm to automatically search for augmentation policies from a dataset, and it has significantly improved performance on many image recognition tasks. However, its search method requires thousands of GPU hours even for a relatively small dataset. In this paper, we propose an algorithm called Fast AutoAugment that finds effective augmentation policies via a more efficient search strategy based on density matching. Compared to AutoAugment, the proposed algorithm speeds up the search by orders of magnitude while achieving comparable performance on image recognition tasks with various models and datasets, including CIFAR-10, CIFAR-100, SVHN, and ImageNet.
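The density-matching idea can be illustrated in a few lines: a candidate policy is good if a model trained without augmentation still classifies the augmented validation split well. The sketch below scores candidates exhaustively for clarity; the actual paper explores the policy space with Bayesian optimization, and all names here are hypothetical.

```python
import torch

@torch.no_grad()
def score_policy(model, policy, val_loader):
    """Accuracy of a model trained WITHOUT augmentation on augmented val data."""
    model.eval()
    correct = total = 0
    for x, y in val_loader:
        pred = model(policy(x)).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total              # high accuracy ~ policy preserves data density

def search_policies(model, candidates, val_loader, top_k=10):
    """Keep the top-k candidate policies by density-matching score."""
    ranked = sorted(candidates, key=lambda p: score_policy(model, p, val_loader),
                    reverse=True)
    return ranked[:top_k]
```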
University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system
Taniguchi, Takara, Ueda, Yudai, Muramatsu, Atsuya, Hashimoto, Kohki, Yagi, Ryo, Ochiai, Hideya, Aswakul, Chaodit
Many industrial sectors have been using machine learning in inference mode on edge devices. Training on edge devices is also becoming promising thanks to improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edges. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct specific datasets for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Our results also demonstrate that training under WAFL scenarios achieves better accuracy than self-training scenarios. The dataset is available at https://github.com/jo2lxq/wafl/.
- Banking & Finance (0.69)
- Information Technology > Security & Privacy (0.69)
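As a rough illustration of the device-to-device learning step WAFL builds on, the sketch below averages the parameters of two peer models whenever they encounter each other over the ad hoc network, with ordinary local training continuing in between. Function names and the plain averaging rule are our own simplification, not the WAFL reference code.

```python
import torch

def wafl_exchange(model_a, model_b):
    """One ad hoc encounter: peers swap parameters and average them in place."""
    with torch.no_grad():
        for pa, pb in zip(model_a.parameters(), model_b.parameters()):
            avg = (pa.data + pb.data) / 2.0
            pa.data.copy_(avg)
            pb.data.copy_(avg)

def local_step(model, batch, loss_fn, opt):
    """Ordinary local training continues between encounters."""
    x, y = batch
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    return loss.item()
```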
Compact and Efficient Neural Networks for Image Recognition Based on Learned 2D Separable Transform
Vashkevich, Maxim, Krivalcevich, Egor
The paper presents a learned two-dimensional separable transform (LST) that can be considered a new type of computational layer for constructing neural network (NN) architectures for image recognition tasks. The LST is based on the idea of sharing the weights of one fully-connected (FC) layer to process all rows of an image. After that, a second shared FC layer is used to process all columns of the image representation obtained from the first layer. The use of LST layers in an NN architecture significantly reduces the number of model parameters compared to models that use stacked FC layers. We show that an NN classifier based on a single LST layer followed by an FC layer achieves 98.02\% accuracy on the MNIST dataset while having only 9.5k parameters. We also implemented an LST-based classifier for handwritten digit recognition on an FPGA platform to demonstrate the efficiency of the suggested approach for designing compact, high-performance implementations of NN models. Git repository with supplementary materials: https://github.com/Mak-Sim/LST-2d
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
- Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition > Image Matching (0.62)
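The row/column weight sharing maps directly to two Linear layers. Below is a minimal PyTorch sketch under our own reading of the abstract (the repository linked above is authoritative). For 28x28 MNIST inputs, the two shared FC layers contribute 812 parameters each and the classifier head 7,850, about 9.5k in total, consistent with the reported figure.

```python
import torch.nn as nn

class LST2d(nn.Module):
    """One shared FC layer over all rows, then a second over all columns."""
    def __init__(self, h=28, w=28, h_out=28, w_out=28):
        super().__init__()
        self.row_fc = nn.Linear(w, w_out)       # shared across the H rows
        self.col_fc = nn.Linear(h, h_out)       # shared across the W_out columns

    def forward(self, x):                       # x: (B, H, W)
        x = self.row_fc(x)                      # rows processed:    (B, H, W_out)
        x = self.col_fc(x.transpose(1, 2))      # columns processed: (B, W_out, H_out)
        return x.transpose(1, 2)                # back to (B, H_out, W_out)

# Single LST layer followed by one FC classifier, as in the MNIST model:
model = nn.Sequential(LST2d(), nn.Flatten(), nn.ReLU(), nn.Linear(28 * 28, 10))
```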
Janus: Collaborative Vision Transformer Under Dynamic Network Environment
Jiang, Linyi, Fu, Silvery D., Zhu, Yifei, Li, Bo
Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Network architectures and achieved state-of-the-art results in various computer vision tasks. Since ViTs are computationally expensive, the models either have to be pruned to run solely on resource-limited edge devices or have to be executed on remote cloud servers that receive the raw data over fluctuating networks. The resulting degraded accuracy or high latency hinders their widespread application. In this paper, we present Janus, the first framework for low-latency cloud-device collaborative Vision Transformer inference over dynamic networks. Janus overcomes the intrinsic model limitations of ViTs and executes ViT models collaboratively on both cloud and edge devices, achieving low latency, high accuracy, and low communication overhead. Specifically, Janus judiciously combines token pruning techniques with a carefully designed fine-to-coarse model splitting policy and a non-static mixed pruning policy. It attains a balance between accuracy and latency by dynamically selecting the optimal pruning level and split point. Experimental results across various tasks demonstrate that Janus enhances throughput by up to 5.15 times and reduces latency violation ratios by up to 98.7% compared with baseline approaches under various network environments.
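An illustrative sketch of the selection loop the abstract describes: pick the (split point, pruning level) pair with the lowest estimated end-to-end latency that still meets an accuracy target under the current bandwidth. The profiling tables passed in are hypothetical inputs, not Janus internals.

```python
def select_config(bandwidth_mbps, edge_ms, cloud_ms, bytes_at, acc_at, acc_min):
    """Pick the lowest-latency (split point, pruning level) meeting an accuracy floor.

    edge_ms[s] / cloud_ms[s]: profiled compute time when splitting after layer s;
    bytes_at[(s, p)]: bytes transmitted at split s with pruning level p;
    acc_at[p]: profiled accuracy at pruning level p.
    """
    best, best_lat = None, float("inf")
    for s in range(len(edge_ms)):
        for p, acc in acc_at.items():
            if acc < acc_min:
                continue                                    # pruned too aggressively
            tx_ms = bytes_at[(s, p)] * 8 / (bandwidth_mbps * 1e3)
            lat = edge_ms[s] + tx_ms + cloud_ms[s]
            if lat < best_lat:
                best, best_lat = (s, p), lat
    return best, best_lat
```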
Reviews: Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
The authors provide a clear and succinct introduction to the problems of and approaches to biologically plausible forms of backprop in the brain. They argue for behavioural realism in addition to physiological realism, and undertake a detailed comparison of backprop versus difference target propagation and its variants (some of which they newly propose), as well as direct feedback alignment. In the end, though, they find that all of the proposed bio-plausible alternatives to backprop fall well short on complex image recognition tasks. Despite the negative results, I find such a comparison very timely for consolidating results and pushing the community to search for better and more diverse alternatives. Overall I find the work impressive. The authors claim that weight sharing is not plausible in the brain.
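For readers unfamiliar with one of the compared algorithms: direct feedback alignment replaces the transposed forward weights in the backward pass with fixed random matrices. A toy NumPy sketch of a single training step, as our own illustration rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(784, 256)) * 0.01     # input -> hidden
W2 = rng.normal(size=(256, 10)) * 0.01      # hidden -> output
B1 = rng.normal(size=(10, 256)) * 0.01      # fixed random feedback (never learned)

def dfa_step(x, y_onehot, lr=0.1):
    """One update: the output error reaches the hidden layer via B1, not W2.T."""
    global W1, W2
    h = np.tanh(x @ W1)                     # hidden activations
    e = h @ W2 - y_onehot                   # output error (linear readout)
    dh = (e @ B1) * (1.0 - h ** 2)          # error routed through random feedback
    W2 -= lr * h.T @ e
    W1 -= lr * x.T @ dh
```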
Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
Groot, Tobias, Valdenegro-Toro, Matias
Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper aims to evaluate the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have high calibration error and are overconfident most of the time, indicating poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing means/standard deviations and 95% confidence intervals.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.30)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
- Asia > Japan > Honshū > Chūgoku > Hiroshima Prefecture > Hiroshima (0.05)
- (6 more...)
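A hedged sketch of the evaluation loop the abstract describes: prompt the model to state a confidence alongside its answer, then compute a signed calibration gap (mean verbalized confidence minus accuracy). The sign convention is our reading of a "direction of miscalibration" metric; the paper's exact NCE formula may differ, and `query_model` is a hypothetical API wrapper.

```python
import re

PROMPT = "{q}\nGive your answer, then state your confidence from 0 to 100%."

def parse_confidence(reply: str) -> float:
    """Pull the last 'NN%' the model verbalized; fall back to 0.5."""
    matches = re.findall(r"(\d{1,3})\s*%", reply)
    return min(int(matches[-1]), 100) / 100 if matches else 0.5

def signed_calibration_gap(examples, query_model, is_correct):
    """Mean verbalized confidence minus accuracy; > 0 means overconfident."""
    confs, hits = [], []
    for q, gold in examples:
        reply = query_model(PROMPT.format(q=q))
        confs.append(parse_confidence(reply))
        hits.append(float(is_correct(reply, gold)))
    return sum(confs) / len(confs) - sum(hits) / len(hits)
```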
Deep Learning for Image Recognition: An Overview of Convolutional Neural Networks
Deep Learning is a subset of Machine Learning that has proven very effective at solving complex problems such as image recognition, natural language processing, and speech recognition. In this article, we will focus on the application of deep learning to image recognition, and specifically on Convolutional Neural Networks (CNNs), a type of deep learning model that has been highly successful in image recognition tasks. A CNN is a neural network designed to work with image data. It is composed of multiple layers, with the first layer typically being a convolutional layer responsible for detecting low-level features in the image such as edges and textures. Convolutional layers are typically interleaved with pooling layers, which reduce the spatial dimensionality of the feature maps and increase the robustness of the network.
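A minimal PyTorch example of the layer roles just described: convolutional layers detect features, pooling layers shrink the feature maps, and a final fully-connected layer classifies. Channel counts and the 32x32 input size are illustrative choices, not from the article.

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # detects edges/textures
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halves spatial resolution
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # classifier, assumes 32x32 inputs
)
```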