Visual Classification



Saccadic Vision for Fine-Grained Visual Classification

Schmidt, Johann, Stober, Sebastian, Denzler, Joachim, Bodesheim, Paul

arXiv.org Artificial Intelligence

Fine-grained visual classification (FGVC) requires distinguishing between visually similar categories through subtle, localized features - a task that remains challenging due to high intra-class variability and limited inter-class differences. Existing part-based methods often rely on complex localization networks that learn mappings from pixel to sample space, requiring a deep understanding of image content while limiting feature utility for downstream tasks. In addition, sampled points frequently suffer from high spatial redundancy, making it difficult to quantify the optimal number of required parts. Inspired by human saccadic vision, we propose a two-stage process that first extracts peripheral features (coarse view) and generates a sample map, from which fixation patches are sampled and encoded in parallel using a weight-shared encoder. We employ contextualized selective attention to weigh the impact of each fixation patch before fusing peripheral and focus representations. To prevent spatial collapse - a common issue in part-based methods - we utilize non-maximum suppression during fixation sampling to eliminate redundancy. Comprehensive evaluation on standard FGVC benchmarks (CUB-200-2011, NABirds, Food-101 and Stanford-Dogs) and challenging insect datasets (EU-Moths, Ecuador-Moths and AMI-Moths) demonstrates that our method achieves comparable performance to state-of-the-art approaches while consistently outperforming our baseline encoder.
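The fixation-sampling step described above - picking fixation locations from a sample map while eliminating spatially redundant neighbours via non-maximum suppression - can be sketched as a greedy loop. The function below is an illustrative sketch under assumed parameters (map shape, patch size, fixation count), not the paper's implementation.

```python
import numpy as np

def sample_fixations(sample_map, num_fixations=4, patch_size=3):
    """Greedy non-maximum suppression over a 2D sample map.

    Repeatedly picks the highest-scoring location, then zeroes out a
    patch_size x patch_size neighbourhood around it so that later
    fixations cannot land on spatially redundant nearby peaks.
    """
    m = sample_map.astype(float).copy()
    r = patch_size // 2
    fixations = []
    for _ in range(num_fixations):
        y, x = np.unravel_index(np.argmax(m), m.shape)
        if m[y, x] <= 0:  # map exhausted: fewer parts than requested
            break
        fixations.append((int(y), int(x)))
        # Suppress the neighbourhood of the chosen fixation.
        y0, y1 = max(0, y - r), min(m.shape[0], y + r + 1)
        x0, x1 = max(0, x - r), min(m.shape[1], x + r + 1)
        m[y0:y1, x0:x1] = 0.0
    return fixations
```

Because exhausted maps end the loop early, the number of returned fixations adapts to the content, which is one way the redundancy problem mentioned in the abstract can resolve itself.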


Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification

Neural Information Processing Systems

Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. We derive a logical regularization, termed L-Reg, which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract salient features, such as faces when classifying persons.


ARIA: On the interaction between Architectures, Aggregation methods and Initializations in federated visual classification

Siomos, Vasilis, Naval-Marimont, Sergio, Passerat-Palmbach, Jonathan, Tarroni, Giacomo

arXiv.org Artificial Intelligence

Federated Learning (FL) is a collaborative training paradigm that allows for privacy-preserving learning of cross-institutional models by eliminating the exchange of sensitive data and instead relying on the exchange of model parameters between the clients and a server. Despite individual studies on how client models are aggregated, and, more recently, on the benefits of ImageNet pre-training, there is a lack of understanding of the effect the architecture chosen for the federation has, and of how the aforementioned elements interconnect. It's important to note that IN pre-training restricts the input to 224x224 RGB images. When up-sampling of the original images is required to achieve that, it leads to a bigger than necessary computational and memory load, and the introduction of aliasing artifacts (e.g. Figure 1). When down-sampling is required instead, it can degrade performance. Hence, IN pre-training is not a silver bullet, and benchmarking architectures and aggregation strategies without pre-training is also important. Furthermore, task-relevant pre-training through self-supervised learning (SSL) has recently emerged as a highly-effective alternative to IN pre-training [9], but ...
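The client-to-server aggregation this abstract studies is, in its simplest standard form, Federated Averaging (FedAvg): the server averages the clients' parameters, weighting each client by its local dataset size. A minimal sketch (the paper benchmarks several aggregation strategies; this is only the baseline one):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated Averaging: aggregate client model parameters as a
    weighted mean, weighting each client by its local dataset size.
    Clients never share raw data, only these parameter arrays.

    client_weights: list of models, each a list of per-layer np arrays.
    client_sizes:   list of local dataset sizes, one per client.
    """
    total = sum(client_sizes)
    coeffs = [n / total for n in client_sizes]
    num_layers = len(client_weights[0])
    return [
        sum(c * client[layer] for c, client in zip(coeffs, client_weights))
        for layer in range(num_layers)
    ]
```

For example, with two clients holding 1 and 3 samples respectively, the second client's parameters contribute three times as strongly to each aggregated layer.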


Representing visual classification as a linear combination of words

Agarwal, Shobhit, Semenov, Yevgeniy R., Lotter, William

arXiv.org Artificial Intelligence

Explainability is a longstanding challenge in deep learning, especially in high-stakes domains like healthcare. Common explainability methods highlight image regions that drive an AI model's decision. Humans, however, heavily rely on language to convey explanations of not only "where" but "what". Additionally, most explainability approaches focus on explaining individual AI predictions, rather than describing the features used by an AI model in general. The latter would be especially useful for model and dataset auditing, and potentially even knowledge generation as AI is increasingly being used in novel tasks. Here, we present an explainability strategy that uses a vision-language model to identify language-based descriptors of a visual classification task. By leveraging a pre-trained joint embedding space between images and text, our approach estimates a new classification task as a linear combination of words, resulting in a weight for each word that indicates its alignment with the vision-based classifier. We assess our approach using two medical imaging classification tasks, where we find that the resulting descriptors largely align with clinical knowledge despite a lack of domain-specific language training. However, our approach also identifies the potential for 'shortcut connections' in the public datasets used. Towards a functional measure of explainability, we perform a pilot reader study where we find that the AI-identified words can enable non-expert humans to perform a specialized medical task at a non-trivial level. Altogether, our results emphasize the potential of using multimodal foundational models to deliver intuitive, language-based explanations of visual tasks.
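The core estimation step described above - expressing a vision-based classifier as a linear combination of word vectors in a joint image-text embedding space - can be posed as a least-squares problem. The sketch below is a hypothetical reading of that idea, not the authors' exact procedure; the word set and classifier direction are placeholders.

```python
import numpy as np

def words_for_classifier(classifier_dir, word_embeddings):
    """Approximate a classification direction in a joint embedding space
    as a linear combination of word vectors.

    classifier_dir:  (D,) direction of the trained linear classifier.
    word_embeddings: (W, D) matrix, one embedded word per row.
    Returns a weight per word; larger weight = stronger alignment
    between that word and the vision-based classifier.
    """
    # Solve min_w || E^T w - v ||^2 where rows of E are word embeddings.
    w, *_ = np.linalg.lstsq(word_embeddings.T, classifier_dir, rcond=None)
    return w
```

Sorting words by the resulting weights then yields the language-based descriptors the abstract refers to.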


Adaptive Discriminative Regularization for Visual Classification

Zhao, Qingsong, Wang, Yi, Dou, Shuguang, Gong, Chen, Wang, Yin, Zhao, Cairong

arXiv.org Artificial Intelligence

How to improve discriminative feature learning is central in classification. Existing works address this problem by explicitly increasing inter-class separability and intra-class similarity, whether by constructing positive and negative pairs for contrastive learning or posing tighter class separating margins. These methods do not exploit the similarity between different classes as they adhere to i.i.d. assumption in data. In this paper, we embrace the real-world data distribution setting that some classes share semantic overlaps due to their similar appearances or concepts. Regarding this hypothesis, we propose a novel regularization to improve discriminative learning. We first calibrate the estimated highest likelihood of one sample based on its semantically neighboring classes, then encourage the overall likelihood predictions to be deterministic by imposing an adaptive exponential penalty. As the gradient of the proposed method is roughly proportional to the uncertainty of the predicted likelihoods, we name it adaptive discriminative regularization (ADR), trained along with a standard cross entropy loss in classification. Extensive experiments demonstrate that it can yield consistent and non-trivial performance improvements in a variety of visual classification tasks (over 10 benchmarks). Furthermore, we find it is robust to long-tailed and noisy label data distribution. Its flexible design enables its compatibility with mainstream classification architectures and losses.
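The abstract gives only the shape of the regularizer: an adaptive exponential penalty on the likelihood predictions whose gradient scales roughly with predictive uncertainty. One simple term with that shape - purely an illustrative stand-in, not the paper's formulation - penalizes the exponential of (1 minus the maximum predicted probability):

```python
import numpy as np

def adaptive_penalty(probs, alpha=1.0):
    """Illustrative ADR-style term: zero when a prediction is fully
    confident, and growing exponentially with predictive uncertainty,
    so its gradient magnitude tracks how uncertain the prediction is.
    `alpha` is a made-up temperature hyperparameter.

    probs: (N, C) array of predicted class probabilities.
    Returns an (N,) array of per-sample penalties.
    """
    uncertainty = 1.0 - probs.max(axis=-1)   # 0 for one-hot predictions
    return np.exp(alpha * uncertainty) - 1.0  # zero penalty when certain
```

In the paper's setup, such a term would be added to the standard cross-entropy loss, after calibrating the top likelihood against semantically neighbouring classes.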


EEG-based Image Feature Extraction for Visual Classification using Deep Learning

Mishra, Alankrit, Raj, Nikhil, Bajwa, Garima

arXiv.org Artificial Intelligence

While capable of segregating visual data, humans take time to examine a single piece, let alone thousands or millions of samples. Deep learning models efficiently process sizeable information with the help of modern-day computing. However, their questionable decision-making process has raised considerable concerns. Recent studies have identified a new approach to extract image features from EEG signals and combine them with standard image features. These approaches make deep learning models more interpretable and also enable models to converge faster with fewer samples. Inspired by these studies, we developed an efficient way of encoding EEG signals as images to facilitate a more subtle understanding of brain signals with deep learning models. Using two variations of such encoding methods, we classified the encoded EEG signals corresponding to 39 image classes with a benchmark accuracy of 70% on the layered dataset of six subjects, which is significantly higher than that of existing work. Our image classification approach with combined EEG features achieved an accuracy of 82%, slightly below that of a pure deep learning approach; nevertheless, it demonstrates the viability of the theory.
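The abstract does not specify its two encoding variants, but a common way to turn a 1-D EEG channel into an image that a CNN can consume is a magnitude spectrogram via a short-time Fourier transform. The sketch below is a generic example of that family of encodings, with made-up window and hop sizes:

```python
import numpy as np

def eeg_to_image(signal, win=64, hop=32):
    """Encode a 1-D EEG channel as a 2-D grayscale image:
    a magnitude spectrogram from a short-time Fourier transform.
    Rows are frequency bins, columns are time frames.
    """
    # Slice the signal into overlapping, Hann-windowed frames.
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # freq x time
    # Normalize to [0, 255] so it can be saved or fed in as an image.
    spec = 255 * (spec - spec.min()) / (spec.max() - spec.min() + 1e-9)
    return spec.astype(np.uint8)
```

A pure sinusoid then lights up a single frequency row, and richer EEG activity produces the textured images the deep model is trained on.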


Improving Few-Shot Visual Classification with Unlabelled Examples

Bateni, Peyman, Barber, Jarred, van de Meent, Jan-Willem, Wood, Frank

arXiv.org Machine Learning

We propose a transductive meta-learning method that uses unlabelled instances to improve few-shot image classification performance. Our approach combines a regularized Mahalanobis-distance-based soft k-means clustering procedure with a modified state-of-the-art neural adaptive feature extractor to achieve improved test-time classification accuracy using unlabelled data. We evaluate our method on transductive few-shot learning tasks, in which the goal is to jointly predict labels for query (test) examples given a set of support (training) examples. We achieve new state-of-the-art performance on Meta-Dataset and produce competitive results on the mini- and tiered-ImageNet benchmarks. Deep learning has revolutionized visual classification, enabled in part by the development of large and diverse sets of curated training data (Szegedy et al., 2014; He et al., 2015; Krizhevsky et al., 2017; Simonyan & Zisserman, 2014; Sornam et al., 2017). However, in many image classification settings, millions of labelled examples are not available; therefore, techniques that can achieve sufficient classification performance with few labels are required. This has motivated research on few-shot learning (Feyjie et al., 2020; Wang & Yao, 2019; Wang et al., 2019; Bellet et al., 2013), which seeks to develop methods for building classifiers from much smaller datasets.
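The clustering procedure named in this abstract - soft k-means with a (regularized) Mahalanobis distance, refining class means using unlabelled query embeddings - can be sketched as follows. This is an illustrative simplification assuming a single shared inverse covariance; the paper's regularization and per-class covariance estimates are not reproduced here.

```python
import numpy as np

def soft_kmeans_mahalanobis(embeddings, means, cov_inv, n_iter=5):
    """Transductive refinement sketch: class means (initialized from
    labelled support embeddings) are iteratively updated with soft
    assignments of unlabelled query embeddings, using a shared
    Mahalanobis distance.

    embeddings: (N, D) unlabelled query embeddings.
    means:      (K, D) initial class means.
    cov_inv:    (D, D) shared inverse covariance matrix.
    """
    means = means.copy()
    for _ in range(n_iter):
        # Squared Mahalanobis distance from each point to each mean.
        diff = embeddings[:, None, :] - means[None, :, :]      # (N, K, D)
        d2 = np.einsum('nkd,de,nke->nk', diff, cov_inv, diff)  # (N, K)
        # Soft responsibilities via a softmax over negative distances.
        logits = -0.5 * d2
        logits -= logits.max(axis=1, keepdims=True)
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)                # (N, K)
        # Update means as responsibility-weighted averages.
        means = (resp.T @ embeddings) / resp.sum(axis=0)[:, None]
    return means, resp
```

Query labels are then read off as the argmax of the final responsibilities, which is how the unlabelled examples end up improving test-time accuracy.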


Text Analysis vs Visual Classification (Part One) ...

#artificialintelligence

Text Analysis ("TA") and Visual Classification ("VC") take two different approaches to classifying documents. TA uses the text associated with the documents being classified, while VC bases its analysis on graphical representations of those documents. TA is an outgrowth of tools designed to extract meaning from collections of textual content. VC was developed as an enterprise-scale information governance tool for completing document-centric initiatives like content migration, archive digitization, and silo consolidation. The different approaches and origins of TA and VC lead to major differences in awareness, comprehensiveness, transparency, repurposing work product, attribute extraction, redaction, and correcting document boundaries, among other things.


Transfer learning using neon - Nervana

#artificialintelligence

In the last few years, plenty of deep neural network (DNN) models have been made available for a variety of applications such as classification, image recognition, and speech translation. Typically, each of these models is designed for a very specific purpose but can be extended to novel use cases. For example, one can train a model to recognize numbers and characters in an image and then reuse that model to read signposts as part of a broader model or dataset used in autonomous driving. Consider the task of visual classification. Convolutional neural networks (CNNs) are organized into several layers, with each layer learning features at a different scale.
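The usual transfer-learning recipe behind this example is to freeze the pretrained convolutional layers, use them purely as a feature extractor, and train only a new classification head on the target task. A framework-agnostic sketch of that last step (softmax regression on frozen features; learning rate and epoch count are made-up defaults):

```python
import numpy as np

def train_new_head(frozen_features, labels, num_classes, lr=0.5, epochs=200):
    """Train a fresh linear classification head on features produced by
    a frozen, pretrained backbone (softmax regression via full-batch
    gradient descent on the cross-entropy loss).

    frozen_features: (N, D) outputs of the frozen feature extractor.
    labels:          (N,) integer class labels for the new task.
    Returns a (D, num_classes) weight matrix for the new head.
    """
    n, d = frozen_features.shape
    W = np.zeros((d, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = frozen_features @ W
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Cross-entropy gradient w.r.t. W; the backbone is never updated.
        W -= lr * frozen_features.T @ (probs - onehot) / n
    return W
```

Because only the small head is trained, this works with far less data and compute than retraining the whole network, which is the point of reusing a pretrained model.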