 Image Matching


OpenCLIP for Image Search and Automatic Captioning

#artificialintelligence

I have been using and writing about OpenAI's CLIP system since it came out in 2021 [1]. It consists of image and text encoding models that can be used for various forms of cross-modal comparison, like using a text query to find the best matching image in a library quickly. In December 2022, an independent group of researchers known as LAION released a paper called "Reproducible scaling laws for contrastive language-image learning" [2] that describes how they first reimplemented and trained a model similar to CLIP and then experimented with improving the system by training with a larger dataset and using new ML techniques. They call their new model OpenCLIP. In this article, I will provide some background info on the original CLIP, describe how LAION improved the model, and show some results from my experiments with the two systems using images from the Library of Congress's Flickr photostream.
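The text-to-image lookup described above can be sketched as a nearest-neighbor search over embeddings. This is a minimal illustration under stated assumptions, not CLIP or OpenCLIP code: the three-dimensional vectors are made-up placeholders standing in for encoder outputs, and the names `normalize` and `rank_images` are hypothetical helpers introduced here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_images(text_embedding, image_embeddings):
    """Return (image_index, cosine_similarity) pairs, best match first."""
    query = normalize(text_embedding)
    scored = []
    for index, embedding in enumerate(image_embeddings):
        candidate = normalize(embedding)
        similarity = sum(a * b for a, b in zip(query, candidate))
        scored.append((index, similarity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder vectors (assumption): real embeddings would come from the
# image encoder and the text encoder and have hundreds of dimensions.
toy_images = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.1, 0.95]]
toy_query = [0.05, 0.9, 0.3]

ranking = rank_images(toy_query, toy_images)
best_index = ranking[0][0]
```

In a real library the image embeddings would typically be computed once and cached, so that each new text query costs only one encoder pass plus the similarity search.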


Privacy Preserving Image Registration

arXiv.org Artificial Intelligence

Image registration is a key task in medical imaging applications, allowing medical images to be represented in a common spatial reference frame. Current approaches to image registration generally assume that the content of the images is accessible in clear form, from which the spatial transformation is subsequently estimated. This common assumption may not hold in practical applications, since the sensitive nature of medical images may ultimately require their analysis under privacy constraints that prevent the image content from being shared openly. In this work, we formulate the problem of image registration under a privacy preserving regime, where images are assumed to be confidential and cannot be disclosed in the clear. We derive our privacy preserving image registration framework by extending classical registration paradigms to account for advanced cryptographic tools, such as secure multi-party computation and homomorphic encryption, that enable the execution of operations without leaking the underlying data. To overcome the problem of performance and scalability of cryptographic tools in high dimensions, we propose several techniques to optimize the image registration operations by using gradient approximations, and by revisiting the use of homomorphic encryption through packing, to allow the efficient encryption and multiplication of large matrices. We demonstrate our privacy preserving framework in linear and non-linear registration problems, evaluating its accuracy and scalability with respect to standard, non-private counterparts. Our results show that privacy preserving image registration is feasible and can be adopted in sensitive medical imaging applications.


Iterative Patch Selection for High-Resolution Image Recognition

arXiv.org Artificial Intelligence

High-resolution images are prevalent in various applications, such as autonomous driving and computer-aided diagnosis. However, training neural networks on such images is computationally challenging and easily leads to out-of-memory errors even on modern GPUs. We propose a simple method, Iterative Patch Selection (IPS), which decouples the memory usage from the input size and thus enables the processing of arbitrarily large images under tight hardware constraints. IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition. For both patch selection and aggregation, a cross-attention based transformer is introduced, which exhibits a close connection to Multiple Instance Learning. Our method demonstrates strong performance and has wide applicability across different domains, training regimes and image sizes while using minimal accelerator memory. For example, we are able to finetune our model on whole-slide images consisting of up to 250k patches (>16 gigapixels) with only 5 GB of GPU VRAM at a batch size of 16. Image recognition has made great strides in recent years, spawning landmark architectures such as AlexNet (Krizhevsky et al., 2012) or ResNet (He et al., 2016). These networks are typically designed and optimized for datasets like ImageNet (Russakovsky et al., 2015), which consist of natural images well below one megapixel. In contrast, real-world applications often rely on high-resolution images that reveal detailed information about an object of interest. For example, in self-driving cars, megapixel images are beneficial to recognize distant traffic signs far in advance and react in time (Sahin, 2019). In medical imaging, a pathology diagnosis system has to process gigapixel microscope slides to recognize cancer cells, as illustrated in Figure 1.
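One way to picture how patch selection can keep memory independent of image size is a streaming top-M filter: patches are scored one chunk at a time and only the best M survive. The sketch below is an illustration under stated assumptions, not the authors' implementation. In particular, `score_fn` is a hypothetical stand-in that scores each patch independently, whereas the abstract describes a cross-attention based transformer doing the scoring; `select_salient_patches`, `memory_size`, and `chunk_size` are names introduced here.

```python
def select_salient_patches(patches, score_fn, memory_size, chunk_size):
    """Return the indices of the `memory_size` highest-scoring patches.

    At most `memory_size + chunk_size` candidates are held at any time,
    so peak memory does not grow with the total number of patches.
    """
    if memory_size <= 0 or chunk_size <= 0:
        raise ValueError("memory_size and chunk_size must be positive")

    buffer = []  # (score, patch_index) pairs for the best patches seen so far
    for start in range(0, len(patches), chunk_size):
        chunk = patches[start:start + chunk_size]
        scored = [(score_fn(patch), start + k) for k, patch in enumerate(chunk)]
        # Merge the new chunk with the buffer and keep only the top entries.
        merged = sorted(buffer + scored, key=lambda pair: pair[0], reverse=True)
        buffer = merged[:memory_size]
    return sorted(index for _, index in buffer)


# Toy data (assumption): each "patch" is a single number that doubles as
# its saliency score. Real patches would be image tensors.
toy_patches = [5, 1, 9, 3, 7, 2, 8, 6]
selected = select_salient_patches(toy_patches, score_fn=float,
                                  memory_size=3, chunk_size=4)
```

The selected patches would then be passed to an aggregation module to form the global representation used for recognition.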


Can representation learning for multimodal image registration be improved by supervision of intermediate layers?

arXiv.org Artificial Intelligence

Multimodal imaging and correlative analysis typically require image alignment. Contrastive learning can generate representations of multimodal images, reducing the challenging task of multimodal image registration to a monomodal one. Previously, additional supervision on intermediate layers in contrastive learning has improved biomedical image classification. We evaluate if a similar approach improves representations learned for registration to boost registration performance. We explore three approaches to add contrastive supervision to the latent features of the bottleneck layer in the U-Nets encoding the multimodal images and evaluate three different critic functions. Our results show that representations learned without additional supervision on latent features perform best in the downstream task of registration on two public biomedical datasets. We investigate the performance drop by exploiting recent insights in contrastive learning in classification and self-supervised learning. We visualize the spatial relations of the learned representations by means of multidimensional scaling, and show that additional supervision on the bottleneck layer can lead to partial dimensional collapse of the intermediate embedding space.


Video4MRI: An Empirical Study on Brain Magnetic Resonance Image Analytics with CNN-based Video Classification Frameworks

arXiv.org Artificial Intelligence

To address the problem of medical image recognition, computer vision techniques like convolutional neural networks (CNN) are frequently used. 3D CNN-based models have recently come to dominate the field of magnetic resonance image (MRI) analytics. Due to the high similarity between MRI data and videos, we conduct extensive empirical studies on video recognition techniques for MRI classification to answer three questions: (1) can we directly use video recognition models for MRI classification, (2) which model is more appropriate for MRI, and (3) are common tricks in video recognition, such as data augmentation, still useful for MRI classification? Our work suggests that advanced video techniques benefit MRI classification. In this paper, four datasets of Alzheimer's and Parkinson's disease recognition are utilized in experiments, together with three alternative video recognition models and data augmentation techniques that are frequently applied to video tasks. In terms of efficiency, the results reveal that the video framework performs better than 3D-CNN models by 5% to 11% with 50% to 66% fewer trainable parameters. This report pushes forward the potential fusion of 3D medical imaging and video understanding research.


Google will blur explicit images in search by default

Engadget

Today is Safer Internet Day and Google is marking the occasion by revealing features designed to, well, make it safer to do things on the internet. The company says that, in the coming months, it will blur explicit images in search results for all users as a default setting, even if they don't have SafeSearch switched on. SafeSearch filtering is already the default for signed-in users under the age of 18. You'll be able to adjust the settings if you don't have a supervised account or you're signed out and you'd prefer to see butts and stuff in search results (the filter is designed to blur violent images as well). According to screenshots that Google shared, the blur setting will mask explicit images, but not text or links. The filter setting covers up all three. Meanwhile, Google is adding another layer of protection to the built-in password manager on Chrome and Android.


Delving Deep into Simplicity Bias for Long-Tailed Image Recognition

arXiv.org Artificial Intelligence

Simplicity Bias (SB) is the phenomenon in which deep neural networks tend to rely on simpler predictive patterns and ignore some complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find that the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically report that self-supervised learning (SSL) can mitigate SB and complement its supervised counterpart by enriching the features extracted from tail samples and consequently taking better advantage of such rare samples. However, standard SSL methods are designed without explicitly considering the inherent data distribution in terms of classes and may not be optimal for long-tailed distributed data. To address this limitation, we propose a novel SSL method tailored to imbalanced data. It applies SSL at three diverse levels, i.e., holistic, partial, and augmented, to enhance the learning of predictive complex patterns, which provides the potential to overcome the severe SB on tail data. Both quantitative and qualitative experimental results on five long-tailed benchmark datasets show that our method can effectively mitigate SB and significantly outperform competing state-of-the-art methods.


Recurrence With Correlation Network for Medical Image Registration

arXiv.org Artificial Intelligence

We present Recurrence with Correlation Network (RWCNet), a medical image registration network with multi-scale features and a cost volume layer. We demonstrate that these architectural features improve medical image registration accuracy in two image registration datasets prepared for the MICCAI 2022 Learn2Reg Workshop Challenge. On the large-displacement National Lung Screening Trial (NLST) dataset, RWCNet is able to achieve a target registration error (TRE) of 2.11 mm between corresponding keypoints without instance fine-tuning. On the OASIS brain MRI dataset, RWCNet is able to achieve an average Dice overlap of 81.7% for 35 different anatomical labels. It outperforms another multi-scale network, the Laplacian Image Registration Network (LapIRN), on both datasets. Ablation experiments are performed to highlight the contribution of the various architectural features. While multi-scale features improved validation accuracy for both datasets, the cost volume layer and number of recurrent steps only improved performance on the large-displacement NLST dataset. This result suggests that the cost volume layer and iterative refinement using an RNN provide good support for optimization and generalization in large-displacement medical image registration. The code for RWCNet is available at https://github.com/vigsivan/optimization-based-registration.
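The two figures quoted above, a keypoint registration error in millimetres and a Dice overlap, can be computed as follows. This is a minimal sketch using the standard textbook definitions, not the challenge's evaluation code: `dice_overlap` and `mean_keypoint_error` are hypothetical names, masks are flat sequences of 0/1 labels, and returning 1.0 for two empty masks is a convention assumed here.

```python
import math


def dice_overlap(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / (size_a + size_b)


def mean_keypoint_error(fixed_points, warped_points):
    """Mean Euclidean distance between corresponding keypoints."""
    if len(fixed_points) != len(warped_points) or not fixed_points:
        raise ValueError("need two non-empty keypoint lists of equal length")
    distances = [math.dist(p, q) for p, q in zip(fixed_points, warped_points)]
    return sum(distances) / len(distances)


# Toy values (assumption): a 4-voxel label mask and two 3D keypoints in mm.
toy_dice = dice_overlap([1, 1, 0, 0], [1, 0, 0, 0])
toy_error = mean_keypoint_error([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
                                [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)])
```

An average over several anatomical labels, as reported for OASIS, would apply `dice_overlap` once per label and take the mean of the results.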


Self-supervised Multi-view Disentanglement for Expansion of Visual Collections

arXiv.org Artificial Intelligence

Image search engines enable the retrieval of images relevant to a query image. In this work, we consider the setting where a query for similar images is derived from a collection of images. For visual search, the similarity measurements may be made along multiple axes, or views, such as style and color. We assume access to a set of feature extractors, each of which computes representations for a specific view. Our objective is to design a retrieval algorithm that effectively combines similarities computed over representations from multiple views. To this end, we propose a self-supervised learning method for extracting disentangled view-specific representations for images such that the inter-view overlap is minimized. We show how this allows us to compute the intent of a collection as a distribution over views. We show how effective retrieval can be performed by prioritizing candidate expansion images that match the intent of a query collection. Finally, we present a new querying mechanism for image search enabled by composing multiple collections and perform retrieval under this setting using the techniques presented in this paper.


Review -- Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

#artificialintelligence

Interaction with all the other white tokens is achieved once sMLP is executed twice. The sMLP module consists of three branches: two of them mix information along the horizontal and vertical directions respectively, and the third is an identity mapping. The outputs of the three branches are concatenated and processed by a pointwise convolution to obtain the final output. MLP-Mixer cannot afford a high-resolution input or pyramid processing, because its computational complexity grows with N². In contrast, the computational complexity of the proposed sMLP grows with N√N.
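The three-branch structure described above can be sketched in plain Python on a small H x W x C grid of tokens. This is an illustration under stated assumptions rather than the paper's implementation: each directional branch is a single bias-free linear map shared across channels, normalization is omitted, and the names `smlp_block`, `w_h`, `w_w`, and `w_fuse` are hypothetical. The toy run checks the behavior stated in the text: one pass spreads a token's information along its row and column only, and a second pass reaches every position.

```python
def smlp_block(x, w_h, w_w, w_fuse):
    """Apply one sparse-MLP style mixing step.

    x:      H x W x C nested lists of token features
    w_h:    H x H weights for the vertical branch (mixing within a column)
    w_w:    W x W weights for the horizontal branch (mixing within a row)
    w_fuse: C x 3C weights of the pointwise fusion after concatenation
    """
    H, W, C = len(x), len(x[0]), len(x[0][0])
    out = []
    for i in range(H):
        row = []
        for j in range(W):
            # Horizontal branch: token (i, j) sees only tokens in row i.
            horiz = [sum(w_w[j][k] * x[i][k][c] for k in range(W))
                     for c in range(C)]
            # Vertical branch: token (i, j) sees only tokens in column j.
            vert = [sum(w_h[i][k] * x[k][j][c] for k in range(H))
                    for c in range(C)]
            # Identity branch, then channel-wise concatenation (3C values).
            fused_in = horiz + vert + x[i][j]
            # Pointwise convolution: a per-token linear map from 3C to C.
            row.append([sum(w_fuse[o][k] * fused_in[k] for k in range(3 * C))
                        for o in range(C)])
        out.append(row)
    return out


# Toy setup (assumption): 3 x 3 grid, one channel, all-ones weights, and a
# single non-zero token in the center.
H = W = 3
ones_h = [[1.0] * H for _ in range(H)]
ones_w = [[1.0] * W for _ in range(W)]
fuse = [[1.0, 1.0, 1.0]]
grid = [[[0.0] for _ in range(W)] for _ in range(H)]
grid[1][1] = [1.0]

once = smlp_block(grid, ones_h, ones_w, fuse)
twice = smlp_block(once, ones_h, ones_w, fuse)
```

Each token touches W + H others per pass instead of all N = HW tokens, which for a square grid is on the order of √N per token and N√N in total, in line with the complexity comparison above.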