Goto

Collaborating Authors

 image crop


MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

arXiv.org Artificial Intelligence

Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window approach.Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.


Near-Infrared Hyperspectral Imaging Applications in Food Analysis -- Improving Algorithms and Methodologies

arXiv.org Artificial Intelligence

This thesis investigates the application of near-infrared hyperspectral imaging (NIR-HSI) for food quality analysis. The investigation is conducted through four studies operating with five research hypotheses. For several analyses, the studies compare models based on convolutional neural networks (CNNs) and partial least squares (PLS). Generally, joint spatio-spectral analysis with CNNs outperforms spatial analysis with CNNs and spectral analysis with PLS when modeling parameters where chemical and physical visual information are relevant. When modeling chemical parameters with a 2-dimensional (2D) CNN, augmenting the CNN with an initial layer dedicated to performing spectral convolution enhances its predictive performance by learning a spectral preprocessing similar to that applied by domain experts. Still, PLS-based spectral modeling performs equally well for analysis of the mean content of chemical parameters in samples and is the recommended approach. Modeling the spatial distribution of chemical parameters with NIR-HSI is limited by the ability to obtain spatially resolved reference values. Therefore, a study used bulk mean references for chemical map generation of fat content in pork bellies. A PLS-based approach gave non-smooth chemical maps and pixel-wise predictions outside the range of 0-100\%. Conversely, a 2D CNN augmented with a spectral convolution layer mitigated all issues arising with PLS. The final study attempted to model barley's germinative capacity by analyzing NIR spectra, RGB images, and NIR-HSI images. However, the results were inconclusive due to the dataset's low degree of germination. Additionally, this thesis has led to the development of two open-sourced Python packages. The first facilitates fast PLS-based modeling, while the second facilitates very fast cross-validation of PLS and other classical machine learning models with a new algorithm.


Adding simple structure at inference improves Vision-Language Compositionality

arXiv.org Artificial Intelligence

Dual encoder Vision-Language Models (VLM) such as CLIP are widely used for image-text retrieval tasks. However, those models struggle with compositionality, showing a bag-of-words-like behavior that limits their retrieval performance. Many different training approaches have been proposed to improve the vision-language compositionality capabilities of those models. In comparison, inference-time techniques have received little attention. In this paper, we propose to add simple structure at inference, where, given an image and a caption: i) we divide the image into different smaller crops, ii) we extract text segments, capturing objects, attributes and relations, iii) using a VLM, we find the image crops that better align with text segments obtaining matches, and iv) we compute the final image-text similarity aggregating the individual similarities of the matches. Based on various popular dual encoder VLMs, we evaluate our approach in controlled and natural datasets for VL compositionality. We find that our approach consistently improves the performance of evaluated VLMs without any training, which shows the potential of inference-time techniques. The results are especially good for attribute-object binding as shown in the controlled dataset. As a result of an extensive analysis: i) we show that processing image crops is actually essential for the observed gains in performance, and ii) we identify specific areas to further improve inference-time approaches.


Corr2Distrib: Making Ambiguous Correspondences an Ally to Predict Reliable 6D Pose Distributions

arXiv.org Artificial Intelligence

--We introduce Corr2Distrib, the first correspondence-based method which estimates a 6D camera pose distribution from an RGB image, explaining the observations. Indeed, symmetries and occlusions introduce visual ambiguities, leading to multiple valid poses. While a few recent methods tackle this problem, they do not rely on local correspondences which, according to the BOP Challenge, are currently the most effective way to estimate a single 6DoF pose solution. Using correspondences to estimate a pose distribution is not straightforward, since ambiguous correspondences induced by visual ambiguities drastically decrease the performance of PnP . With Corr2Distrib, we turn these ambiguities into an advantage to recover all valid poses. Corr2Distrib first learns a symmetry-aware representation for each 3D point on the object's surface, characterized by a descriptor and a local frame. This representation enables the generation of 3DoF rotation hypotheses from single 2D-3D correspondences. Next, we refine these hypotheses into a 6DoF pose distribution using PnP and pose scoring. Our experimental evaluations on complex non-synthetic scenes show that Corr2Distrib outperforms state-of-the-art solutions for both pose distribution estimation and single pose estimation from an RGB image, demonstrating the potential of correspondences-based approaches.


Retrieval-Augmented Perception: High-Resolution Image Perception Meets Visual RAG

arXiv.org Artificial Intelligence

High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs). To overcome the limitations of existing methods, this paper shifts away from prior dedicated heuristic approaches and revisits the most fundamental idea to HR perception by enhancing the long-context capability of MLLMs, driven by recent advances in long-context techniques like retrieval-augmented generation (RAG) for general LLMs. Towards this end, this paper presents the first study exploring the use of RAG to address HR perception challenges. Specifically, we propose Retrieval-Augmented Perception (RAP), a training-free framework that retrieves and fuses relevant image crops while preserving spatial context using the proposed Spatial-Awareness Layout. To accommodate different tasks, the proposed Retrieved-Exploration Search (RE-Search) dynamically selects the optimal number of crops based on model confidence and retrieval scores. Experimental results on HR benchmarks demonstrate the significant effectiveness of RAP, with LLaVA-v1.5-13B achieving a 43% improvement on $V^*$ Bench and 19% on HR-Bench.


Reviews: PasteGAN: A Semi-Parametric Method to Generate Image from Scene Graph

Neural Information Processing Systems

Limited novelty: The proposed approach is closely related to two lines of related work: 1) sg2im [4] which generates images from scene graph representations, and 2) semi-parametric image synthesis [3], which leverages semantic layouts and training images to generate novel images. The key difference to sg2im is the use of image crops in order to perform semi-parametric synthesis; however, in comparison to prior work on semi-parametric methods [3], as suggested by the authors (Line 82-83) the primary difference is the use of graph convolution architecture, where a similar graph convolution method has been introduced in [4]. I'd like to see more justifications from the authors regarding the technical novelty of this approach in presence of these two lines of work. Limited resolution: My concern about the limited novelty is exacerbated by the fact that the generated images are still in low-resolution (64x64) as prior work [4], even though high-resolution image crops are used to aid the image generation process. In contrast, related work [3] is able to generate images of much higher resolutions, e.g., 512x1024, using their semi-parametric method (which was not compared in the experiment).


Automated Interpretation of Blood Culture Gram Stains using a Deep Convolutional Neural Network

#artificialintelligence

Microscopic interpretation of stained smears is one of the most operator-dependent and time intensive activities in the clinical microbiology laboratory. Here, we investigated application of an automated image acquisition and convolutional neural network (CNN)-based approach for automated Gram stain classification. Using an automated microscopy platform, uncoverslipped slides were scanned with a 40x dry objective, generating images of sufficient resolution for interpretation. We collected 25,488 images from positive blood culture Gram stains prepared during routine clinical workup. These images were used to generate 100,213 crops containing Gram-positive cocci in clusters, Gram-positive cocci in chains/pairs, Gram-negative rods, or background (no cells).


Must Know Tips for Deep Learning Neural Networks, Part 1

@machinelearnbot

Deep Neural Networks, especially Convolutional Neural Networks (CNN), allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-arts in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics. In addition, many solid papers have been published in this topic, and some high quality open source CNN software packages have been made available. There are also well-written CNN tutorials or CNN software manuals. However, it might lack a recent and comprehensive summary about the details of how to implement an excellent deep convolutional neural networks from scratch.


Must Know Tips for Deep Learning Neural Networks, Part 1

@machinelearnbot

Deep Neural Networks, especially Convolutional Neural Networks (CNN), allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-arts in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics. In addition, many solid papers have been published in this topic, and some high quality open source CNN software packages have been made available. There are also well-written CNN tutorials or CNN software manuals. However, it might lack a recent and comprehensive summary about the details of how to implement an excellent deep convolutional neural networks from scratch.