relevancy map
FastRM: An efficient and automatic explainability framework for multimodal generative models

Stan, Gabriela Ben-Melech, Aflalo, Estelle, Luo, Man, Rosenman, Shachar, Le, Tiep, Paul, Sayak, Tseng, Shao-Yen, Lal, Vasudev

arXiv.org Artificial Intelligence

While Large Vision Language Models (LVLMs) have become highly capable at reasoning over human prompts and visual inputs, they are still prone to producing responses that contain misinformation. Identifying incorrect responses that are not grounded in evidence has become a crucial task in building trustworthy AI. Explainability methods such as gradient-based relevancy maps on LVLM outputs can provide insight into the decision process of models; however, these methods are often computationally expensive and not suited for on-the-fly validation of outputs. In this work, we propose FastRM, an efficient method for predicting the explainable relevancy maps of LVLMs. Experimental results show that employing FastRM leads to a 99.8% reduction in compute time for relevancy map generation and a 44.4% reduction in memory footprint for the evaluated LVLM, making explainable AI more efficient and practical and thereby facilitating its deployment in real-world applications.
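The gradient-based relevancy maps that FastRM learns to approximate are typically computed by weighting attention with its gradient and propagating the result across layers. A minimal NumPy sketch of that aggregation (in the style of attention-relevancy propagation; the shapes and the `relevancy_map` helper are illustrative, not FastRM's actual code):

```python
import numpy as np

def relevancy_map(attentions, gradients):
    """Gradient-weighted relevancy aggregation (illustrative sketch).

    attentions, gradients: lists of arrays of shape (heads, tokens, tokens),
    one pair per transformer layer. Returns a (tokens, tokens) relevancy
    matrix, where row i scores how relevant each token is to token i.
    """
    tokens = attentions[0].shape[-1]
    R = np.eye(tokens)  # identity: each token starts relevant to itself
    for A, G in zip(attentions, gradients):
        # weight attention by its gradient, keep positive contributions,
        # and average over heads
        cam = np.clip(A * G, 0, None).mean(axis=0)
        R = R + cam @ R  # propagate relevance through the layer
    return R

# toy example: 2 layers, 2 heads, 4 tokens
rng = np.random.default_rng(0)
atts = [rng.random((2, 4, 4)) for _ in range(2)]
grads = [rng.standard_normal((2, 4, 4)) for _ in range(2)]
R = relevancy_map(atts, grads)
```

The expensive part in practice is the backward pass needed to obtain `gradients` for every generated token, which is what motivates predicting the map directly instead.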


From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models

Pulli, Tessa, Thalhammer, Stefan, Schwaiger, Simon, Vincze, Markus

arXiv.org Artificial Intelligence

Robots are increasingly envisioned to interact in real-world scenarios, where they must continuously adapt to new situations. To detect and grasp novel objects, zero-shot pose estimators determine poses without prior knowledge. Recently, vision language models (VLMs) have shown considerable advances in robotics applications by establishing an understanding between language input and image input. In our work, we take advantage of VLMs' zero-shot capabilities and translate this ability to 6D object pose estimation. We propose a novel framework for promptable zero-shot 6D object pose estimation using language embeddings. The idea is to derive a coarse location of an object based on the relevancy map of a language-embedded NeRF reconstruction and to compute the pose estimate with a point cloud registration method. Additionally, we provide an analysis of LERF's suitability for open-set object pose estimation. We examine hyperparameters, such as activation thresholds for relevancy maps, and investigate the zero-shot capabilities at the instance and category level. Furthermore, we plan to conduct robotic grasping experiments in a real-world setting.
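The coarse-localization step described above can be sketched as thresholding per-point relevancy scores and taking the centroid of the surviving points. `coarse_location` is a hypothetical helper; in the paper the scores come from a LERF-style language-embedded NeRF and the centroid seeds a point cloud registration, neither of which is reproduced here:

```python
import numpy as np

def coarse_location(points, relevancy, threshold=0.5):
    """Coarse object localization sketch.

    points: (N, 3) array of 3D positions from a reconstruction.
    relevancy: (N,) language-relevancy scores in [0, 1] for one prompt.
    Returns the centroid of points above the activation threshold, or
    None if no point activates for the prompt.
    """
    mask = relevancy > threshold
    if not mask.any():
        return None
    return points[mask].mean(axis=0)

# toy example: two relevant points near the origin, one distractor
pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]])
scores = np.array([0.9, 0.8, 0.1])
center = coarse_location(pts, scores)
```

The activation threshold examined in the paper corresponds to `threshold` here: too low and distractor geometry pulls the centroid away, too high and no points survive.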


Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking

Beňová, Ivana, Košecká, Jana, Gregor, Michal, Tamajka, Martin, Veselý, Marcel, Šimko, Marián

arXiv.org Artificial Intelligence

The dominant probing approaches rely on the zero-shot performance of image-text matching tasks to gain a finer-grained understanding of the representations learned by recent multimodal image-language transformer models. The evaluation is carried out on carefully curated datasets focusing on counting, relations, attributes, and others. This work introduces an alternative probing strategy called guided masking. The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy. We focus on studying multimodal models that consider regions of interest (ROI) features obtained by object detectors as input tokens. We probe the understanding of verbs using guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT and show that these models can predict the correct verb with high accuracy.

[Figure 1: Image from the SVO-Probes dataset (Hendricks and Nematzadeh, 2021). It consists of image-caption pairs, where the sentence either correctly describes the image (positive example) or one aspect of the sentence (subject, verb, or object) does not match the image (negative example). These pairs are used to probe models through zero-shot image-text matching. Example of a positive caption: "A person walking on a trail."]
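A guided-masking probe of the kind described can be sketched as: mask the verb in each caption, ask the model for its top predictions at the masked position, and score whether the original verb is recovered. `predict_topk` is a hypothetical stand-in for a multimodal masked-LM head (the paper probes ViLBERT, LXMERT, UNITER, and VisualBERT); the toy predictor below exists only to make the sketch runnable:

```python
def guided_masking_accuracy(examples, predict_topk):
    """Guided-masking probe sketch.

    examples: list of (tokens, verb_index) pairs, where tokens is the
    caption as a list of words and verb_index marks the verb to mask.
    predict_topk(tokens, mask_index): hypothetical model interface that
    returns the top-k candidate words for the masked position (given
    the image features as well, in the real multimodal setting).
    Returns the fraction of captions whose verb is recovered.
    """
    hits = 0
    for tokens, verb_index in examples:
        masked = list(tokens)
        target = masked[verb_index]
        masked[verb_index] = "[MASK]"  # ablate only the verb token
        if target in predict_topk(masked, verb_index):
            hits += 1
    return hits / len(examples)

# toy predictor standing in for a multimodal masked-LM head
def predict_topk(tokens, mask_index):
    return ["walking", "running"]

examples = [
    (["A", "person", "walking", "on", "a", "trail"], 2),
    (["A", "dog", "eats", "food"], 2),
]
acc = guided_masking_accuracy(examples, predict_topk)
```

Unlike zero-shot image-text matching, this probe asks the model to reconstruct the verb rather than merely to rank a mismatched caption lower, which is why it can expose finer-grained verb understanding.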


Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models

Ha, Huy, Song, Shuran

arXiv.org Artificial Intelligence

We study open-world 3D scene understanding, a family of tasks that require agents to reason about their 3D environment with an open-set vocabulary and out-of-domain visual inputs - a critical skill for robots operating in the unstructured 3D world. Towards this end, we propose Semantic Abstraction (SemAbs), a framework that equips 2D Vision-Language Models (VLMs) with new 3D spatial capabilities, while maintaining their zero-shot robustness. We achieve this abstraction using relevancy maps extracted from CLIP, and learn 3D spatial and geometric reasoning skills on top of those abstractions in a semantic-agnostic manner. We demonstrate the usefulness of SemAbs on two open-world 3D scene understanding tasks: 1) completing partially observed objects and 2) localizing hidden objects from language descriptions. Experiments show that SemAbs can generalize to novel vocabulary, materials/lighting, classes, and domains (i.e., real-world scans) after training on limited 3D synthetic data. Code and data are available at https://semantic-abstraction.cs.columbia.edu/
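The semantic-agnostic abstraction idea - handing the 3D module only geometry plus a relevancy channel, never raw RGB semantics - can be sketched as back-projecting a 2D CLIP relevancy map into a voxel grid using depth. This assumes a simple pinhole camera; the function and its parameter names are illustrative, not SemAbs's actual pipeline:

```python
import numpy as np

def lift_relevancy(relevancy, depth, fx, fy, cx, cy, voxel_size, grid_shape):
    """Back-project a 2D relevancy map into a 3D voxel grid (sketch).

    relevancy, depth: (H, W) arrays; fx, fy, cx, cy: pinhole intrinsics.
    Each pixel with valid depth is unprojected to a 3D point and the
    voxel it lands in keeps the maximum relevancy seen so far, so the
    downstream 3D reasoner sees occupancy + relevancy only.
    """
    H, W = depth.shape
    grid = np.zeros(grid_shape)
    bounds = np.array(grid_shape)
    for v in range(H):
        for u in range(W):
            z = depth[v, u]
            if z <= 0:  # skip invalid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            idx = (np.array([x, y, z]) / voxel_size).astype(int)
            if ((0 <= idx) & (idx < bounds)).all():
                grid[tuple(idx)] = max(grid[tuple(idx)], relevancy[v, u])
    return grid

# toy example: 2x2 image at unit depth, unit-focal camera
depth = np.ones((2, 2))
rel = np.array([[0.2, 0.7], [0.0, 0.5]])
grid = lift_relevancy(rel, depth, fx=1, fy=1, cx=0, cy=0,
                      voxel_size=1.0, grid_shape=(4, 4, 4))
```

Because the 3D module never sees which word produced the relevancy, the same learned reasoner transfers to novel vocabulary - the generalization property the experiments above report.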


Natural Numerical Networks for Natura 2000 habitats classification by satellite images

Mikula, Karol, Kollar, Michal, Ozvat, Aneta A., Ambroz, Martin, Cahojova, Lucia, Jarolimek, Ivan, Sibik, Jozef, Sibikova, Maria

arXiv.org Artificial Intelligence

Natural numerical networks are introduced as a new classification algorithm based on the numerical solution of nonlinear partial differential equations of forward-backward diffusion type on complete graphs. The proposed natural numerical network is applied to an important open environmental and nature conservation task: the automated identification of protected habitats from satellite images. In the natural numerical network, the forward diffusion causes the movement of points in a feature space toward each other. The opposite effect, keeping the points away from each other, is caused by backward diffusion. This yields the desired classification. The natural numerical network contains a few parameters that are optimized in the learning phase of the method. After learning the parameters and optimizing the topology of the network graph, the classification necessary for habitat identification is performed. A relevancy map for each habitat is introduced as a tool for validating the classification and finding new appearances of Natura 2000 habitats.
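The forward-backward diffusion dynamics can be sketched as one explicit Euler step on a complete graph of feature points: a positive diffusion coefficient pulls same-class points together, a negative one pushes different-class points apart. This is an illustrative toy with a fixed coefficient, not the paper's actual nonlinear scheme or its learned parameters:

```python
import numpy as np

def diffusion_step(X, same_class, dt=0.1, eps=1.0):
    """One explicit Euler step of forward-backward diffusion (sketch).

    X: (N, d) feature-space points; same_class: (N, N) boolean matrix.
    Forward diffusion (coefficient +eps) moves same-class points toward
    each other; backward diffusion (coefficient -eps) drives points of
    different classes apart.
    """
    X_new = X.copy()
    for i in range(len(X)):
        for j in range(len(X)):
            if i == j:
                continue
            coeff = eps if same_class[i, j] else -eps
            # explicit scheme: read old positions X, write into X_new
            X_new[i] += dt * coeff * (X[j] - X[i])
    return X_new

# toy example: two 1-D points of the same class contract toward each other
X = np.array([[0.0], [1.0]])
same = np.ones((2, 2), dtype=bool)
X_forward = diffusion_step(X, same)

# the same two points, treated as different classes, repel
diff = np.eye(2, dtype=bool)
X_backward = diffusion_step(X, diff)
```

Iterating such steps clusters same-class points and separates clusters, which is what yields the classification described above.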