Ferrari, Vittorio
HAMMR: HierArchical MultiModal React agents for generic VQA
Castrejon, Lluis, Mensink, Thomas, Zhou, Howard, Ferrari, Vittorio, Araujo, Andre, Uijlings, Jasper
Combining Large Language Models (LLMs) with external specialized tools (LLMs+tools) is a recent paradigm to solve multimodal tasks such as Visual Question Answering (VQA). While this approach was demonstrated to work well when optimized and evaluated for each individual benchmark, in practice it is crucial for the next generation of real-world AI systems to handle a broad range of multimodal problems. Therefore we pose the VQA problem from a unified perspective and evaluate a single system on a varied suite of VQA tasks including counting, spatial reasoning, OCR-based reasoning, visual pointing, external knowledge, and more. In this setting, we demonstrate that naively applying the LLM+tools approach using the combined set of all tools leads to poor results. This motivates us to introduce HAMMR: HierArchical Multi-Modal React. We start from a multimodal ReAct-based [55] system and make it hierarchical by enabling our HAMMR agents to call upon other specialized agents. This enhances the compositionality of the LLM+tools approach, which we show to be critical for obtaining high accuracy on generic VQA. Concretely, on our generic VQA suite, HAMMR outperforms the naive LLM+tools approach by 19.5%. Additionally, HAMMR achieves state-of-the-art results on this task, outperforming the generic standalone PaLI-X VQA model [10] by 5.0%.
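The hierarchical structure can be pictured as a standard ReAct loop in which some of the registered tools are themselves agents. The sketch below is a minimal reading of the abstract only; the function names, the toy LLM, and the prompt/tool handling are illustrative placeholders, not the HAMMR implementation.

```python
# Minimal sketch of a hierarchical ReAct-style loop. All names (run_react,
# make_agent, toy_llm, detect_text) are illustrative placeholders.
import re

def run_react(llm, question, tools, max_steps=8):
    """Alternate LLM 'Thought/Action' steps with tool observations."""
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace)                      # e.g. "Action: ocr_agent[the sign]"
        trace += step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            break
        name, arg = match.groups()
        if name == "finish":
            return arg
        trace += f"Observation: {tools[name](arg)}\n"  # a plain tool OR a nested agent
    return None

def make_agent(llm, tools):
    # An agent is just a question -> answer callable, so it can itself be a tool.
    return lambda q: run_react(llm, q, tools)

# Placeholder components so the sketch runs end-to-end.
def toy_llm(trace):
    return "Action: finish[demo answer]"

def detect_text(region):
    return "STOP"

ocr_agent = make_agent(toy_llm, {"detect_text": detect_text})  # specialized agent
top_agent = make_agent(toy_llm, {"ocr_agent": ocr_agent})      # generic VQA agent
print(top_agent("What does the sign say?"))                    # -> demo answer
```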
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Bousselham, Walid, Petersen, Felix, Ferrari, Vittorio, Kuehne, Hilde
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. The evaluation shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
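The self-self attention path can be illustrated with a small sketch, given here as an informal reading of the abstract rather than the released GEM code: attention weights are computed between a projection and itself (q-q, k-k, v-v, generalizing the value-value attention of CLIPSurgery) and applied to the values; averaging the three paths is an assumption made purely for illustration.

```python
# Rough sketch contrasting the standard query-key path with a self-self path.
# Illustrative only; aggregation by averaging the q-q, k-k, v-v paths is an
# assumption of this sketch, not necessarily how GEM combines them.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qk_attention(q, k, v):
    """Standard attention path: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def self_self_attention(q, k, v):
    """Self-self path: weights come from a projection's similarity with itself
    (q-q, k-k, v-v; v-v alone is the CLIPSurgery case) and are applied to v."""
    d = q.shape[-1]
    return np.mean([softmax(x @ x.T / np.sqrt(d)) @ v for x in (q, k, v)], axis=0)

# Toy example with 5 tokens of dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
print(self_self_attention(q, k, v).shape)  # (5, 8)
```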
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization
Bugliarello, Emanuele, Moraldo, Hernan, Villegas, Ruben, Babaeizadeh, Mohammad, Saffar, Mohammad Taghi, Zhang, Han, Erhan, Dumitru, Ferrari, Vittorio, Kindermans, Pieter-Jan, Voigtlaender, Paul
Generating video stories from text prompts is a complex task. In addition to having high visual quality, videos need to realistically adhere to a sequence of text prompts whilst being consistent throughout the frames. Creating a benchmark for video generation requires data annotated over time, which contrasts with the single caption often used in video datasets. To fill this gap, we collect comprehensive human annotations on three existing datasets, and introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate forthcoming text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, where the next action must be generated starting from a conditioning video; story continuation, where a sequence of actions must be executed starting from a conditioning video; and story generation, where a video must be generated from only text prompts. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions. Finally, we establish guidelines for human evaluation of video stories, and reaffirm the need for better automatic metrics for video generation. StoryBench aims at encouraging future research efforts in this exciting new area. Work completed during an internship at Google.
Agile Modeling: From Concept to Classifier in Minutes
Stretcu, Otilia, Vendrow, Edward, Hata, Kenji, Viswanathan, Krishnamurthy, Ferrari, Vittorio, Tavakkol, Sasan, Zhou, Wenlei, Avinash, Aditya, Luo, Enming, Alldrin, Neil Gordon, Bateni, MohammadHossein, Berger, Gabriel, Bunner, Andrew, Lu, Chun-Ta, Rey, Javier A, DeSalvo, Giulia, Krishna, Ranjay, Fuxman, Ariel
The application of computer vision to nuanced subjective use cases is growing. While crowdsourcing has served the vision community well for most objective tasks (such as labeling a "zebra"), it now falters on tasks where there is substantial subjectivity in the concept (such as identifying "gourmet tuna"). However, empowering any user to develop a classifier for their concept is technically difficult: users are neither machine learning experts, nor have the patience to label thousands of examples. In response, we introduce the problem of Agile Modeling: the process of turning any subjective visual concept into a computer vision model through real-time user-in-the-loop interactions. We instantiate an Agile Modeling prototype for image classification and show through a user study (N=14) that users can create classifiers with minimal effort in under 30 minutes. We compare this user-driven process with the traditional crowdsourcing paradigm and find that the crowd's notion often differs from that of the user, especially as the concepts become more subjective. Finally, we scale our experiments with simulations of users training classifiers for ImageNet21k categories to further demonstrate the efficacy of our approach.
Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos
Rozumnyi, Denys, Oswald, Martin R., Ferrari, Vittorio, Pollefeys, Marc
We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video. To this end, we model the blurred appearance of a fast moving object in a generative fashion by parametrizing its 3D position, rotation, velocity, acceleration, bounces, shape, and texture over the duration of a predefined time window spanning multiple frames. Using differentiable rendering, we are able to estimate all parameters by minimizing the pixel-wise reprojection error to the input video via backpropagating through a rendering pipeline that accounts for motion blur by averaging the graphics output over short time intervals. For that purpose, we also estimate the camera exposure gap time within the same optimization. To account for abrupt motion changes like bounces, we model the motion trajectory as a piece-wise polynomial, and we are able to estimate the specific time of the bounce at sub-frame accuracy. Experiments on established benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.
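The core blur model can be sketched as follows: a sharp renderer is evaluated at several sub-frame times along a piecewise polynomial trajectory and the results are averaged over the exposure interval. The sketch below is illustrative only; the toy 1D renderer, the hand-set two-segment trajectory, and all names are placeholders rather than the authors' differentiable rendering pipeline.

```python
# Illustrative sketch of the blur model only: average sharp renders at
# sub-frame times along a piecewise polynomial trajectory.
import numpy as np

def trajectory(t, segments):
    """Piecewise polynomial position: segments = [(t_start, t_end, coeffs_per_axis)]."""
    for t0, t1, coeffs in segments:
        if t0 <= t <= t1:
            return np.array([np.polyval(c, t - t0) for c in coeffs])
    raise ValueError("t outside the modeled time window")

def blurred_frame(render_fn, segments, t_open, t_close, n_samples=16):
    """Average sharp renders over the exposure interval [t_open, t_close]."""
    times = np.linspace(t_open, t_close, n_samples)
    return np.mean([render_fn(trajectory(t, segments)) for t in times], axis=0)

def toy_render(pos, width=64):
    """1D 'image' whose bright peak follows the object's x position."""
    xs = np.arange(width)
    return np.exp(-0.5 * ((xs - pos[0]) / 2.0) ** 2)

# Two polynomial pieces meeting at a bounce at t = 0.5 (velocity flips sign).
segments = [(0.0, 0.5, [np.array([40.0, 10.0])]),    # x(t) =  40 t + 10
            (0.5, 1.0, [np.array([-40.0, 30.0])])]   # x(t) = -40 (t - 0.5) + 30
frame = blurred_frame(toy_render, segments, t_open=0.1, t_close=0.4)
print(frame.shape)  # (64,)
```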
Searching for objects driven by context
Alexe, Bogdan, Heess, Nicolas, Teh, Yee W., Ferrari, Vittorio
The dominant visual search paradigm for object class detection is sliding windows. Although simple and effective, it is also wasteful, unnatural and rigidly hardwired. We propose strategies to search for objects which intelligently explore the space of windows by making sequential observations at locations decided based on previous observations. Our strategies adapt to the class being searched and to the content of a particular test image. Their driving force is exploiting context as the statistical relation between the appearance of a window and its location relative to the object, as observed in the training set. In addition to being more elegant than sliding windows, we demonstrate experimentally on the PASCAL VOC 2010 dataset that our strategies evaluate two orders of magnitude fewer windows while at the same time achieving higher detection accuracy.
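One way to picture such a strategy is sketched below. This is a strong simplification made for illustration, not the paper's algorithm: each observed window predicts an offset toward the object from its appearance (the learned context relation), the accumulated predictions form a belief about the object location, and the next window to evaluate is chosen near that belief rather than by sliding over every location. The appearance and offset models are hypothetical inputs.

```python
# Schematic simplification, not the paper's algorithm: observed windows vote
# for the object centre via a learned appearance-to-offset mapping, and the
# next observation is taken near the current belief.
import numpy as np

def sequential_search(windows, appearance_fn, offset_model, n_steps=20, seed=0):
    """windows: (N, 4) float array of (x, y, w, h); offset_model maps a window
    descriptor to a predicted (dx, dy) pointing toward the object centre."""
    rng = np.random.default_rng(seed)
    centres = windows[:, :2] + windows[:, 2:] / 2.0
    idx, visited, votes = int(rng.integers(len(windows))), set(), []
    for _ in range(n_steps):
        visited.add(idx)
        dx, dy = offset_model(appearance_fn(windows[idx]))
        votes.append(centres[idx] + np.array([dx, dy]))  # vote for the object centre
        belief = np.mean(votes, axis=0)                  # current location estimate
        remaining = [i for i in range(len(windows)) if i not in visited]
        if not remaining:
            break
        dists = np.linalg.norm(centres[remaining] - belief, axis=1)
        idx = remaining[int(np.argmin(dists))]           # observe the next window there
    return np.mean(votes, axis=0)                        # estimated object centre
```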
Exploiting spatial overlap to efficiently compute appearance distances between image windows
Alexe, Bogdan, Petrescu, Viviana, Ferrari, Vittorio
We present a computationally efficient technique to compute the distance of high-dimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average.
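The pruning idea can be sketched with a hedged example. Assuming an upper bound B(overlap) on the appearance distance between two overlapping windows of the same image, the triangle inequality brackets the distance from a query window a to a window w2 within [d(a, w1) - B, d(a, w1) + B], so many candidate pairs can be accepted or rejected without computing their descriptor distance. The bound function and all names below are illustrative, not the paper's derivation.

```python
# Hedged sketch of the pruning scheme. overlap_bound is a made-up stand-in
# for the paper's derived bound; the triangle-inequality bookkeeping is our
# reading of the abstract.
import numpy as np

def overlap_bound(iou, slope=2.0):
    """Illustrative bound: distance grows at most linearly as overlap shrinks."""
    return slope * (1.0 - iou)

def similar_pairs(desc_a, desc_b, iou_b, threshold):
    """Find pairs (i, j) with ||desc_a[i] - desc_b[j]|| < threshold, where
    iou_b[j, k] is the spatial overlap between windows j and k of image B."""
    n_computed, pairs = 0, []
    for i, a in enumerate(desc_a):
        undecided = set(range(len(desc_b)))
        while undecided:
            j = undecided.pop()
            d = np.linalg.norm(a - desc_b[j])
            n_computed += 1
            if d < threshold:
                pairs.append((i, j))
            for k in list(undecided):           # decide overlapping neighbours for free
                bound = overlap_bound(iou_b[j, k])
                if d + bound < threshold:       # surely below the threshold
                    pairs.append((i, k))
                    undecided.discard(k)
                elif d - bound >= threshold:    # surely above the threshold
                    undecided.discard(k)
    return pairs, n_computed
```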
Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Luo, Jie, Caputo, Barbara, Ferrari, Vittorio
Given a corpus of news items consisting of images accompanied by text captions, we want to find out "who's doing what", i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint 'face and pose' model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images.
Learning Visual Attributes
Ferrari, Vittorio, Zisserman, Andrew
We present a probabilistic generative model of visual attributes, together with an efficient learning algorithm. Attributes are visual qualities of objects, such as 'red', 'striped', or 'spotted'. The model sees attributes as patterns of image segments, repeatedly sharing some characteristic properties. These can be any combination of appearance, shape, or the layout of segments within the pattern. Moreover, attributes with general appearance are taken into account, such as the pattern of alternation of any two colors which is characteristic for stripes. To enable learning from unsegmented training images, the model is learnt discriminatively, by optimizing a likelihood ratio. As demonstrated in the experimental evaluation, our model can learn in a weakly supervised setting and encompasses a broad range of attributes. We show that attributes can be learnt starting from a text query to Google image search, and can then be used to recognize the attribute and determine its spatial extent in novel real-world images.
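The discriminative criterion can be made concrete with a toy sketch: parameters are chosen to maximize a likelihood ratio, i.e. how much better the attribute model explains images weakly labelled with the attribute than a background model does. The functions below are hypothetical stand-ins, not the paper's learning algorithm.

```python
# Toy sketch of a likelihood-ratio criterion; loglik_fn and background_loglik
# are hypothetical stand-ins for the generative attribute model and a generic
# background model.
import numpy as np

def log_likelihood_ratio(log_p_model, log_p_background):
    """Sum over images of log p(x | attribute) - log p(x | background)."""
    return float(np.sum(np.asarray(log_p_model) - np.asarray(log_p_background)))

def fit_by_ratio(candidate_params, loglik_fn, background_loglik, images):
    """Pick the parameter setting with the highest likelihood ratio on weakly
    labelled images (e.g. those returned by a 'striped' text query)."""
    scores = [log_likelihood_ratio([loglik_fn(x, theta) for x in images],
                                   [background_loglik(x) for x in images])
              for theta in candidate_params]
    return candidate_params[int(np.argmax(scores))]
```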