Fu, Stephanie
When Does Perceptual Alignment Benefit Vision Representations?
Sundaram, Shobhita, Fu, Stephanie, Muttenthaler, Lukas, Tamir, Netanel Y., Chai, Lucy, Kornblith, Simon, Darrell, Trevor, Isola, Phillip
Humans judge perceptual similarity according to diverse visual attributes, including scene layout, subject location, and camera pose. Existing vision models understand a wide range of semantic abstractions but improperly weigh these attributes and thus make inferences misaligned with human perception. While vision representations have previously benefited from alignment in contexts like image generation, the utility of perceptually aligned representations in more general-purpose settings remains unclear. Here, we investigate how aligning vision model representations to human perceptual judgments impacts their usability across diverse computer vision tasks. We finetune state-of-the-art models on human similarity judgments for image triplets and evaluate them across standard vision benchmarks. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks, including counting, segmentation, depth estimation, instance retrieval, and retrieval-augmented generation. In addition, we find that performance is widely preserved on other tasks, including specialized out-of-distribution domains such as medical imaging and 3D environment frames. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
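The key mechanism here is finetuning on two-alternative forced-choice triplets: a reference image plus two alternatives, with the human vote indicating which alternative is more similar. The sketch below shows a minimal version of such an alignment objective, assuming a PyTorch image encoder; the margin hinge loss and all names are illustrative rather than the authors' exact training recipe.

```python
# A minimal sketch (not the authors' exact training code) of aligning a vision
# backbone to human triplet similarity judgments: given a reference image and
# two alternatives, push the human-preferred alternative closer in embedding space.
import torch
import torch.nn.functional as F

def triplet_alignment_loss(backbone, ref, img_a, img_b, human_prefers_a, margin=0.05):
    """Hinge loss on cosine similarities for a 2AFC triplet judgment.

    `backbone` is any image encoder returning one embedding per image;
    `human_prefers_a` is a boolean tensor saying which alternative humans chose.
    """
    z_ref = F.normalize(backbone(ref), dim=-1)
    z_a = F.normalize(backbone(img_a), dim=-1)
    z_b = F.normalize(backbone(img_b), dim=-1)

    sim_a = (z_ref * z_a).sum(dim=-1)   # cosine similarity to alternative A
    sim_b = (z_ref * z_b).sum(dim=-1)   # cosine similarity to alternative B

    # Signed gap: positive when the embedding agrees with the human choice.
    gap = torch.where(human_prefers_a, sim_a - sim_b, sim_b - sim_a)
    return F.relu(margin - gap).mean()
```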
Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies
Gupta, Ritwik, Walker, Leah, Corona, Rodolfo, Fu, Stephanie, Petryk, Suzanne, Napolitano, Janet, Darrell, Trevor, Reddie, Andrew W.
Current regulations on powerful AI capabilities are narrowly focused on "foundation" or "frontier" models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) "small" models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.
OpenStreetView-5M: The Many Roads to Global Visual Geolocation
Astruc, Guillaume, Dufour, Nicolas, Siglidis, Ioannis, Aronssohn, Constantin, Bouia, Nacim, Fu, Stephanie, Loiseau, Romain, Nguyen, Van Nguyen, Raude, Charles, Vincent, Elliot, Xu, Lintao, Zhou, Hongyu, Landrieu, Loic
Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet, the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images, covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies. All associated code and models can be found at https://github.com/gastruc/osv5m.
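For context on how such a benchmark is typically scored, the sketch below computes the great-circle (haversine) distance between predicted and ground-truth coordinates along with an accuracy-at-threshold statistic; the function names are illustrative and not part of the osv5m codebase.

```python
# A hedged sketch of the kind of metric commonly used to score global visual
# geolocation: great-circle (haversine) distance between predicted and true
# coordinates, and the fraction of predictions within a distance threshold.
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_at_threshold(preds, truths, threshold_km=25.0):
    """Fraction of (lat, lon) predictions within `threshold_km` of the ground truth."""
    hits = sum(haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km for p, t in zip(preds, truths))
    return hits / len(truths)
```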
FeatUp: A Model-Agnostic Framework for Features at Any Resolution
Fu, Stephanie, Hamilton, Mark, Brandt, Laura, Feldman, Axel, Zhang, Zhoutong, Freeman, William T.
Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with a high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs: FeatUp learns to upsample features through a consistency loss on the low-resolution "views" of a model's features that arise from slight transformations ("jitters") of the input image. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
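A minimal sketch of the multi-view consistency idea follows, under simplifying assumptions: cyclic shifts stand in for the jitter transformations, average pooling stands in for the learned downsampler, and `backbone` and `upsampler` are placeholder callables rather than FeatUp's actual modules.

```python
# A simplified sketch of a multi-view consistency loss: upsampled features,
# when jittered and re-downsampled, should reproduce the backbone's
# low-resolution features on the correspondingly jittered image.
import torch
import torch.nn.functional as F

def multiview_consistency_loss(backbone, upsampler, image, num_views=4, max_shift=8):
    """`backbone` maps an image to low-res features (B, C, h, w);
    `upsampler` maps (low-res features, guidance image) to high-res features (B, C, H, W)."""
    lr_feats = backbone(image)              # low-resolution features
    hr_feats = upsampler(lr_feats, image)   # predicted high-resolution features
    loss = 0.0
    for _ in range(num_views):
        # Random small translation ("jitter"), applied consistently to the
        # image and to the high-res prediction (cyclic shift for simplicity).
        dx = torch.randint(-max_shift, max_shift + 1, (1,)).item()
        dy = torch.randint(-max_shift, max_shift + 1, (1,)).item()
        jittered_img = torch.roll(image, shifts=(dy, dx), dims=(-2, -1))
        jittered_hr = torch.roll(hr_feats, shifts=(dy, dx), dims=(-2, -1))
        # The backbone's features on the jittered view are the low-res "observation".
        target = backbone(jittered_img)
        # Downsample the jittered high-res prediction to the observation's resolution.
        pred = F.adaptive_avg_pool2d(jittered_hr, target.shape[-2:])
        loss = loss + F.mse_loss(pred, target)
    return loss / num_views
```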
A Vision Check-up for Language Models
Sharma, Pratyusha, Shaham, Tamar Rott, Baradad, Manel, Fu, Stephanie, Rodriguez-Munoz, Adrian, Duggal, Shivam, Isola, Phillip, Torralba, Antonio
What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.
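One way to picture the "images as code" setup: prompt an LLM for drawing code describing a visual concept, then render that code to pixels for evaluation or for training a vision model. The sketch below assumes a hypothetical `query_llm` text-completion helper and uses matplotlib as the rendering target; neither is an API from the paper.

```python
# A hedged sketch of rendering LLM-generated drawing code into an image.
# `query_llm` is a hypothetical placeholder for whatever text-completion API
# you use; it is not an interface defined by the paper.
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

def render_llm_drawing(concept, query_llm, out_path="render.png"):
    prompt = (
        f"Write Python matplotlib code that draws: {concept}. "
        "Use only matplotlib, draw on the current figure, and do not call plt.show()."
    )
    code = query_llm(prompt)          # hypothetical LLM call returning a code string
    fig = plt.figure(figsize=(2.56, 2.56), dpi=100)
    try:
        # Caution: executing model-generated code should be sandboxed in practice.
        exec(code, {"plt": plt})
        fig.savefig(out_path)
    finally:
        plt.close(fig)
    return out_path
```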
DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data
Fu, Stephanie, Tamir, Netanel, Sundaram, Shobhita, Chai, Lucy, Zhang, Richard, Dekel, Tali, Isola, Phillip
Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
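Concretely, a metric of this kind reduces to a distance between learned embeddings, scored by how often it agrees with the human vote on each triplet. Below is a minimal sketch, not the released DreamSim package: cosine distance on a generic backbone's embeddings plus a 2AFC agreement score.

```python
# A minimal sketch of an embedding-based perceptual metric and how it is
# scored against 2AFC human judgments. `backbone` is any image encoder.
import torch
import torch.nn.functional as F

def perceptual_distance(backbone, x, y):
    """Cosine distance between embeddings of two image batches."""
    zx = F.normalize(backbone(x), dim=-1)
    zy = F.normalize(backbone(y), dim=-1)
    return 1.0 - (zx * zy).sum(dim=-1)

def twoafc_accuracy(backbone, refs, imgs_a, imgs_b, humans_prefer_a):
    """Fraction of triplets where the metric picks the same image as the human vote."""
    d_a = perceptual_distance(backbone, refs, imgs_a)
    d_b = perceptual_distance(backbone, refs, imgs_b)
    metric_prefers_a = d_a < d_b
    return (metric_prefers_a == humans_prefer_a).float().mean().item()
```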
Conditional Image Retrieval
Hamilton, Mark, Fu, Stephanie, Lu, Mindren, Freeman, William T.
This work introduces Conditional Image Retrieval (CIR) systems: image retrieval (IR) methods that can efficiently specialize to specific subsets of images on the fly. These systems broaden the class of queries IR systems support, and eliminate the need for expensive re-fitting to specific subsets of data. Specifically, we adapt tree-based K-Nearest Neighbor (KNN) data structures to the conditional setting by introducing additional inverted-index data structures. This speeds up conditional queries and does not slow queries without conditioning. We present two new datasets for evaluating the performance of CIR systems and evaluate a variety of design choices. As a motivating application, we present an algorithm that can explore shared semantic content between works of art of vastly different media and cultural origin. Finally, we demonstrate that CIR data structures can identify Generative Adversarial Network (GAN) "blind spots": areas where GANs fail to properly model the true data distribution.
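The core data-structure idea can be illustrated with a much simpler stand-in than the paper's tree-based KNN: an inverted index from condition to item ids, so a query searches only the matching subset without re-fitting anything. The brute-force sketch below is illustrative only; the class and method names are hypothetical.

```python
# A hedged sketch of conditional retrieval (not the paper's tree-based
# implementation): an inverted index from condition to item ids lets a query
# search only the matching subset, with no per-subset index re-fitting.
from collections import defaultdict
import numpy as np

class ConditionalRetriever:
    def __init__(self, embeddings, conditions):
        """`embeddings`: (N, D) array; `conditions`: length-N list of labels (e.g. artist, medium)."""
        self.embeddings = np.asarray(embeddings, dtype=np.float32)
        self.inverted = defaultdict(list)          # condition -> row ids
        for idx, cond in enumerate(conditions):
            self.inverted[cond].append(idx)

    def query(self, q, condition, k=5):
        """Return ids and distances of the k nearest items that satisfy `condition`."""
        ids = np.array(self.inverted[condition])
        subset = self.embeddings[ids]
        dists = np.linalg.norm(subset - np.asarray(q, dtype=np.float32), axis=1)
        top = np.argsort(dists)[:k]
        return ids[top], dists[top]
```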