AITopics | Hoiem, Derek

Collaborating Authors

Hoiem, Derek

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Can We Generate Visual Programs Without Prompting LLMs?

Shlapentokh-Rothman, Michal, Wang, Yu-Xiong, Hoiem, Derek

arXiv.org Artificial IntelligenceDec-11-2024

Visual programming prompts LLMs (large language mod-els) to generate executable code for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to improve while also being unreliable and costly in both time and money. Our goal is to develop an efficient visual programming system without 1) using prompt-based LLMs at inference time and 2) a large set of program and answer annotations. We develop a synthetic data augmentation approach and alternative program generation method based on decoupling programs into higher-level skills called templates and the corresponding arguments. Our results show that with data augmentation, prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with state-of-the art models with the added benefit of much faster inference

annotation, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.08564

Country: North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Lu, Jiasen, Clark, Christopher, Lee, Sangho, Zhang, Zichen, Khosla, Savya, Marten, Ryan, Hoiem, Derek, Kembhavi, Aniruddha

arXiv.org Artificial IntelligenceDec-28-2023

We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2312.17172

Country: North America > United States > Illinois (0.14)

Genre: Research Report (0.81)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
(3 more...)

Add feedback

ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

Chen, Yangyi, Wang, Xingyao, Li, Manling, Hoiem, Derek, Ji, Heng

arXiv.org Artificial IntelligenceNov-22-2023

State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is public at \url{https://github.com/Yangyi-Chen/vi-struct}.

data mining, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2311.13258

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.28)
(2 more...)

Genre:

Research Report (1.00)
Overview (0.68)

Industry: Education (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

WebWISE: Web Interface Control and Sequential Exploration with Large Language Models

Tao, Heyi, T, Sethuraman V, Shlapentokh-Rothman, Michal, Hoiem, Derek

arXiv.org Artificial IntelligenceOct-24-2023

The paper investigates using a Large Language Model (LLM) to automatically perform web software tasks using click, scroll, and text input operations. Previous approaches, such as reinforcement learning (RL) or imitation learning, are inefficient to train and task-specific. Our method uses filtered Document Object Model (DOM) elements as observations and performs tasks step-by-step, sequentially generating small programs based on the current observations. We use in-context learning, either benefiting from a single manually provided example, or an automatically generated example based on a successful zero-shot trial. We evaluate the proposed method on the MiniWob++ benchmark. With only one in-context example, our WebWISE method achieves similar or better performance than other methods that require many demonstrations or trials.

interface control and sequential exploration, large language model, natural language, (2 more...)

arXiv.org Artificial Intelligence

2310.16042

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

StyleGAN knows Normal, Depth, Albedo, and More

Bhattad, Anand, McKee, Daniel, Hoiem, Derek, Forsyth, D. A.

arXiv.org Artificial IntelligenceJun-1-2023

Intrinsic images, in the original sense, are image-like maps of scene properties like depth, normal, albedo or shading. This paper demonstrates that StyleGAN can easily be induced to produce intrinsic images. The procedure is straightforward. We show that, if StyleGAN produces $G({w})$ from latents ${w}$, then for each type of intrinsic image, there is a fixed offset ${d}_c$ so that $G({w}+{d}_c)$ is that type of intrinsic image for $G({w})$. Here ${d}_c$ is {\em independent of ${w}$}. The StyleGAN we used was pretrained by others, so this property is not some accident of our training regime. We show that there are image transformations StyleGAN will {\em not} produce in this fashion, so StyleGAN is not a generic image regression engine. It is conceptually exciting that an image generator should ``know'' and represent intrinsic images. There may also be practical advantages to using a generative model to produce intrinsic images. The intrinsic images obtained from StyleGAN compare well both qualitatively and quantitatively with those obtained by using SOTA image regression techniques; but StyleGAN's intrinsic images are robust to relighting effects, unlike SOTA methods.

artificial intelligence, machine learning, stylegan, (13 more...)

arXiv.org Artificial Intelligence

2306.00987

Country: North America > United States > Illinois (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.35)

Add feedback

Make It So: Steering StyleGAN for Any Image Inversion and Editing

Bhattad, Anand, Shah, Viraj, Hoiem, Derek, Forsyth, D. A.

arXiv.org Artificial IntelligenceApr-27-2023

StyleGAN's disentangled style representation enables powerful image editing by manipulating the latent variables, but accurately mapping real-world images to their latent variables (GAN inversion) remains a challenge. Existing GAN inversion methods struggle to maintain editing directions and produce realistic results. To address these limitations, we propose Make It So, a novel GAN inversion method that operates in the $\mathcal{Z}$ (noise) space rather than the typical $\mathcal{W}$ (latent style) space. Make It So preserves editing capabilities, even for out-of-domain images. This is a crucial property that was overlooked in prior methods. Our quantitative evaluations demonstrate that Make It So outperforms the state-of-the-art method PTI~\cite{roich2021pivotal} by a factor of five in inversion accuracy and achieves ten times better edit quality for complex indoor scenes.

artificial intelligence, inversion, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2304.14403

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

Webly Supervised Concept Expansion for General Purpose Vision Models

Kamath, Amita, Clark, Christopher, Gupta, Tanmay, Kolve, Eric, Hoiem, Derek, Kembhavi, Aniruddha

arXiv.org Artificial IntelligenceJul-20-2022

General Purpose Vision (GPV) systems are models that are designed to solve a wide array of visual tasks without requiring architectural changes. Today, GPVs primarily learn both skills and concepts from large fully supervised datasets. Scaling GPVs to tens of thousands of concepts by acquiring data to learn each concept for every skill quickly becomes prohibitive. This work presents an effective and inexpensive alternative: learn skills from supervised datasets, learn concepts from web image search, and leverage a key characteristic of GPVs: the ability to transfer visual knowledge across skills. We use a dataset of 1M+ images spanning 10k+ visual concepts to demonstrate webly-supervised concept expansion for two existing GPVs (GPV-1 and VL-T5) on 3 benchmarks: 5 Coco-based datasets (80 primary concepts), a newly curated series of 5 datasets based on the OpenImages and VisualGenome repositories ( 500 concepts), and the Web-derived dataset (10k+ concepts). We also propose a new architecture, GPV-2 that supports a variety of tasks -- from vision tasks like classification and localization to vision+language tasks like QA and captioning, to more niche ones like human-object interaction detection. GPV-2 benefits hugely from web data and outperforms GPV-1 and VL-T5 across these benchmarks. Our data, code, and web demo are available at https://prior.allenai.org/projects/gpv2.

gpv-2, machine learning, pattern recognition, (16 more...)

arXiv.org Artificial Intelligence

2202.02317

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Immunology (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Towards General Purpose Vision Systems

Gupta, Tanmay, Kamath, Amita, Kembhavi, Aniruddha, Hoiem, Derek

arXiv.org Artificial IntelligenceApr-1-2021

A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.

gpv-i, ground transportation, machine translation, (21 more...)

arXiv.org Artificial Intelligence

2104.00743

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Industry:

Transportation > Ground > Road (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.67)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.46)

Add feedback

Learning Curves for Analysis of Deep Networks

Hoiem, Derek, Gupta, Tanmay, Li, Zhizhong, Shlapentokh-Rothman, Michal M.

arXiv.org Machine LearningOct-21-2020

A learning curve models a classifier's test error as a function of the number of training samples. Prior works show that learning curves can be used to select model parameters and extrapolate performance. We investigate how to use learning curves to analyze the impact of design choices, such as pre-training, architecture, and data augmentation. We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations. We also provide several interesting observations based on learning curves for a variety of image classification models.

deep learning, imagenet, neural network, (17 more...)

arXiv.org Machine Learning

2010.11029

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Contrastive Learning for Weakly Supervised Phrase Grounding

Gupta, Tanmay, Vahdat, Arash, Chechik, Gal, Yang, Xiaodong, Kautz, Jan, Hoiem, Derek

arXiv.org Machine LearningAug-5-2020

Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on Flickr30K Entities benchmark.

caption, ground transportation, neural network, (19 more...)

arXiv.org Machine Learning

2006.0992

Country: North America > United States > Illinois (0.14)

Genre: Research Report (0.82)

Industry: Transportation > Ground > Road (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.46)

Add feedback