AITopics | Vondrick, Carl

Collaborating Authors

Vondrick, Carl

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

DiSciPLE: Learning Interpretable Programs for Scientific Visual Discovery

Mall, Utkarsh, Phoo, Cheng Perng, Chiquier, Mia, Hariharan, Bharath, Bala, Kavita, Vondrick, Carl

arXiv.org Artificial IntelligenceFeb-14-2025

Visual data is used in numerous different scientific workflows ranging from remote sensing to ecology. As the amount of observation data increases, the challenge is not just to make accurate predictions but also to understand the underlying mechanisms for those predictions. Good interpretation is important in scientific workflows, as it allows for better decision-making by providing insights into the data. This paper introduces an automatic way of obtaining such interpretable-by-design models, by learning programs that interleave neural networks. We propose DiSciPLE (Discovering Scientific Programs using LLMs and Evolution) an evolutionary algorithm that leverages common sense and prior knowledge of large language models (LLMs) to create Python programs explaining visual data. Additionally, we propose two improvements: a program critic and a program simplifier to improve our method further to synthesize good programs. On three different real-world problems, DiSciPLE learns state-of-the-art programs on novel tasks with no prior literature. For example, we can learn programs with 35% lower error than the closest non-interpretable baseline for population density estimation.

evolutionary algorithm, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.1006

Country: North America > United States (1.00)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Education (0.66)
Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.89)

Add feedback

Generative Data Mining with Longtail-Guided Diffusion

Hayden, David S., Ye, Mao, Garipov, Timur, Meyer, Gregory P., Vondrick, Carl, Chen, Zhao, Chai, Yuning, Wolff, Eric, Srinivasa, Siddhartha S.

arXiv.org Artificial IntelligenceFeb-3-2025

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on image classification benchmarks, and can be analyzed to proactively discover, explain, and address conceptual gaps in a predictive model.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.0198

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(2 more...)

Add feedback

Self-Improving Autonomous Underwater Manipulation

Liu, Ruoshi, Ha, Huy, Hou, Mengxue, Song, Shuran, Vondrick, Carl

arXiv.org Artificial IntelligenceOct-24-2024

Abstract-- Underwater robotic manipulation faces significant challenges due to complex fluid dynamics and unstructured environments, causing most manipulation systems to rely heavily on human teleoperation. In this paper, we introduce AquaBot, a fully autonomous manipulation system that combines behavior cloning from human demonstrations with self-learning optimization to improve beyond human teleoperation performance. With extensive real-world experiments, we demonstrate AquaBot's versatility across diverse manipulation tasks, including object grasping, trash sorting, and rescue retrieval. Our real-world experiments show that AquaBot's self-optimized policy outperforms a human operator by 41% in speed. AquaBot represents a promising step towards autonomous and self-improving underwater manipulation systems.

artificial intelligence, machine learning, manipulation, (15 more...)

arXiv.org Artificial Intelligence

2410.18969

Genre: Research Report (0.40)

Industry: Energy > Oil & Gas (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Differentiable Robot Rendering

Liu, Ruoshi, Canberk, Alper, Song, Shuran, Vondrick, Carl

arXiv.org Artificial IntelligenceOct-17-2024

Vision foundation models trained on massive amounts of visual data have shown unprecedented reasoning and planning skills in open-world settings. A key challenge in applying them to robotic tasks is the modality gap between visual data and action data. We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters. Our model integrates a kinematics-aware deformable model and Gaussians Splatting and is compatible with any robot form factors and degrees of freedom. We demonstrate its capability and usage in applications including reconstruction of robot poses from images and controlling robots through vision language models. Quantitative and qualitative results show that our differentiable rendering model provides effective gradients for robotic control directly from pixels, setting the foundation for the future applications of vision foundation models in robotics.

artificial intelligence, gaussian, robot, (16 more...)

arXiv.org Artificial Intelligence

2410.13851

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Add feedback

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Van Hoorick, Basile, Wu, Rundi, Ozguroglu, Ege, Sargent, Kyle, Liu, Ruoshi, Tokmakov, Pavel, Dave, Achal, Zheng, Changxi, Vondrick, Carl

arXiv.org Artificial IntelligenceJul-5-2024

Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose $\textbf{GCD}$, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.

large language model, machine learning, natural language, (13 more...)

arXiv.org Artificial Intelligence

2405.14868

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Liang, Junbang, Liu, Ruoshi, Ozguroglu, Ege, Sudhakar, Sruthi, Dave, Achal, Tokmakov, Pavel, Song, Shuran, Vondrick, Carl

arXiv.org Artificial IntelligenceJun-24-2024

A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.

artificial intelligence, diffusion policy, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2406.16862

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Menon, Sachit, Zemel, Richard, Vondrick, Carl

arXiv.org Artificial IntelligenceJun-20-2024

When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2406.14562

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding

Ananthram, Amith, Stengel-Eskin, Elias, Vondrick, Carl, Bansal, Mohit, McKeown, Kathleen

arXiv.org Artificial IntelligenceJun-17-2024

Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2406.11665

Country:

North America > United States (0.93)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.63)

Add feedback

Evolving Interpretable Visual Classifiers with Large Language Models

Chiquier, Mia, Mall, Utkarsh, Vondrick, Carl

arXiv.org Artificial IntelligenceApr-15-2024

Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, risk for bias, and inability to discover new visual concepts not written down. Moreover, in practical settings, the vocabulary for class names and attributes of specialized concepts will not be known, preventing these methods from performing well on images uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.

classification, large language model, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2404.09941

Country: North America > United States (0.28)

Genre: Research Report > Promising Solution (0.34)

Industry: Transportation > Air (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

SelfIE: Self-Interpretation of Large Language Model Embeddings

Chen, Haozhe, Vondrick, Carl, Mao, Chengzhi

arXiv.org Artificial IntelligenceMar-25-2024

How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model developments. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions on hidden embeddings also open up new avenues to control LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of individual layer. We extend RLHF to hidden embeddings and propose Reinforcement Control that erases harmful knowledge in LLM without supervision targets.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2403.10949

Country:

North America > United States (1.00)
Asia (0.68)
North America > Canada > Quebec > Montreal (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback