Vondrick, Carl
PaperBot: Learning to Design Real-World Tools Using Paper
Liu, Ruoshi, Liang, Junbang, Sudhakar, Sruthi, Ha, Huy, Chi, Cheng, Song, Shuran, Vondrick, Carl
Paper is a cheap, recyclable, and clean material that is often used to make practical tools. Traditional tool design relies on either simulation or physical analysis, which is often inaccurate and time-consuming. In this paper, we propose PaperBot, an approach that directly learns to design and use a tool in the real world using paper, without human intervention. We demonstrate the effectiveness and efficiency of PaperBot on two tool design tasks: (1) learning to fold and throw paper airplanes for maximum travel distance, and (2) learning to cut paper into grippers that exert maximum gripping force. We present a self-supervised learning framework that learns to perform a sequence of folding, cutting, and dynamic manipulation actions in order to optimize the design and use of a tool. We deploy our system on a real-world two-arm robotic system to solve challenging design tasks that involve aerodynamics (paper airplane) and friction (paper gripper) that are impossible to simulate accurately.
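As a rough illustration of the self-supervised design loop described above (not the authors' actual framework), the sketch below runs a generic black-box optimizer over a vector of fold parameters; the robot rollout and reward function are placeholders, and the cross-entropy-method optimizer is an assumption.

```python
# Hypothetical sketch of a self-supervised tool-design loop in the spirit of PaperBot.
# The real system folds, throws, and measures on a two-arm robot; here `rollout` is a
# stand-in reward and the optimizer is a simple cross-entropy method (an assumption).
import numpy as np

def rollout(fold_params: np.ndarray) -> float:
    """Stand-in for folding, throwing, and measuring travel distance on hardware."""
    target = np.array([0.7, 0.3, 0.5, 0.9])        # unknown "good" design (illustrative)
    return -np.sum((fold_params - target) ** 2)    # higher is better

def optimize_design(n_iters=20, pop=32, elite=8, dim=4, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.full(dim, 0.5), np.full(dim, 0.25)   # fold parameters in [0, 1]
    for _ in range(n_iters):
        samples = np.clip(rng.normal(mean, std, size=(pop, dim)), 0.0, 1.0)
        rewards = np.array([rollout(s) for s in samples])   # self-supervised signal
        elites = samples[np.argsort(rewards)[-elite:]]      # keep the best designs
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mean

if __name__ == "__main__":
    print("best fold parameters:", optimize_design().round(3))
```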
GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
Hamdi, Abdullah, Melas-Kyriazi, Luke, Qian, Guocheng, Mai, Jinjie, Liu, Ruoshi, Vondrick, Carl, Ghanem, Bernard, Vedaldi, Andrea
Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However, it may require a large number of Gaussians, which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting), a novel representation that employs the Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer particles to represent a scene and thus significantly outperforming Gaussian Splatting methods in efficiency, with plug-and-play replacement of Gaussian-based utilities. GES is validated theoretically and empirically in both a principled 1D setup and realistic 3D scenes. It is shown to represent signals with sharp edges more accurately; such signals are typically challenging for Gaussians due to their inherent low-pass characteristics. Our empirical analysis demonstrates that GEF outperforms Gaussians in fitting naturally occurring signals (e.g. squares, triangles, and parabolic signals), thereby reducing the need for extensive splitting operations that increase the memory footprint of Gaussian Splatting. With the aid of a frequency-modulated loss, GES achieves competitive performance in novel-view synthesis benchmarks while requiring less than half the memory storage of Gaussian Splatting and increasing the rendering speed by up to 39%. The code is available on the project website: https://abdullahamdi.com/ges
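A minimal 1D illustration of the claim above (not the paper's code): the generalized exponential kernel f(x) = A * exp(-(|x - mu| / alpha) ** beta) recovers a Gaussian at beta = 2, while a free beta lets it fit sharp-edged signals such as a square pulse far more closely. The target signal and fitting setup below are assumptions chosen for illustration.

```python
# Compare a Gaussian (beta fixed to 2) against a Generalized Exponential Function
# with a free beta when fitting a sharp-edged square pulse in 1D.
import numpy as np
from scipy.optimize import curve_fit

def gef(x, A, mu, alpha, beta):
    return A * np.exp(-(np.abs(x - mu) / alpha) ** beta)

def gaussian(x, A, mu, alpha):
    return gef(x, A, mu, alpha, 2.0)

x = np.linspace(-1.0, 1.0, 400)
square = (np.abs(x) < 0.5).astype(float)          # sharp-edged target signal

g_params, _ = curve_fit(gaussian, x, square, p0=[1.0, 0.0, 0.5],
                        bounds=([0.0, -1.0, 1e-3], [2.0, 1.0, 2.0]))
e_params, _ = curve_fit(gef, x, square, p0=[1.0, 0.0, 0.5, 2.0],
                        bounds=([0.0, -1.0, 1e-3, 0.1], [2.0, 1.0, 2.0, 20.0]))

print("Gaussian MSE:", np.mean((gaussian(x, *g_params) - square) ** 2))
print("GEF MSE:     ", np.mean((gef(x, *e_params) - square) ** 2))
print("fitted beta: ", e_params[3])   # large beta -> near-flat top with sharp falloff
```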
ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation
Yu, Sungduk, Hannah, Walter, Peng, Liran, Lin, Jerry, Bhouri, Mohamed Aziz, Gupta, Ritwik, Lütjens, Björn, Will, Justus Christopher, Behrens, Gunnar, Busecke, Julius, Loose, Nora, Stern, Charles I, Beucler, Tom, Harrop, Bryce, Hillman, Benjamin R, Jenney, Andrea, Ferretti, Savannah, Liu, Nana, Anandkumar, Anima, Brenowitz, Noah D, Eyring, Veronika, Geneva, Nicholas, Gentine, Pierre, Mandt, Stephan, Pathak, Jaideep, Subramaniam, Akshay, Vondrick, Carl, Yu, Rose, Zanna, Laure, Zheng, Tian, Abernathey, Ryan, Ahmed, Fiaz, Bader, David C, Baldi, Pierre, Barnes, Elizabeth, Bretherton, Christopher, Caldwell, Peter, Chuang, Wayne, Han, Yilun, Huang, Yu, Iglesias-Suarez, Fernando, Jantre, Sanket, Kashinath, Karthik, Khairoutdinov, Marat, Kurth, Thorsten, Lutsko, Nicholas, Ma, Po-Lun, Mooers, Griffin, Neelin, J. David, Randall, David, Shamekh, Sara, Taylor, Mark A, Urban, Nathan, Yuval, Janni, Zhang, Guang, Pritchard, Michael
Modern climate projections lack adequate spatial and temporal resolution due to computational constraints. A consequence is inaccurate and imprecise predictions of critical processes such as storms. Hybrid methods that combine physics with machine learning (ML) have introduced a new generation of higher fidelity climate simulators that can sidestep Moore's Law by outsourcing compute-hungry, short, high-resolution simulations to ML emulators. However, this hybrid ML-physics simulation approach requires domain-specific treatment and has been inaccessible to ML experts because of a lack of training data and relevant, easy-to-use workflows. We present ClimSim, the largest-ever dataset designed for hybrid ML-physics research. It comprises multi-scale climate simulations, developed by a consortium of climate scientists and ML researchers. It consists of 5.7 billion pairs of multivariate input and output vectors that isolate the influence of locally-nested, high-resolution, high-fidelity physics on a host climate simulator's macro-scale physical state. The dataset is global in coverage, spans multiple years at high sampling frequency, and is designed such that resulting emulators are compatible with downstream coupling into operational climate simulators. We implement a range of deterministic and stochastic regression baselines to highlight the ML challenges and their scoring.
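To make the baseline setting concrete, here is a hedged sketch of the kind of deterministic regression baseline the dataset targets: an MLP mapping a multivariate input vector to the corresponding output vector. The dimensions, data, and architecture are placeholders, not ClimSim's actual variable layout or released baselines.

```python
# Placeholder deterministic emulator: learn a mapping from input vectors to output
# vectors with mean-squared-error regression. Dimensions and data are illustrative.
import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 124, 128            # illustrative sizes, not the real ClimSim dims

model = nn.Sequential(
    nn.Linear(IN_DIM, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, OUT_DIM),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(4096, IN_DIM)          # stand-in for input/output vector pairs
y = torch.randn(4096, OUT_DIM)

for step in range(100):
    pred = model(x)
    loss = loss_fn(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```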
pix2gestalt: Amodal Segmentation by Synthesizing Wholes
Ozguroglu, Ege, Liu, Ruoshi, Surís, Dídac, Chen, Dian, Dave, Achal, Tokmakov, Pavel, Vondrick, Carl
Our approach capitalizes on denoising diffusion models [14] and transfers their representations to this task. Such models are excellent representations of the natural image manifold and capture all different types of whole objects and their occlusions. Due to their large-scale training data, we hypothesize such pretrained models have implicitly learned amodal representations (Figure 2), which we can reconfigure to encode object grouping and perform amodal completion. By learning from a synthetically curated dataset containing occluded objects paired with their whole counterparts, we create a conditional diffusion model that, given an RGB image and a point prompt, generates whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.
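As a rough illustration of how (occluded, whole) training pairs can be curated synthetically, the sketch below composites a masked occluder over an image of a whole object; it mirrors the idea in the abstract, not the authors' actual data pipeline, and all names and sizes are hypothetical.

```python
# Build a (conditioning image, diffusion target) pair by pasting an occluder over
# an image of a whole object. Placeholder arrays stand in for real images.
import numpy as np

def make_pair(whole_rgb: np.ndarray, occluder_rgb: np.ndarray,
              occluder_mask: np.ndarray, top: int, left: int):
    """Paste the masked occluder onto the whole-object image at (top, left)."""
    occluded = whole_rgb.copy()
    h, w = occluder_rgb.shape[:2]
    region = occluded[top:top + h, left:left + w]
    m = occluder_mask[..., None].astype(bool)
    region[:] = np.where(m, occluder_rgb, region)
    return occluded, whole_rgb            # (conditioning image, diffusion target)

whole = np.full((256, 256, 3), 200, dtype=np.uint8)   # placeholder "whole object" image
occluder = np.zeros((96, 96, 3), dtype=np.uint8)      # placeholder occluder crop
mask = np.ones((96, 96), dtype=np.uint8)
occluded_img, target_img = make_pair(whole, occluder, mask, top=80, left=80)
```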
Raidar: geneRative AI Detection viA Rewriting
Mao, Chengzhi, Vondrick, Carl, Wang, Hao, Yang, Junfeng
We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dub our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black-box LLMs and is inherently robust to new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.
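A simplified sketch of the rewrite-and-measure idea described above (not the released implementation): rewrite the text with an LLM, then score how much the rewrite changed it. The `llm_rewrite` function is a placeholder for an actual API call, and the threshold is an arbitrary illustrative value.

```python
# Detect AI-generated text by measuring how heavily an LLM rewrites it:
# human-written text tends to be edited more, so a low change ratio is suspicious.
import difflib

def llm_rewrite(text: str) -> str:
    """Placeholder; in practice this would prompt an LLM, e.g. 'Rewrite this: ...'."""
    return text  # stub so the example runs without an API key

def rewrite_change_ratio(text: str) -> float:
    rewritten = llm_rewrite(text)
    sm = difflib.SequenceMatcher(None, text.split(), rewritten.split())
    return 1.0 - sm.ratio()   # 0.0 = unchanged, 1.0 = fully rewritten

def looks_ai_generated(text: str, threshold: float = 0.15) -> bool:
    # The paper's observation: LLMs edit human-written text more heavily,
    # so a *low* change ratio is evidence the input was machine-generated.
    return rewrite_change_ratio(text) < threshold

print(looks_ai_generated("Some passage to test."))
```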
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
Mall, Utkarsh, Phoo, Cheng Perng, Liu, Meilin Kelsey, Vondrick, Carl, Hariharan, Bharath, Bala, Kavita
We introduce a method to train vision-language models for remote-sensing images without using any textual annotations. Our key insight is to use co-located internet imagery taken on the ground as an intermediary for connecting remote-sensing images and language. Specifically, we train an image encoder for remote sensing images to align with the image encoder of CLIP using a large amount of paired internet and satellite images. Our unsupervised approach enables the training of a first-of-its-kind large-scale vision-language model (VLM) for remote sensing images at two different resolutions. We show that these VLMs enable zero-shot, open-vocabulary image classification, retrieval, segmentation and visual question answering for satellite images. On each of these tasks, our VLM trained without textual annotations outperforms existing VLMs trained with supervision, with gains of up to 20% for classification and 80% for segmentation.

Our planet is constantly captured by an extensive array of remote sensors such as satellites or drones. These earth observation images enable the monitoring of various events on the earth such as deforestation, forest fires, and droughts so that rapid actions can be taken to protect our environment. While these images can shed light on various insights about our planet, the scale of such data is huge. This has prompted the development of automatic analysis models that could extract relevant information from a large amount of remotely sensed images. While useful, these models are often specialized and can only recognize a pre-defined set of concepts. Moreover, they can be complex, which limits their accessibility to experts outside the domain of artificial intelligence. Researchers developing automatic analysis methods for internet imagery encountered a similar problem a few years ago. One promising solution is to leverage large-scale vision-language models (VLMs) that are trained on millions or even billions of text-image pairs collected on the internet (Radford et al., 2021; Li et al., 2023). These models have demonstrated remarkable abilities to perform open-vocabulary recognition (Gu et al., 2022; Kuo et al., 2023) and enhance accessibility to non-AI experts (Alayrac et al., 2022; Surís et al., 2023). It would be incredibly valuable for a range of applications to replicate the success of open-vocabulary recognition for satellite images as well, allowing an analyst to simply query, say, "Where are all the farmlands in the state of Massachusetts?" without requiring any new training or annotation for farms.
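A hedged sketch of the alignment idea described above: train a satellite-image encoder so that its embedding of a location matches a frozen CLIP image encoder applied to a co-located ground photo. Both encoders below are tiny stand-ins, and the contrastive objective is an assumption about how such alignment is typically implemented, not the paper's exact training recipe.

```python
# Align a trainable satellite-image encoder to a frozen ground-image encoder using
# co-located image pairs as positives in a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(32, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

sat_encoder = TinyEncoder()                 # trainable
clip_image_encoder = TinyEncoder().eval()   # stand-in for the frozen CLIP image tower
for p in clip_image_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(sat_encoder.parameters(), lr=1e-4)
satellite = torch.randn(16, 3, 224, 224)    # placeholder co-located image pairs
ground = torch.randn(16, 3, 224, 224)

z_sat = sat_encoder(satellite)
with torch.no_grad():
    z_ground = clip_image_encoder(ground)
logits = z_sat @ z_ground.t() / 0.07                       # similarity matrix
loss = F.cross_entropy(logits, torch.arange(len(logits)))  # co-located pairs match
loss.backward()
opt.step()
```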
SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors
Chen, Hongge, Chen, Zhao, Meyer, Gregory P., Park, Dennis, Vondrick, Carl, Shrivastava, Ashish, Chai, Yuning
We present SHIFT3D, a differentiable pipeline for generating 3D shapes that are structurally plausible yet challenging to 3D object detectors. In safety-critical applications like autonomous driving, discovering such novel challenging objects can offer insight into unknown vulnerabilities of 3D detectors. By representing objects with a signed distance function (SDF), we show that gradient error signals allow us to smoothly deform the shape or pose of a 3D object in order to confuse a downstream 3D detector. Importantly, the objects generated by SHIFT3D physically differ from the baseline object yet retain a semantically recognizable shape. Our approach provides interpretable failure modes for modern 3D object detectors, and can aid in preemptive discovery of potential safety risks within 3D perception systems before these risks become critical failures.
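To make the gradient-based deformation idea concrete, here is an illustrative attack loop, not the paper's pipeline: a shape latent is nudged to lower a differentiable detector's confidence while a regularizer keeps it close to the baseline object. The SDF decoder, renderer, and detector are collapsed into a single placeholder function.

```python
# Gradient-based adversarial deformation sketch: lower detector confidence while
# staying near the baseline shape. `detector_confidence` is a stand-in for
# SDF decoding, rendering, and a real 3D detector.
import torch

latent = torch.zeros(1, 64, requires_grad=True)     # stand-in shape/pose parameters
baseline = latent.detach().clone()

def detector_confidence(z):
    """Placeholder for SDF decoding + rendering + a differentiable 3D detector."""
    return torch.sigmoid(z.sum(dim=-1))             # pretend confidence in [0, 1]

opt = torch.optim.Adam([latent], lr=1e-2)
for _ in range(200):
    conf = detector_confidence(latent)
    reg = ((latent - baseline) ** 2).mean()         # stay structurally plausible
    loss = conf.mean() + 10.0 * reg                 # lower confidence, small deformation
    opt.zero_grad()
    loss.backward()
    opt.step()
```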
SURFSUP: Learning Fluid Simulation for Novel Surfaces
Mani, Arjun, Chandratreya, Ishaan Preetam, Creager, Elliot, Vondrick, Carl, Zemel, Richard
Modeling the mechanics of fluid in complex scenes is vital to applications in design, graphics, and robotics. Learning-based methods provide fast and differentiable fluid simulators; however, most prior work is unable to accurately model how fluids interact with genuinely novel surfaces not seen during training. We introduce SURFSUP, a framework that represents objects implicitly using signed distance functions (SDFs), rather than an explicit representation of meshes or particles. This continuous representation of geometry enables more accurate simulation of fluid-object interactions over long time periods while simultaneously making computation more efficient. Moreover, SURFSUP trained on simple shape primitives generalizes considerably out-of-distribution, even to complex real-world scenes and objects. Finally, we show we can invert our model to design simple objects to manipulate fluid flow.
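A small sketch of the implicit-geometry interface implied above: an obstacle is represented by a signed distance function, and a simulator can query the distance and surface normal at any particle position. The sphere SDF and feature usage below are toy assumptions, not the learned simulator itself.

```python
# Query an obstacle's signed distance and normal for a batch of fluid particles.
import numpy as np

def sphere_sdf(p: np.ndarray, center=np.zeros(3), radius=0.5) -> np.ndarray:
    return np.linalg.norm(p - center, axis=-1) - radius

def sdf_normal(p: np.ndarray, sdf, eps=1e-4) -> np.ndarray:
    # Numerical gradient of the SDF gives the outward surface normal direction.
    offsets = np.eye(3) * eps
    grads = np.stack([(sdf(p + o) - sdf(p - o)) / (2 * eps) for o in offsets], axis=-1)
    return grads / np.linalg.norm(grads, axis=-1, keepdims=True)

particles = np.random.uniform(-1, 1, size=(1024, 3))
dist = sphere_sdf(particles)                 # negative inside, positive outside
normals = sdf_normal(particles, sphere_sdf)  # boundary-aware features for the model
inside = particles[dist < 0]                 # e.g. particles needing collision handling
```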
Objaverse-XL: A Universe of 10M+ 3D Objects
Deitke, Matt, Liu, Ruoshi, Wallingford, Matthew, Ngo, Huong, Michel, Oscar, Kusupati, Aditya, Fan, Alan, Laforte, Christian, Voleti, Vikram, Gadre, Samir Yitzhak, VanderBilt, Eli, Kembhavi, Aniruddha, Vondrick, Carl, Gkioxari, Georgia, Ehsani, Kiana, Schmidt, Ludwig, Farhadi, Ali
Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.
Sin3DM: Learning a Diffusion Model from a Single 3D Textured Shape
Wu, Rundi, Liu, Ruoshi, Vondrick, Carl, Zheng, Changxi
Synthesizing novel 3D models that resemble the input example has long been pursued by researchers and artists in computer graphics. In this paper, we present Sin3DM, a diffusion model that learns the internal patch distribution from a single 3D textured shape and generates high-quality variations with fine geometry and texture details. Training a diffusion model directly in 3D would induce large memory and computational cost. Therefore, we first compress the input into a lower-dimensional latent space and then train a diffusion model on it. Specifically, we encode the input 3D textured shape into triplane feature maps that represent the signed distance and texture fields of the input. The denoising network of our diffusion model has a limited receptive field to avoid overfitting, and uses triplane-aware 2D convolution blocks to improve the result quality. Aside from randomly generating new samples, our model also facilitates applications such as retargeting, outpainting and local editing. Through extensive qualitative and quantitative evaluation, we show that our model can generate 3D shapes of various types with better quality than prior methods.
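For readers unfamiliar with triplane representations, here is a hedged sketch of a triplane lookup: a 3D point is projected onto the XY, XZ, and YZ feature planes, each plane is bilinearly sampled, and the summed feature is decoded by a small MLP into signed distance and texture values. Feature sizes and the decoder are illustrative assumptions, not the paper's architecture.

```python
# Query triplane feature maps at 3D points and decode to (sdf, rgb) values.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, R = 32, 128                                     # feature channels, plane resolution
planes = nn.Parameter(torch.randn(3, C, R, R) * 0.01)   # XY, XZ, YZ feature maps
decoder = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, 4))  # (sdf, rgb)

def query(points: torch.Tensor) -> torch.Tensor:
    """points: (P, 3) in [-1, 1]^3 -> (P, 4) predictions."""
    coords = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]  # projections
    feat = 0.0
    for plane, uv in zip(planes, coords):
        grid = uv.view(1, 1, -1, 2)                              # (1, 1, P, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode="bilinear", align_corners=True)
        feat = feat + sampled.squeeze(0).squeeze(1).t()          # (P, C)
    return decoder(feat)

pts = torch.rand(4096, 3) * 2 - 1
sdf_and_rgb = query(pts)                 # column 0: signed distance, 1..3: texture
```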