AITopics | Duggal, Shivam

Duggal, Shivam

Adaptive Length Image Tokenization via Recurrent Allocation

Duggal, Shivam, Isola, Phillip, Torralba, Antonio, Freeman, William T.

arXiv.org Artificial Intelligence, Nov-4-2024

Current vision systems typically assign fixed-length representations to images, regardless of the information content. This contrasts with human intelligence, and even large language models, which allocate varying representational capacities based on entropy, context, and familiarity. Inspired by this, we propose an approach to learn variable-length token representations for 2D images. Our encoder-decoder architecture recursively processes 2D image tokens, distilling them into 1D latent tokens over multiple iterations of recurrent rollouts. Each iteration refines the 2D tokens, updates the existing 1D latent tokens, and adaptively increases representational capacity by adding new tokens. This enables compression of images into a variable number of tokens, ranging from 32 to 256. Recurrent token processing with increasing representational capacity in each iteration shows signs of token specialization, revealing potential for object / part discovery.

Representation learning (Bengio et al., 2013), which involves extracting meaningful and useful information from input observations, is crucial for decision-making. An effective representation should be compact while encoding all relevant information. However, what constitutes "relevant" information varies based on the specific task; for example, a coarse classification task may require a different latent representation compression factor for satisfactory performance compared to a task demanding perfect pixel-level reconstruction, which necessitates denser representations. Similarly, language models can describe content at various levels of abstraction depending on complexity, context (Graves, 2016; Dehghani et al., 2018), and familiarity (Baevski & Auli, 2018). In contrast, most current visual systems, such as VAEs, VQGANs, and ViTs (Kingma & Welling, 2022; Esser et al., 2020; Dosovitskiy et al., 2020), generate fixed-size representations for all images. In this work, we take a step toward learning adaptive and variable-length visual representations, emphasizing that each image requires a different representation capacity (see Sec. 4).
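
The idea of growing the token budget until an image is represented well enough can be illustrated with a minimal sketch. This is not the authors' architecture: the step of 32 tokens per iteration, the reconstruction-error stopping rule, and the names `token_schedule`, `allocate_tokens`, and `toy_error` are all assumptions introduced here for illustration. Only the 32 to 256 token range comes from the abstract.

```python
from typing import Callable, List


def token_schedule(min_tokens: int = 32, max_tokens: int = 256, step: int = 32) -> List[int]:
    """Token budgets tried in order; the step of 32 is an assumption."""
    return list(range(min_tokens, max_tokens + 1, step))


def allocate_tokens(
    reconstruction_error: Callable[[int], float],
    threshold: float,
    min_tokens: int = 32,
    max_tokens: int = 256,
    step: int = 32,
) -> int:
    """Return the smallest budget whose (hypothetical) error meets the threshold.

    `reconstruction_error(n)` stands in for encoding an image with n tokens,
    decoding it, and measuring the loss. If no budget is good enough,
    the largest budget is returned.
    """
    budgets = token_schedule(min_tokens, max_tokens, step)
    for n in budgets:
        if reconstruction_error(n) <= threshold:
            return n
    return budgets[-1]


def toy_error(complexity: float) -> Callable[[int], float]:
    """Hypothetical error model: error falls as 1/n and scales with complexity."""
    return lambda n: complexity / n


if __name__ == "__main__":
    for complexity in (1.0, 6.0, 100.0):
        n = allocate_tokens(toy_error(complexity), threshold=0.05)
        print(f"complexity={complexity}: {n} tokens")
```

Under this toy error model, a simple image stops at the smallest budget while a complex one runs through the full schedule, which mirrors the claim that each image requires a different representation capacity.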

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2411.02393

Genre: Research Report (0.42)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

A Vision Check-up for Language Models

Sharma, Pratyusha, Shaham, Tamar Rott, Baradad, Manel, Fu, Stephanie, Rodriguez-Munoz, Adrian, Duggal, Shivam, Isola, Phillip, Torralba, Antonio

arXiv.org Artificial Intelligence, Jan-3-2024

What does learning to model relationships between strings teach large language models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels, we use code to represent images in our study. Although LLM-generated images do not look like natural images, results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore, experiments on self-supervised visual representation learning, utilizing images generated with text models, highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2401.01862

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Your Diffusion Model is Secretly a Zero-Shot Classifier

Li, Alexander C., Prabhudesai, Mihir, Duggal, Shivam, Brown, Ellis, Pathak, Deepak

arXiv.org Artificial Intelligence, Sep-12-2023

The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation. In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. Although a gap remains between generative and discriminative approaches on zero-shot recognition tasks, our diffusion-based approach has significantly stronger multimodal compositional reasoning ability than competing discriminative approaches. Finally, we use Diffusion Classifier to extract standard classifiers from class-conditional diffusion models trained on ImageNet. Our models achieve strong classification performance using only weak augmentations and exhibit qualitatively better "effective robustness" to distribution shift. Overall, our results are a step toward using generative over discriminative models for downstream tasks. Results and visualizations at https://diffusion-classifier.github.io/
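
The step from conditional density estimates to classification is Bayes' rule: p(c | x) is proportional to p(x | c) p(c). The sketch below shows only that step, assuming per-class log-density estimates are already available. A 1D Gaussian stands in for the diffusion model's estimate, and the names `gaussian_log_density` and `posterior_from_log_densities`, the class labels, and the means are hypothetical. The paper's actual estimator is not reproduced here.

```python
import math
from typing import Dict, Optional


def gaussian_log_density(x: float, mean: float, std: float = 1.0) -> float:
    """Stand-in for a class-conditional log p(x | c); a toy assumption."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def posterior_from_log_densities(
    log_px_given_c: Dict[str, float],
    log_prior: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Apply Bayes' rule: p(c | x) is proportional to p(x | c) * p(c)."""
    labels = list(log_px_given_c)
    if log_prior is None:
        # Uniform prior over classes when none is given.
        log_prior = {c: -math.log(len(labels)) for c in labels}
    scores = {c: log_px_given_c[c] + log_prior[c] for c in labels}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(unnormalized.values())
    return {c: u / total for c, u in unnormalized.items()}


if __name__ == "__main__":
    x = 0.9
    class_means = {"cat": 1.0, "dog": -1.0}  # hypothetical classes
    log_px = {c: gaussian_log_density(x, m) for c, m in class_means.items()}
    posterior = posterior_from_log_densities(log_px)
    print(max(posterior, key=posterior.get), posterior)
```

No classifier is trained in this sketch: the predicted label is simply the class under which the input is most likely, which is the sense in which a generative model can classify without additional training.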

artificial intelligence, machine learning, zero-shot classifier, (2 more...)

arXiv.org Artificial Intelligence

2303.16203

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.80)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.73)

GeoSim: Photorealistic Image Simulation with Geometry-Aware Composition

Chen, Yun, Rong, Frieda, Duggal, Shivam, Wang, Shenlong, Yan, Xinchen, Manivasagam, Sivabalan, Xue, Shangjie, Yumer, Ersin, Urtasun, Raquel

arXiv.org Artificial Intelligence, Jan-16-2021

Scalable sensor simulation is an important yet challenging open problem for safety-critical domains such as self-driving. Current work in image simulation either fails to be photorealistic or does not model the 3D environment and the dynamic objects within, losing high-level control and physical realism. In this paper, we present GeoSim, a geometry-aware image composition process that synthesizes novel urban driving scenes by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses. Towards this goal, we first build a diverse bank of 3D objects with both realistic geometry and appearance from sensor data. During simulation, we perform a novel geometry-aware simulation-by-composition procedure which 1) proposes plausible and realistic object placements into a given scene, 2) renders novel views of dynamic objects from the asset bank, and 3) composes and blends the rendered image segments. The resulting synthetic images are photorealistic, traffic-aware, and geometrically consistent, allowing image simulation to scale to complex use cases. We demonstrate two such important applications: long-range realistic video simulation across multiple camera sensors, and synthetic data generation for data augmentation on downstream segmentation tasks.
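
The final step, composing a rendered segment into an existing image, can be illustrated with standard per-pixel alpha compositing. This is a generic technique swapped in for illustration, not GeoSim's blending procedure, which the abstract does not specify. The function name `alpha_composite` and the grayscale nested-list image format are assumptions.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1]


def alpha_composite(background: Image, segment: Image, alpha: Image) -> Image:
    """Blend `segment` over `background` using a per-pixel alpha mask.

    Each output pixel is alpha * segment + (1 - alpha) * background, so
    alpha = 1 keeps the rendered object and alpha = 0 keeps the scene.
    """
    if not (len(background) == len(segment) == len(alpha)):
        raise ValueError("all inputs must have the same number of rows")
    out: Image = []
    for bg_row, fg_row, a_row in zip(background, segment, alpha):
        if not (len(bg_row) == len(fg_row) == len(a_row)):
            raise ValueError("all inputs must have the same row length")
        out.append(
            [a * fg + (1.0 - a) * bg for bg, fg, a in zip(bg_row, fg_row, a_row)]
        )
    return out


if __name__ == "__main__":
    scene = [[0.0, 0.0], [0.0, 0.0]]      # dark background
    rendered = [[1.0, 1.0], [1.0, 1.0]]   # bright rendered object
    mask = [[1.0, 0.5], [0.0, 0.25]]      # soft object boundary
    print(alpha_composite(scene, rendered, mask))
```

Fractional alpha values along the object boundary soften the seam between the inserted object and the scene.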

geosim, ground transportation, neural network, (18 more...)

arXiv.org Artificial Intelligence

2101.06543

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report (0.82)

Industry:

Transportation (0.68)
Media (0.48)
Leisure & Entertainment (0.48)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.67)
