Hahn, Meera
MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
Yu, Sihyun, Hahn, Meera, Kondratyuk, Dan, Shin, Jinwoo, Gupta, Agrim, Lezama, José, Essa, Irfan, Ross, David, Huang, Jonathan
Diffusion models are successful at synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds); synthesizing sustained footage (e.g., over minutes) remains an open research question. In this paper, we propose MALT Diffusion (Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and performing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT can condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT on popular long video benchmarks, first performing an extensive analysis of its long-context understanding capability and stability. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state of the art of 648.4. Finally, we explore MALT's capabilities in a text-to-video generation setting and show that it can produce long videos competitive with those of recent techniques for long text-to-video generation.
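The segment-level autoregressive scheme described above can be pictured with a minimal sketch: a fixed-size memory latent is updated after each generated segment and conditions the denoiser for the next one. All class and function names below (e.g., RecurrentMemoryLayer, denoiser.sample) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of MALT-style segment-level autoregressive generation.
import torch
import torch.nn as nn

class RecurrentMemoryLayer(nn.Module):
    """Cross-attends a fixed-size memory latent over the tokens of one segment."""
    def __init__(self, dim=512, num_mem_tokens=64, num_heads=8):
        super().__init__()
        self.memory_init = nn.Parameter(torch.randn(num_mem_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def init_memory(self, batch_size):
        return self.memory_init.unsqueeze(0).expand(batch_size, -1, -1)

    def forward(self, memory, segment_tokens):
        # memory: (B, M, D), segment_tokens: (B, T, D)
        ctx = torch.cat([memory, segment_tokens], dim=1)
        updated, _ = self.attn(query=memory, key=ctx, value=ctx)
        return self.norm(memory + updated)  # new compact memory latent


def generate_long_video(denoiser, memory_layer, encode, num_segments, batch_size=1):
    """Segment-level autoregression: denoise each segment conditioned on the
    memory latent, then fold the finished segment back into that memory."""
    memory = memory_layer.init_memory(batch_size)
    segments = []
    for _ in range(num_segments):
        segment = denoiser.sample(condition=memory)     # short-clip diffusion sampling (assumed API)
        memory = memory_layer(memory, encode(segment))  # update the long-term context
        segments.append(segment)
    return torch.cat(segments, dim=1)  # concatenate along the time axis
```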
Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
Hahn, Meera, Zeng, Wenjun, Kannen, Nithish, Galt, Rich, Badola, Kartikeya, Kim, Been, Wang, Zi
User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents: one holds a ground-truth image, while the other tries to ask as few questions as possible to align with that ground truth. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information, achieving at least 2 times higher VQAScore (Lin et al., 2024) than standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
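The two-agent automated evaluation can be summarized with a short, self-contained sketch: an oracle holding the ground-truth description answers the proactive agent's clarification questions until the agent stops asking, after which the generated image is scored. The callables (ask_question, answer_question, generate_image, vqa_score) are hypothetical stand-ins for the agent's LLM calls, the T2I model, and the VQAScore metric, not the released demo code.

```python
# Illustrative sketch (not the released code) of the two-agent evaluation loop.
from typing import Any, Callable, Optional

def evaluate_proactive_agent(
    initial_prompt: str,
    ground_truth: str,
    ask_question: Callable[[dict], Optional[str]],   # returns None once confident
    answer_question: Callable[[str, str], str],      # oracle with the ground truth
    generate_image: Callable[[dict], Any],           # T2I model call (assumed)
    vqa_score: Callable[[Any, str], float],          # alignment metric (assumed)
    max_turns: int = 5,
) -> float:
    belief = {"prompt": initial_prompt, "answers": {}}   # flattened belief graph
    for _ in range(max_turns):
        question = ask_question(belief)
        if question is None:                             # agent stops asking questions
            break
        belief["answers"][question] = answer_question(question, ground_truth)
    image = generate_image(belief)                       # generate from the enriched intent
    return vqa_score(image, ground_truth)
```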
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Kondratyuk, Dan, Yu, Lijun, Gu, Xiuye, Lezama, José, Huang, Jonathan, Hornung, Rachel, Adam, Hartwig, Akbari, Hassan, Alon, Yair, Birodkar, Vighnesh, Cheng, Yong, Chiu, Ming-Chang, Dillon, Josh, Essa, Irfan, Gupta, Agrim, Hahn, Meera, Hauth, Anja, Hendon, David, Martinez, Alonso, Minnen, David, Ross, David, Schindler, Grant, Sirotenko, Mikhail, Sohn, Kihyuk, Somandepalli, Krishna, Wang, Huisheng, Yan, Jimmy, Yang, Ming-Hsuan, Yang, Xuan, Seybold, Bryan, Jiang, Lu
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
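As a rough illustration of the decoder-only, multimodal next-token setup the abstract describes, the sketch below concatenates tokenized modalities into a single sequence and applies a standard autoregressive loss. The shared-vocabulary assumption and the lm interface are placeholders, not VideoPoet's actual tokenizers or model.

```python
# Hypothetical sketch of multimodal next-token training with a decoder-only LM.
import torch
import torch.nn.functional as F

def multimodal_lm_loss(lm, text_tokens, image_tokens, video_tokens, audio_tokens):
    # lm maps a token sequence (B, L) to logits (B, L, vocab_size); all token
    # streams are assumed to share one discrete vocabulary with modality offsets.
    seq = torch.cat([text_tokens, image_tokens, video_tokens, audio_tokens], dim=1)
    logits = lm(seq[:, :-1])                       # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1)
    )
```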
Photorealistic Video Generation with Diffusion Models
Gupta, Agrim, Yu, Lijun, Sohn, Kihyuk, Gu, Xiuye, Hahn, Meera, Fei-Fei, Li, Essa, Irfan, Jiang, Lu, Lezama, José
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
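The window attention idea can be sketched as attention restricted to local spatiotemporal windows of the token grid. The window sizes and the reshaping below are illustrative assumptions and do not reproduce W.A.L.T's exact layer.

```python
# Rough sketch of window-restricted attention over a (T, H, W) token grid.
import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    def __init__(self, dim=512, num_heads=8, window=(4, 8, 8)):
        super().__init__()
        self.window = window                        # (t, h, w) window size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, D); T, H, W must be divisible by the window sizes.
        B, T, H, W, D = x.shape
        wt, wh, ww = self.window
        x = x.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, D)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)        # group tokens by window
        x = x.reshape(-1, wt * wh * ww, D)           # (B * num_windows, tokens, D)
        out, _ = self.attn(x, x, x)                  # attention stays within one window
        out = out.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, D)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, D)
        return out
```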
Which way is `right'?: Uncovering limitations of Vision-and-Language Navigation models
Hahn, Meera, Raj, Amit, Rehg, James M.
The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g., `walk down the hallway and turn left at the piano'). To complete this task successfully, agents must be able to ground objects referenced in the instruction (e.g., `piano') in the visual scene, as well as ground directional phrases (e.g., `turn left') into actions. In this work we ask the following question: to what degree are spatial and directional language cues informing the navigation model's decisions? We propose a series of simple masking experiments to inspect the model's reliance on different parts of the instruction. Surprisingly, we uncover that certain top-performing models rely only on the noun tokens of the instructions. We propose two training methods to alleviate this concerning limitation.
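A minimal sketch of the kind of masking probe described above: keep only the noun tokens of an instruction, mask everything else, and compare the agent's success rate on masked versus original instructions. The NLTK-based tagging and the [MASK] token are assumptions for illustration; the paper's exact masking protocol may differ.

```python
# Toy instruction-masking probe for a trained VLN agent.
import nltk

def mask_all_but_nouns(instruction: str, mask_token: str = "[MASK]") -> str:
    # Requires the NLTK tokenizer and POS-tagger resources to be downloaded.
    tagged = nltk.pos_tag(nltk.word_tokenize(instruction))
    kept = [tok if tag.startswith("NN") else mask_token for tok, tag in tagged]
    return " ".join(kept)

# If success rate barely drops under this masking, the agent is effectively
# ignoring directional phrases such as "turn left" and relying on nouns alone.
masked = mask_all_but_nouns("walk down the hallway and turn left at the piano")
# e.g. keeps content nouns like "hallway" and "piano", masks the rest
```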
SiRoK: Situated Robot Knowledge - Understanding the Balance Between Situated Knowledge and Variability
Daruna, Angel Andres, Chu, Vivian, Liu, Weiyu, Hahn, Meera, Khante, Priyanka, Chernova, Sonia, Thomaz, Andrea
General-purpose robots operating in a variety of environments, such as homes or hospitals, require a way to integrate abstract knowledge that is generalizable across domains with local, domain-specific observations. In this work, we examine different types and sources of data, with the goal of understanding how locally observed data and abstract knowledge might be fused. We introduce the Situated Robot Knowledge (SiRoK) framework, which integrates probabilistic abstract knowledge with semantic memory of the local environment. In a series of robot and simulation experiments, we examine the tradeoffs in the reliability and generalization of both data sources. Our robot experiments show that the variability of object properties and locations in our knowledge base is indicative of the time it takes to generalize a concept and of its validity in the real world. The results of our simulations corroborate those of our robot experiments and give us insights into which source of knowledge to use for 31 object classes that exist in the real world.
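A toy example of the fusion question SiRoK studies, combining a cross-domain prior over object locations (abstract knowledge) with counts observed in the local environment (semantic memory). The Dirichlet-style weighting below is only an illustration of the tradeoff, not the framework's actual inference.

```python
# Toy fusion of abstract knowledge with situated observations.

def fuse_location_belief(prior: dict, local_counts: dict, prior_strength: float = 5.0):
    """Return a posterior distribution over locations for one object class."""
    locations = set(prior) | set(local_counts)
    weights = {
        loc: prior_strength * prior.get(loc, 0.0) + local_counts.get(loc, 0)
        for loc in locations
    }
    total = sum(weights.values()) or 1.0
    return {loc: w / total for loc, w in weights.items()}

# A high-variability object needs more local observations before its posterior
# departs from the abstract prior than a low-variability one does.
belief = fuse_location_belief(
    prior={"kitchen": 0.7, "office": 0.3},      # abstract, cross-domain knowledge
    local_counts={"office": 4, "kitchen": 1},   # situated observations in this home
)
```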