Collaborating Authors

 Bai, Yutong


AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?

arXiv.org Artificial Intelligence

Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can truly understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer answers correctly, models must effectively leverage clues from both the visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.
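
As a rough illustration of the two DeafTest-style checks, the sketch below compares two mono waveforms by RMS energy (loudness) and by dominant FFT frequency (pitch). This is a minimal stand-in, assuming NumPy arrays sampled at 16 kHz; the benchmark's own audio processing and question format are not reproduced here.

```python
import numpy as np

def louder(wave_a: np.ndarray, wave_b: np.ndarray) -> str:
    """Return which of two mono waveforms has the higher RMS energy."""
    rms_a = np.sqrt(np.mean(wave_a ** 2))
    rms_b = np.sqrt(np.mean(wave_b ** 2))
    return "A" if rms_a > rms_b else "B"

def higher_pitch(wave_a: np.ndarray, wave_b: np.ndarray, sr: int = 16000) -> str:
    """Return which waveform has the higher dominant frequency (FFT peak)."""
    def peak_freq(wave: np.ndarray) -> float:
        spectrum = np.abs(np.fft.rfft(wave))
        freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
        return freqs[np.argmax(spectrum)]
    return "A" if peak_freq(wave_a) > peak_freq(wave_b) else "B"

# Toy example: a quiet 220 Hz tone vs. a louder 440 Hz tone, 1 s at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
tone_a = 0.3 * np.sin(2 * np.pi * 220 * t)
tone_b = 0.8 * np.sin(2 * np.pi * 440 * t)
print(louder(tone_a, tone_b))        # "B"
print(higher_pitch(tone_a, tone_b))  # "B"
```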


Analyzing The Language of Visual Tokens

arXiv.org Artificial Intelligence

With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages: whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities as well as fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization than natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models do, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we show how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.
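
To make the kind of statistics discussed here concrete, the sketch below estimates a Zipf exponent from the rank-frequency curve of visual token counts and computes the Shannon entropy of the token distribution. It is a minimal illustration assuming a flat list of integer token ids from some image tokenizer, not the paper's actual analysis pipeline.

```python
import numpy as np
from collections import Counter

def zipf_slope_and_entropy(token_ids):
    """Rank-frequency slope on a log-log scale (Zipf exponent) and Shannon entropy in bits."""
    counts = np.array(sorted(Counter(token_ids).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    # Least-squares fit of log(freq) vs. log(rank); a slope near -1 suggests a Zipfian curve.
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    probs = counts / counts.sum()
    entropy = -np.sum(probs * np.log2(probs))
    return slope, entropy

# Toy example: tokens drawn from a power-law-like distribution over a 1,024-entry codebook.
rng = np.random.default_rng(0)
vocab = np.arange(1024)
weights = 1.0 / (vocab + 1)
tokens = rng.choice(vocab, size=100_000, p=weights / weights.sum())
print(zipf_slope_and_entropy(tokens))
```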


LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

arXiv.org Artificial Intelligence

Recently, instruction-tuned Large Multimodal Models (LMMs), such as InstructBLIP [1], Instruct-GPT [2], LLaVA [3, 4], PALM [5], and others have demonstrated state-of-the-art performance on a variety of vision-and-language tasks. However, existing LMMs for robotics [6, 7, 8, 9] do not always demonstrate the same success and consistency across various embodied settings. This may result from the unique challenges encountered in robotics, such as the variability of real-world environments, the differences between robots, and the need to control actions reliably. Since LMMs owe part of their success to multimodal instruction tuning, it is natural to leverage this technique in a robotics setting as well. Here, we propose a vision-action instruction tuning method that bridges the gap between a language model's fundamental pre-training objective (next-word prediction) and the goal of enabling the model to handle various robotics settings. In this work, we introduce our Large LAnguage model for Robotic Vision and Action (LLARVA), an open-source instruction-tuned LMM for robotic applications that can generalize efficiently across various environments and robotic configurations. Our key idea is the formulation of a novel instruction prompt that encapsulates robot type, task, scene configuration, and control regime in a natural-language prefix amenable to contemporary LMMs.
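
As an illustration of how such a prefix might be assembled, the sketch below folds robot type, control regime, task, and scene description into a single natural-language prompt string. The field names and template are hypothetical; LLARVA's actual prompt schema may differ.

```python
from dataclasses import dataclass

@dataclass
class RobotContext:
    # Hypothetical fields; LLARVA's actual prompt schema may differ.
    robot_type: str      # e.g. "Franka Panda"
    control_regime: str  # e.g. "7-DoF end-effector delta pose"
    task: str            # e.g. "pick up the red block"
    scene: str           # e.g. "tabletop with three blocks and a bowl"

def build_instruction_prefix(ctx: RobotContext) -> str:
    """Fold robot type, control regime, task, and scene into a natural-language prefix."""
    return (
        f"You are controlling a {ctx.robot_type} robot using {ctx.control_regime}. "
        f"The scene contains: {ctx.scene}. "
        f"Task: {ctx.task}. Predict the next action."
    )

prompt = build_instruction_prefix(RobotContext(
    robot_type="Franka Panda",
    control_regime="7-DoF end-effector delta pose",
    task="pick up the red block",
    scene="tabletop with three blocks and a bowl",
))
print(prompt)
```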