Gao, Ruohan
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs
Chowdhury, Sanjoy, Gani, Hanan, Anand, Nishit, Nag, Sayan, Gao, Ruohan, Elhoseiny, Mohamed, Khan, Salman, Manocha, Dinesh
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic-based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into audio-visual LLMs (AVLLMs) at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4,500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released at: https://github.com/schowdhury671/aurelia.
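To make the test-time loop above concrete, here is a minimal sketch of an actor-critic reasoning-distillation routine; `actor`, `critic`, and `avllm` are hypothetical callables standing in for the paper's components, not the released AURELIA API.

```python
# Minimal sketch of test-time reasoning distillation with an actor-critic loop.
# `actor`, `critic`, and `avllm` are hypothetical callables, not the AURELIA API.

def distill_and_answer(actor, critic, avllm, video, audio, question,
                       max_rounds=3, accept_threshold=0.8):
    """Refine step-by-step reasoning at inference time, then condition the AVLLM on it."""
    reasoning = actor(video, audio, question)                   # propose reasoning steps
    for _ in range(max_rounds):
        score, feedback = critic(video, audio, question, reasoning)
        if score >= accept_threshold:                           # critic is satisfied
            break
        reasoning = actor(video, audio, question, feedback)     # revise using the critique
    prompt = f"{question}\n\nReasoning steps:\n{reasoning}\n\nAnswer:"
    return avllm(video, audio, prompt)                          # no training or fine-tuning
```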
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Chowdhury, Sanjoy, Nag, Sayan, Dasgupta, Subhrajyoti, Wang, Yaoting, Elhoseiny, Mohamed, Gao, Ruohan, Manocha, Dinesh
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks primarily assess the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, no existing benchmarks investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate these limitations, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, obtaining a gain of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.
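For intuition only, the preference-optimization component can be illustrated with a standard DPO-style objective on (chosen, rejected) response pairs; the actual CAVPref loss adds calibration across audio, visual, and audio-visual conditions that this sketch omits.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                    beta=0.1):
    """Generic DPO-style preference loss over log-probabilities of chosen vs.
    rejected responses under the policy and a frozen reference model.
    CAVPref builds on this idea but adds modality-aware calibration terms."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with dummy log-probabilities for a batch of 4 preference pairs.
loss = preference_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```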
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Chowdhury, Sanjoy, Nag, Sayan, Dasgupta, Subhrajyoti, Chen, Jun, Elhoseiny, Mohamed, Gao, Ruohan, Manocha, Dinesh
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has mostly focused on tasks that require only a coarse-grained understanding of audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate AVFIT, a large dataset comprising 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
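As a rough illustration of optimal-transport-based modality alignment (a generic sketch, not Meerkat's implementation; tensor names and sizes are assumptions), the snippet below computes an entropic OT plan between image and audio token features with Sinkhorn iterations.

```python
import torch

def sinkhorn_alignment(img_tokens, audio_tokens, eps=0.05, n_iters=50):
    """Entropic optimal-transport plan between image tokens (N x D) and audio
    tokens (M x D); soft correspondences of this kind can be used to align the
    two modalities before fusion."""
    cost = torch.cdist(img_tokens, audio_tokens)              # pairwise feature distances
    K = torch.exp(-cost / eps)                                # Gibbs kernel
    u = torch.full((img_tokens.size(0),), 1.0 / img_tokens.size(0))    # uniform marginals
    v = torch.full((audio_tokens.size(0),), 1.0 / audio_tokens.size(0))
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                                  # Sinkhorn scaling updates
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)                # transport plan (N x M)

# Toy usage: 16 image tokens and 10 audio tokens with 64-dim features.
plan = sinkhorn_alignment(torch.randn(16, 64), torch.randn(10, 64))
```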
Hearing Anything Anywhere
Wang, Mason, Sawata, Ryosuke, Clarke, Samuel, Gao, Ruohan, Wu, Shangzhe, Wu, Jiajun
Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.
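The basic rendering step such a framework relies on is convolving dry source audio with a room impulse response; the sketch below shows that operation with SciPy on synthetic data, while the DiffRIR-specific part (fitting the parametric acoustic model to the roughly 12 recorded RIRs) is only indicated in the final comment.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_at_listener(source_audio, rir):
    """Simulate what a listener hears: convolve dry source audio with the room
    impulse response predicted (or measured) for that listener position."""
    return fftconvolve(source_audio, rir, mode="full")

# Toy usage: one second of noise rendered through a decaying synthetic RIR.
fs = 16000
source = np.random.randn(fs)
rir = np.exp(-np.linspace(0.0, 8.0, fs // 4)) * np.random.randn(fs // 4)
wet = render_at_listener(source, rir)
# In DiffRIR, the RIR itself comes from a differentiable parametric model
# (source directivity, surface reflectivity, ...) optimized against recorded RIRs.
```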
NOIR: Neural Signal Operated Intelligent Robots for Everyday Activities
Zhang, Ruohan, Lee, Sharon, Hwang, Minjune, Hiranaka, Ayano, Wang, Chen, Ai, Wensi, Tan, Jin Jie Ryan, Gupta, Shreya, Hao, Yilun, Levine, Gabrael, Gao, Ruohan, Norcia, Anthony, Fei-Fei, Li, Wu, Jiajun
We present Neural Signal Operated Intelligent Robots (NOIR), a general-purpose, intelligent brain-robot interface system that enables humans to command robots to perform everyday activities through brain signals. Through this interface, humans communicate their intended objects of interest and actions to the robots using electroencephalography (EEG). Our novel system demonstrates success in an expansive array of 20 challenging, everyday household activities, including cooking, cleaning, personal care, and entertainment. The effectiveness of the system is improved by its synergistic integration of robot learning algorithms, allowing NOIR to adapt to individual users and predict their intentions. Our work enhances the way humans interact with robots, replacing traditional channels of interaction with direct, neural communication. Project website: https://noir-corl.github.io/.
Sonicverse: A Multisensory Simulation Platform for Embodied Household Agents that See and Hear
Gao, Ruohan, Li, Hao, Dharan, Gokul, Wang, Zhuzhu, Li, Chengshu, Xia, Fei, Savarese, Silvio, Fei-Fei, Li, Wu, Jiajun
Developing embodied agents in simulation has been a key research topic in recent years. Exciting new tasks, algorithms, and benchmarks have been developed in various simulators. However, most of them assume deaf agents in silent environments, while we humans perceive the world with multiple senses. We introduce Sonicverse, a multisensory simulation platform with integrated audio-visual simulation for training household agents that can both see and hear. Sonicverse models realistic continuous audio rendering in 3D environments in real-time. Together with a new audio-visual VR interface that allows humans to interact with agents with audio, Sonicverse enables a series of embodied AI tasks that need audio-visual perception. For semantic audio-visual navigation in particular, we also propose a new multi-task learning model that achieves state-of-the-art performance. In addition, we demonstrate Sonicverse's realism via sim-to-real transfer, which has not been achieved by other simulators: an agent trained in Sonicverse can successfully perform audio-visual navigation in real-world environments. Sonicverse is available at: https://github.com/StanfordVL/Sonicverse.
The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects
Gao, Ruohan, Dou, Yiming, Li, Hao, Agarwal, Tanmay, Bohg, Jeannette, Li, Yunzhu, Fei-Fei, Li, Wu, Jiajun
We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. We conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu
An Extensible Multimodal Multi-task Object Dataset with Materials
Standley, Trevor, Gao, Ruohan, Chen, Dawn, Wu, Jiajun, Savarese, Silvio
We present EMMa, an Extensible, Multimodal dataset of Amazon product listings that contains rich Material annotations. It contains more than 2.8 million objects, each with image(s), listing text, mass, price, product ratings, and position in Amazon's product-category taxonomy. Objects are annotated with one or more materials from a materials taxonomy. With the numerous attributes available for each object, we develop a Smart Labeling framework to quickly add new binary labels to all objects with very little manual labeling effort, making the dataset extensible. Each object attribute in our dataset can be included in either the model inputs or outputs, leading to combinatorial possibilities in task configurations. For example, we can train a model to predict the object category from the listing text, or the mass and price from the product listing image. EMMa offers a new benchmark for multi-task learning in computer vision and NLP, and allows practitioners to efficiently add new tasks and object attributes at scale.

Perhaps the biggest problem faced by machine learning practitioners today is that of producing labeled datasets for their specific needs. Manually labeling large amounts of data is time-consuming and costly (Deng et al., 2009; Lin et al., 2014; Kuznetsova et al., 2020). Furthermore, it is often not possible to communicate to the human annotators we typically rely on how numerous ambiguous corner cases should be handled (e.g., is a hole puncher "sharp"?). Could we solve this problem with the aid of machine learning? We hypothesized that, given a rich dataset with substantial information about every instance, we could accurately add new properties to every instance in a semi-automated fashion. Consequently, we developed EMMa, a large, object-centric, multimodal, and multi-task dataset. We show that EMMa can be easily extended to contain any number of new object labels using a Smart Labeling technique we developed for large multi-task and multimodal datasets. Multi-task datasets contain labels for more than one attribute for each instance, whereas multimodal datasets contain data from more than one modality, such as images, text, audio, and tabular data. Derived from Amazon product listings, EMMa contains images, text, and a number of useful attributes, such as materials, mass, price, product category, and product ratings. Each attribute can be used as either a model input or a model output.
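To make the "any attribute can be a model input or output" idea concrete, here is a small sketch of building a task configuration from one record; the field names are illustrative assumptions, not the released EMMa schema.

```python
# Illustrative EMMa-style record; field names are assumptions, not the real schema.
record = {
    "image": "img_000123.jpg",
    "text": "Stainless steel hole puncher, 20-sheet capacity",
    "mass_kg": 0.41,
    "price_usd": 12.99,
    "category": "Office Products > Hole Punches",
    "materials": ["metal", "plastic"],
}

def make_example(record, input_keys, target_keys):
    """Split one record into (inputs, targets) for a chosen task configuration."""
    return ({k: record[k] for k in input_keys},
            {k: record[k] for k in target_keys})

# e.g., predict mass and price from the listing image and text:
inputs, targets = make_example(record, ["image", "text"], ["mass_kg", "price_usd"])
```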
Differentiable Physics Simulation of Dynamics-Augmented Neural Objects
Le Cleac'h, Simon, Yu, Hong-Xing, Guo, Michelle, Howell, Taylor A., Gao, Ruohan, Wu, Jiajun, Manchester, Zachary, Schwager, Mac
We present a differentiable pipeline for simulating the motion of objects that represent their geometry as a continuous density field parameterized as a deep network. This includes Neural Radiance Fields (NeRFs) and other related models. From the density field, we estimate the dynamical properties of the object, including its mass, center of mass, and inertia matrix. We then introduce a differentiable contact model based on the density field for computing normal and friction forces resulting from collisions. This allows a robot to autonomously build object models that are visually and dynamically accurate from still images and videos of objects in motion. The resulting Dynamics-Augmented Neural Objects (DANOs) are simulated with an existing differentiable simulation engine, Dojo, interacting with other standard simulation objects, such as spheres, planes, and robots specified as URDFs. A robot can use this simulation to optimize grasps and manipulation trajectories of neural objects, or to improve the neural object models through gradient-based real-to-simulation transfer. We demonstrate the pipeline by learning the coefficient of friction of a bar of soap from a real video of the soap sliding on a table. We also learn the coefficient of friction and mass of a Stanford bunny through interactions with a Panda robot arm from synthetic data, and we optimize trajectories in simulation for the Panda arm to push the bunny to a goal location.
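The estimation of mass, center of mass, and the inertia matrix from a density field can be sketched directly; the NumPy snippet below integrates a user-supplied `density_fn` over a regular grid, as a simplified stand-in for the paper's procedure.

```python
import numpy as np

def mass_properties(density_fn, bounds, n=64):
    """Estimate mass, center of mass, and inertia matrix of an object whose
    geometry is a continuous density field (e.g., a NeRF-style network).
    `density_fn` maps an (N, 3) array of points to an (N,) array of densities."""
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    axes = [np.linspace(lo[i], hi[i], n) for i in range(3)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    rho = density_fn(pts)                                  # density at each grid sample
    dV = np.prod((hi - lo) / (n - 1))                      # volume of one grid cell
    mass = rho.sum() * dV
    com = (rho[:, None] * pts).sum(axis=0) * dV / mass     # density-weighted centroid
    r = pts - com
    r2 = (r ** 2).sum(axis=1)
    # Inertia tensor about the COM: I = sum_i rho_i * (|r_i|^2 * Id - r_i r_i^T) * dV
    outer = r[:, :, None] * r[:, None, :]
    I = dV * np.einsum("n,nij->ij", rho, r2[:, None, None] * np.eye(3) - outer)
    return mass, com, I

# Toy usage: a uniform unit ball centered at the origin.
mass, com, inertia = mass_properties(
    lambda p: (np.linalg.norm(p, axis=1) < 1.0).astype(float),
    bounds=([-1, -1, -1], [1, 1, 1]))
```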
See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation
Li, Hao, Zhang, Yizhi, Zhu, Junzhe, Wang, Shaoxiong, Lee, Michelle A, Xu, Huazhe, Adelson, Edward, Fei-Fei, Li, Gao, Ruohan, Wu, Jiajun
Imagine you are savoring tea in a peaceful Zen garden: a robot sees your empty cup and starts pouring, hears the increase of the sound pitch as the water level rises in the cup, and feels with its fingers around the handle of the teapot to tell how much tea is left and control the pouring speed. For both humans and robots, multisensory perception with vision, audio, and touch plays a crucial role in everyday tasks: vision reliably captures the global setup, audio sends immediate alerts even for occluded events, and touch provides precise local geometry of objects that reveals their status. Though exciting progress has been made on teaching robots to tackle various tasks [1, 2, 3, 4, 5], limited prior work has combined multiple sensory modalities for robot learning. There have been some recent attempts that use audio [6, 7, 8, 9] or touch [10, 11, 12, 13, 14] in conjunction with vision for robot perception, but no prior work has simultaneously incorporated visual, acoustic, and tactile signals (the three principal sensory modalities) and studied their respective roles in challenging multisensory robotic manipulation tasks. We aim to demonstrate the benefit of fusing multiple sensory modalities for solving complex robotic manipulation tasks, and to provide an in-depth study of the characteristics of each modality and how they complement each other.
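As a schematic of the fusion idea only (not the paper's architecture, and with assumed feature dimensions), the sketch below concatenates per-modality features from vision, audio, and touch encoders and maps them to action logits for a manipulation policy.

```python
import torch
import torch.nn as nn

class MultisensoryFusion(nn.Module):
    """Late fusion of vision, audio, and touch features by concatenation,
    followed by a small MLP head that outputs action logits."""
    def __init__(self, vis_dim=512, aud_dim=256, tac_dim=256, n_actions=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim + aud_dim + tac_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, vis_feat, aud_feat, tac_feat):
        fused = torch.cat([vis_feat, aud_feat, tac_feat], dim=-1)  # combine modalities
        return self.head(fused)

# Toy usage with a batch of 4 pre-extracted features from each modality.
policy = MultisensoryFusion()
logits = policy(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 256))
```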