Manocha, Dinesh
GND: Global Navigation Dataset with Multi-Modal Perception and Multi-Category Traversability in Outdoor Campus Environments
Liang, Jing, Das, Dibyendu, Song, Daeun, Shuvo, Md Nahid Hasan, Durrani, Mohammad, Taranath, Karthik, Penskiy, Ivan, Manocha, Dinesh, Xiao, Xuesu
Navigating large-scale outdoor environments requires complex reasoning in terms of geometric structures, environmental semantics, and terrain characteristics, which are typically captured by onboard sensors such as LiDAR and cameras. While current mobile robots can navigate such environments using pre-defined, high-precision maps based on hand-crafted rules catered to the specific environment, they lack the commonsense reasoning capabilities that most humans possess when navigating unknown outdoor spaces. To address this gap, we introduce the Global Navigation Dataset (GND), a large-scale dataset that integrates multi-modal sensory data, including 3D LiDAR point clouds and RGB and 360-degree images, as well as multi-category traversability maps (pedestrian walkways, vehicle roadways, stairs, off-road terrain, and obstacles) from ten university campuses. These environments encompass a variety of parks, urban settings, elevation changes, and campus layouts of different scales. The dataset covers approximately 2.7 km² and includes at least 350 buildings in total. We also present a set of novel applications of GND to showcase its utility for enabling global robot navigation, such as map-based global navigation, mapless navigation, and global place recognition.
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models
Song, Daeun, Liang, Jing, Payandeh, Amirreza, Xiao, Xuesu, Manocha, Dinesh
We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior. VLM-Social-Nav uses a VLM-based scoring module that computes a cost term to steer the underlying planner toward socially appropriate and effective robot actions. Our overall approach reduces reliance on large training datasets and enhances adaptability in decision-making. In practice, it results in improved socially compliant navigation in human-shared environments. We demonstrate and evaluate our system in four different real-world social navigation scenarios with a Turtlebot robot. We observe at least a 27.38% improvement in the average success rate and a 19.05% improvement in the average collision rate across the four social navigation scenarios. Our user study scores show that VLM-Social-Nav generates the most socially compliant navigation behavior.
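As a rough illustration of the scoring idea described above (not the paper's implementation), the sketch below folds a VLM-derived social-compliance score into a planner's action cost; the function names, the weight lambda_social, and the toy values are assumptions for illustration only.

```python
# Hedged sketch: rank candidate actions by planner cost plus a social cost term
# derived from a VLM score. Names, weight, and toy values are illustrative only.

def select_action(candidate_actions, planner_cost, vlm_social_score, lambda_social=1.0):
    """Return the candidate action with the lowest combined cost.

    planner_cost(a)      -> float, e.g. distance-to-goal / obstacle cost
    vlm_social_score(a)  -> float in [0, 1], higher = more socially compliant
    """
    def combined_cost(a):
        # A highly compliant action (score near 1) incurs little extra cost.
        return planner_cost(a) + lambda_social * (1.0 - vlm_social_score(a))

    return min(candidate_actions, key=combined_cost)


# Toy usage with stand-ins for the planner and the VLM scoring module.
actions = ["go_straight", "slow_down", "pass_on_left"]
base_cost = {"go_straight": 1.0, "slow_down": 1.4, "pass_on_left": 1.2}
social = {"go_straight": 0.2, "slow_down": 0.9, "pass_on_left": 0.6}
print(select_action(actions, base_cost.get, social.get))  # -> "slow_down"
```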
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Chowdhury, Sanjoy, Nag, Sayan, Dasgupta, Subhrajyoti, Chen, Jun, Elhoseiny, Mohamed, Gao, Ruohan, Manocha, Dinesh
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, that comprises 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
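As a hedged sketch of the optimal-transport idea behind such a modality-alignment module (not Meerkat's actual code), the snippet below uses standard Sinkhorn iterations to compute a soft matching between image patch features and audio frame features; the shapes and the entropic regularization value are illustrative assumptions.

```python
# Minimal Sinkhorn-based alignment sketch: soft transport plan between image
# patch features and audio frame features. Shapes and eps are illustrative.
import torch
import torch.nn.functional as F

def sinkhorn_alignment(img_feats, aud_feats, eps=0.05, n_iters=50):
    """img_feats: (N, d), aud_feats: (M, d) -> (N, M) transport plan."""
    cost = 1.0 - F.normalize(img_feats, dim=-1) @ F.normalize(aud_feats, dim=-1).T
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    u = torch.full((cost.shape[0],), 1.0 / cost.shape[0])    # uniform row marginal
    v = torch.full((cost.shape[1],), 1.0 / cost.shape[1])    # uniform column marginal
    a, b = u.clone(), v.clone()
    for _ in range(n_iters):                                 # Sinkhorn-Knopp updates
        a = u / (K @ b)
        b = v / (K.T @ a)
    return torch.diag(a) @ K @ torch.diag(b)                 # rows sum to u, columns to v

plan = sinkhorn_alignment(torch.randn(49, 256), torch.randn(16, 256))
print(plan.shape)  # torch.Size([49, 16])
```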
Embodied Question Answering via Multi-LLM Systems
Patel, Bhrij, Dorbala, Vishnu Sashank, Manocha, Dinesh, Bedi, Amrit Singh
Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language model (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensemble LLMs, such as voting schemes and debates. CAM does not require any form of agent communication, sparing it the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAM's reliance on each independent agent and query context.
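To make the aggregation step concrete, here is a minimal sketch of a CAM-style classifier trained on independent agent answers; the feature encoding, model choice, and toy data are assumptions for illustration rather than the paper's exact setup.

```python
# Hedged sketch of a Central Answer Model: a classifier over the independent
# answers of several LLM agents, producing one robust answer per query.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Each row holds the categorical answers of 3 agents to the same query.
agent_answers = np.array([
    ["kitchen", "kitchen", "living_room"],
    ["bedroom", "bathroom", "bedroom"],
    ["kitchen", "kitchen", "kitchen"],
    ["bathroom", "bedroom", "bathroom"],
])
ground_truth = np.array(["kitchen", "bedroom", "kitchen", "bathroom"])

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(agent_answers)

cam = RandomForestClassifier(n_estimators=50, random_state=0)
cam.fit(X, ground_truth)

# At query time, CAM aggregates new agent responses without any inter-agent
# communication (in contrast to voting or debate schemes).
new_responses = encoder.transform([["kitchen", "living_room", "kitchen"]])
print(cam.predict(new_responses))
```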
Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles
Patel, Bhrij, Suttle, Wesley A., Koppel, Alec, Aggarwal, Vaneet, Sadler, Brian M., Bedi, Amrit Singh, Manocha, Dinesh
In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications. To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for global convergence in average-reward MDPs. Furthermore, our approach exhibits the tightest dependence of $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$ known from prior work. With a 2D grid-world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average-reward settings.
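As a hedged sketch of the MLMC gradient estimator idea (with sample_grad standing in as an assumed placeholder for a mean policy-gradient estimate over n consecutive transitions), the estimator draws a random level J, collects 2^J samples, and adds a telescoping correction so that long trajectories are needed only occasionally:

```python
# Illustrative MLMC gradient estimator: level J ~ Geometric(1/2), telescoping
# correction weighted by 2^J. `sample_grad(policy, n)` is an assumed placeholder.
import numpy as np

def mlmc_gradient(sample_grad, policy, t_max=2**10, rng=None):
    rng = rng or np.random.default_rng()
    j = rng.geometric(0.5)                          # random level; expected samples grow only like log(t_max)
    g = sample_grad(policy, 1)                      # base single-sample estimate
    if 2**j <= t_max:
        g_fine = sample_grad(policy, 2**j)          # 2^J-sample estimate
        g_coarse = sample_grad(policy, 2**(j - 1))  # 2^(J-1)-sample estimate
        g = g + (2**j) * (g_fine - g_coarse)        # 2^J weight keeps the telescoping sum unbiased
    return g

# Toy usage: a noisy scalar "gradient" whose variance shrinks with more samples.
toy_grad = lambda policy, n: 1.0 + np.random.default_rng().normal(scale=1 / np.sqrt(n))
print(mlmc_gradient(toy_grad, policy=None))
```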
IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning
Ghosal, Soumya Suvra, Basu, Samyadeep, Feizi, Soheil, Manocha, Dinesh
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performance, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used, where a set of contextual vectors is learned by leveraging information from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a "green" tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation, we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp across two representative tasks in a few-shot learning setup: generalization to novel classes, and unseen domain shifts. Through extensive experiments across 10 downstream datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance against state-of-the-art prompt-tuning frameworks. Notably, in a 16-shot setup, IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
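For readers unfamiliar with prompt tuning, the sketch below shows the CoOp-style mechanism that IntCoOp builds on: learnable context vectors are prepended to class (and, in IntCoOp, attribute) token embeddings and optimized against a frozen CLIP encoder. Shapes and initialization are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of CoOp-style learnable prompt context vectors (PyTorch).
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_classes=10):
        super().__init__()
        # Learnable context vectors shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name (and attribute) token embeddings, stored as a buffer.
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        n_classes = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Each class prompt = [learned context tokens | class token(s)];
        # only self.ctx receives gradients during prompt tuning.
        return torch.cat([ctx, self.cls_emb], dim=1)

prompts = PromptLearner()()   # (n_classes, n_ctx + 1, dim)
print(prompts.shape)          # torch.Size([10, 17, 512])
```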
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Ghosh, Sreyan, Kumar, Sonal, Seth, Ashish, Evuru, Chandra Kiran Reddy, Tyagi, Utkarsh, Sakshi, S, Nieto, Oriol, Duraiswami, Ramani, Manocha, Dinesh
Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in the literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA instruction-tuned on CompA-R proves to be superior in its complex reasoning and instruction following capabilities.
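As a hedged illustration of the multi-layer aggregation idea mentioned above (not GAMA's actual module), the sketch below combines hidden states from several layers of a frozen audio encoder using learnable softmax weights; the layer count and dimensions are assumptions.

```python
# Illustrative multi-layer feature aggregator over audio-encoder hidden states.
import torch
import torch.nn as nn

class MultiLayerAggregator(nn.Module):
    def __init__(self, n_layers=4, dim=768):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # softmax-normalized below
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_feats):
        # layer_feats: (n_layers, time, dim) hidden states from the audio encoder.
        w = torch.softmax(self.layer_weights, dim=0)
        fused = (w[:, None, None] * layer_feats).sum(dim=0)       # weighted sum -> (time, dim)
        return self.proj(fused)

agg = MultiLayerAggregator()
print(agg(torch.randn(4, 100, 768)).shape)  # torch.Size([100, 768])
```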
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics
Wu, Xiyang, Chakraborty, Souradip, Xian, Ruiqi, Liang, Jing, Guan, Tianrui, Liu, Fuxiao, Sadler, Brian M., Manocha, Dinesh, Bedi, Amrit Singh
In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplored yet extremely critical. LLMs and VLMs are highly susceptible to adversarial inputs, raising significant concerns about the safety of robotic systems. This concern is important because robots operate in the physical world, where erroneous actions can result in severe consequences. This paper explores this issue thoroughly, presenting a mathematical formulation of potential attacks on LLM/VLM-based robotic systems and offering experimental evidence of the safety challenges. Our empirical findings highlight a significant vulnerability: simple modifications to the input can drastically reduce system effectiveness. Specifically, our results demonstrate an average performance deterioration of 19.4% under minor input prompt modifications and a more alarming 29.1% under slight perceptual changes. These findings underscore the urgent need for robust countermeasures to ensure the safe and reliable deployment of advanced LLM/VLM-based robotic systems.
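As a hedged illustration (not necessarily the paper's exact formulation), such an attack can be framed as searching for a bounded input perturbation that maximizes the drop in task performance of the LLM/VLM-driven robot:
\[
\max_{\delta \in \Delta} \; \Big[ J\big(\pi(x)\big) - J\big(\pi(x \oplus \delta)\big) \Big],
\qquad \Delta = \{\delta : \|\delta\| \le \epsilon\},
\]
where $x$ is the input prompt or image, $x \oplus \delta$ its minimally perturbed version (a rephrased prompt or a slightly altered percept), $\pi$ the LLM/VLM-induced policy, and $J$ the resulting task performance.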
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Wu, Xiyang, Guan, Tianrui, Li, Dianqi, Huang, Shuaiyi, Liu, Xiaoyu, Wang, Xijun, Xian, Ruiqi, Shrivastava, Abhinav, Huang, Furong, Boyd-Graber, Jordan Lee, Zhou, Tianyi, Manocha, Dinesh
Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning about abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose failure patterns may hardly generalize, and finetuning on them could undermine their validity. These limitations motivate us to develop the first automatic benchmark generation approach, AUTOHALLUSION, which harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations.
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
Chowdhury, Sanjoy, Nag, Sayan, Joseph, K J, Srinivasan, Balaji Vasan, Manocha, Dinesh
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will draw attention to this pragmatic, yet relatively under-explored research area.
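As a hedged sketch of the kind of mechanism a "visual synapse" suggests (not MeLFusion's actual architecture), the snippet below cross-attends from intermediate diffusion features to image-encoder tokens so that visual semantics condition the generated music; the module name and shapes are assumptions.

```python
# Illustrative cross-attention conditioning of music diffusion features on image tokens.
import torch
import torch.nn as nn

class VisualCrossAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, music_latents, image_tokens):
        # music_latents: (B, T, dim) diffusion features; image_tokens: (B, P, dim).
        attended, _ = self.attn(music_latents, image_tokens, image_tokens)
        return music_latents + attended   # residual injection of visual semantics

out = VisualCrossAttention()(torch.randn(2, 128, 512), torch.randn(2, 50, 512))
print(out.shape)  # torch.Size([2, 128, 512])
```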