AITopics | Manocha, Dinesh

Collaborating Authors

Manocha, Dinesh

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Ghosh, Sreyan, Kumar, Sonal, Seth, Ashish, Chiniya, Purva, Tyagi, Utkarsh, Duraiswami, Ramani, Manocha, Dinesh

arXiv.org Artificial IntelligenceJun-6-2024

Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated using ASR beam-search. This is further conditioned on lip motions. This approach addresses key challenges in traditional AVSR learning, such as the lack of large-scale paired datasets and difficulties in adapting to new domains. We experiment on 4 datasets in various settings and show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%. We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs that is additionally equipped with lip motion cues to promote further research in this space

artificial intelligence, hypothesis, natural language, (12 more...)

arXiv.org Artificial Intelligence

2406.04432

Country: North America > United States > Maryland (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions

Ghosh, Sreyan, Tyagi, Utkarsh, Kumar, Sonal, Evuru, C. K., Ramaneswaran, S, Sakshi, S, Manocha, Dinesh

arXiv.org Artificial IntelligenceJun-6-2024

We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To learn the task of expanding abstract descriptions, we first train BART on a large-scale synthetic dataset with abstract-document pairs. Next, to generate abstract descriptions for a document, we propose a simple, controllable, and training-free method based on editing AMR graphs. ABEX brings the best of both worlds: by expanding from abstract representations, it preserves the original semantic properties of the documents, like style and meaning, thereby maintaining alignment with the original label and data distribution. At the same time, the fundamental process of elaborating on abstract descriptions facilitates diverse generations. We demonstrate the effectiveness of ABEX on 4 NLU tasks spanning 12 datasets and 4 low-resource settings. ABEX outperforms all our baselines qualitatively with improvements of 0.04% - 38.8%. Qualitatively, ABEX outperforms all prior methods from literature in terms of context and length diversity.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2406.04286

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States > California (0.28)
North America > United States > New York > New York County > New York City (0.14)

Genre:

Research Report (1.00)
Personal (1.00)

Industry:

Transportation > Passenger (1.00)
Media > Music (1.00)
Leisure & Entertainment > Sports (0.93)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Add feedback

Transfer Q Star: Principled Decoding for LLM Alignment

Chakraborty, Souradip, Ghosal, Soumya Suvra, Yin, Ming, Manocha, Dinesh, Wang, Mengdi, Bedi, Amrit Singh, Huang, Furong

arXiv.org Artificial IntelligenceMay-30-2024

Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable framework for alignment. However, principled decoding methods rely on oracle access to an optimal Q-function ($Q^*$), which is often unavailable in practice. Hence, prior SoTA methods either approximate this $Q^*$ using $Q^{\pi_{\texttt{sft}}}$ (derived from the reference $\texttt{SFT}$ model) or rely on short-term rewards, resulting in sub-optimal decoding performance. In this work, we propose Transfer $Q^*$, which implicitly estimates the optimal value function for a target reward $r$ through a baseline model $\rho_{\texttt{BL}}$ aligned with a baseline reward $\rho_{\texttt{BL}}$ (which can be different from the target reward $r$). Theoretical analyses of Transfer $Q^*$ provide a rigorous characterization of its optimality, deriving an upper bound on the sub-optimality gap and identifying a hyperparameter to control the deviation from the pre-trained reference $\texttt{SFT}$ model based on user needs. Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods and demonstrates superior empirical performance across key metrics such as coherence, diversity, and quality in extensive tests on several synthetic and real datasets.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.20495

Country: North America > United States > Maryland (0.14)

Genre: Research Report > Promising Solution (0.46)

Industry:

Government > Regional Government > North America Government > United States Government (0.67)
Government > Military (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

EM-GANSim: Real-time and Accurate EM Simulation Using Conditional GANs for 3D Indoor Scenes

Wang, Ruichen, Manocha, Dinesh

arXiv.org Artificial IntelligenceMay-27-2024

We present a novel machine-learning (ML) approach (EM-GANSim) for real-time electromagnetic (EM) propagation that is used for wireless communication simulation in 3D indoor environments. Our approach uses a modified conditional Generative Adversarial Network (GAN) that incorporates encoded geometry and transmitter location while adhering to the electromagnetic propagation theory. The overall physically-inspired learning is able to predict the power distribution in 3D scenes, which is represented using heatmaps. Our overall accuracy is comparable to ray tracing-based EM simulation, as evidenced by lower mean squared error values. Furthermore, our GAN-based method drastically reduces the computation time, achieving a 5X speedup on complex benchmarks. In practice, it can compute the signal strength in a few milliseconds on any location in 3D indoor environments. We also present a large dataset of 3D models and EM ray tracing-simulated heatmaps. To the best of our knowledge, EM-GANSim is the first real-time algorithm for EM simulation in complex 3D indoor environments. We plan to release the code and the dataset.

machine learning, real time system, simulation, (20 more...)

arXiv.org Artificial Intelligence

2405.17366

Country: North America > United States > Maryland > Prince George's County > College Park (0.14)

Genre: Research Report (0.50)

Industry: Energy > Oil & Gas > Upstream (0.79)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Architecture > Real Time Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

GAMEOPT+: Improving Fuel Efficiency in Unregulated Heterogeneous Traffic Intersections via Optimal Multi-agent Cooperative Control

Suriyarachchi, Nilesh, Chandra, Rohan, Anantula, Arya, Baras, John S., Manocha, Dinesh

arXiv.org Artificial IntelligenceMay-26-2024

Better fuel efficiency leads to better financial security as well as a cleaner environment. We propose a novel approach for improving fuel efficiency in unstructured and unregulated traffic environments. Existing intelligent transportation solutions for improving fuel efficiency, however, apply only to traffic intersections with sparse traffic or traffic where drivers obey the regulations, or both. We propose GameOpt+, a novel hybrid approach for cooperative intersection control in dynamic, multi-lane, unsignalized intersections. GameOpt+ is a hybrid solution that combines an auction mechanism and an optimization-based trajectory planner. It generates a priority entrance sequence for each agent and computes velocity controls in real-time, taking less than 10 milliseconds even in high-density traffic with over 10,000 vehicles per hour. Compared to fully optimization-based methods, it operates 100 times faster while ensuring fairness, safety, and efficiency. Tested on the SUMO simulator, our algorithm improves throughput by at least 25%, reduces the time to reach the goal by at least 70%, and decreases fuel consumption by 50% compared to auction-based and signaled approaches using traffic lights and stop signs. GameOpt+ is also unaffected by unbalanced traffic inflows, whereas some of the other baselines encountered a decrease in performance in unbalanced traffic inflow environments.

artificial intelligence, machine learning, vehicle, (19 more...)

arXiv.org Artificial Intelligence

2405.1643

Country: North America > United States > Maryland > Prince George's County > College Park (0.14)

Genre: Research Report (1.00)

Industry:

Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
Energy (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

Ghosh, Sreyan, Evuru, Chandra Kiran Reddy, Kumar, Sonal, Tyagi, Utkarsh, Nieto, Oriol, Jin, Zeyu, Manocha, Dinesh

arXiv.org Artificial IntelligenceMay-24-2024

Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1) The community's efforts have been primarily targeted towards reducing hallucinations related to visual recognition (VR) prompts (e.g., prompts that only require describing the image), thereby ignoring hallucinations for cognitive prompts (e.g., prompts that require additional skills like reasoning on contents of the image). (2) LVLMs lack visual perception, i.e., they can see but not necessarily understand or perceive the input image. We analyze responses to cognitive prompts and show that LVLMs hallucinate due to a perception gap: although LVLMs accurately recognize visual elements in the input image and possess sufficient cognitive skills, they struggle to respond accurately and hallucinate. To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations. Specifically, we first describe the image and add it as a prefix to the instruction. Next, during auto-regressive decoding, we sample from the plausible candidates according to their KL-Divergence (KLD) to the description, where lower KLD is given higher preference. Experimental results on several benchmarks and LVLMs show that VDGD improves significantly over other baselines in reducing hallucinations. We also propose VaLLu, a benchmark for the comprehensive evaluation of the cognitive capabilities of LVLMs.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2405.15683

Country:

North America > United States (0.67)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Industry:

Health & Medicine (0.68)
Banking & Finance (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Guan, Tianrui, Yang, Yurou, Cheng, Harry, Lin, Muyuan, Kim, Richard, Madhivanan, Rajasimman, Sen, Arnie, Manocha, Dinesh

arXiv.org Artificial IntelligenceMay-8-2024

In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and real world, showing 5% and 16.67% improvement in terms of navigation success rate, respectively.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.05363

Country: North America > United States > Maryland > Prince George's County > College Park (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

S-EQA: Tackling Situational Queries in Embodied Question Answering

Dorbala, Vishnu Sashank, Goyal, Prasoon, Piramuthu, Robinson, Johnston, Michael, Manocha, Dinesh, Ghanadhan, Reza

arXiv.org Artificial IntelligenceMay-7-2024

We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just what the target objects pertaining to the query are, but also requires a consensus on their states to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries, using multiple forms of semantic similarity. We validate the generated dataset via a large scale user-study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA with a high 97.26% of the generated queries being deemed answerable, given the consensus object data. Conversely, we observe a low correlation of 46.2% on the LLM-predicted answers to human-evaluated ones; indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which unlike other simulators, contains several objects with modifiable states that also visually appear different upon modification -- enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2405.04732

Country: North America > United States > Maryland (0.14)

Genre:

Questionnaire & Opinion Survey (0.97)
Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes

Maxey, Christopher, Choi, Jaehoon, Lee, Yonghan, Lee, Hyungtae, Manocha, Dinesh, Kwon, Heesung

arXiv.org Artificial IntelligenceMay-4-2024

In this paper, we present a new approach to bridge the domain gap between synthetic and real-world data for un- manned aerial vehicle (UAV)-based perception. Our formu- lation is designed for dynamic scenes, consisting of moving objects or human actions, where the goal is to recognize the pose or actions. We propose an extension of K-Planes Neural Radiance Field (NeRF), wherein our algorithm stores a set of tiered feature vectors. The tiered feature vectors are generated to effectively model conceptual information about a scene as well as an image decoder that transforms output feature maps into RGB images. Our technique leverages the information amongst both static and dynamic objects within a scene and is able to capture salient scene attributes of high altitude videos. We evaluate its performance on challenging datasets, including Okutama Action and UG2, and observe considerable improvement in accuracy over state of the art aerial perception algorithms.

artificial intelligence, data mining, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2405.02762

Country:

Europe > Germany (0.14)
Asia (0.14)

Genre: Research Report (0.64)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Data Science > Data Mining > Feature Extraction (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.84)

Add feedback

"Don't forget to put the milk back!" Dataset for Enabling Embodied Agents to Detect Anomalous Situations

Mullen, James F. Jr, Goyal, Prasoon, Piramuthu, Robinson, Johnston, Michael, Manocha, Dinesh, Ghanadan, Reza

arXiv.org Artificial IntelligenceApr-12-2024

Home robots intend to make their users lives easier. Our work assists in this goal by enabling robots to inform their users of dangerous or unsanitary anomalies in their home. Some examples of these anomalies include the user leaving their milk out, forgetting to turn off the stove, or leaving poison accessible to children. To move towards enabling home robots with these abilities, we have created a new dataset, which we call SafetyDetect. The SafetyDetect dataset consists of 1000 anomalous home scenes, each of which contains unsafe or unsanitary situations for an agent to detect. Our approach utilizes large language models (LLMs) alongside both a graph representation of the scene and the relationships between the objects in the scene. Our key insight is that this connected scene graph and the object relationships it encodes enables the LLM to better reason about the scene -- especially as it relates to detecting dangerous or unsanitary situations. Our most promising approach utilizes GPT-4 and pursues a categorization technique where object relations from the scene graph are classified as normal, dangerous, unsanitary, or dangerous for children. This method is able to correctly identify over 90% of anomalous scenarios in the SafetyDetect Dataset. Additionally, we conduct real world experiments on a ClearPath TurtleBot where we generate a scene graph from visuals of the real world scene, and run our approach with no modification. This setup resulted in little performance loss. The SafetyDetect Dataset and code will be released to the public upon this papers publication.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2404.08827

Country: North America > United States (0.14)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback