AITopics | Guan, Tianrui

Guan, Tianrui

Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment

Ghosal, Soumya Suvra, Chakraborty, Souradip, Singh, Vaibhav, Guan, Tianrui, Wang, Mengdi, Beirami, Ahmad, Huang, Furong, Velasquez, Alvaro, Manocha, Dinesh, Bedi, Amrit Singh

arXiv.org Artificial Intelligence, Dec-20-2024

With the widespread deployment of Multimodal Large Language Models (MLLMs) for visual-reasoning tasks, improving their safety has become crucial. Recent research indicates that despite training-time safety alignment, these models remain vulnerable to jailbreak attacks. In this work, we first highlight an important safety gap: alignment achieved solely through safety training may be insufficient against jailbreak attacks. To address this vulnerability, we propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks. Additionally, we provide a mathematical characterization of Immune, offering provable guarantees against jailbreaks. Extensive evaluations on diverse jailbreak benchmarks using recent MLLMs reveal that Immune effectively enhances model safety while preserving the model's original capabilities. For instance, against text-based jailbreak attacks on LLaVA-1.6, Immune reduces the attack success rate by 57.82% and 16.78% compared to the base MLLM and the state-of-the-art defense strategy, respectively.
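
The abstract above describes steering generation at inference time with a safety reward model through controlled decoding. As a rough illustration only, and not the authors' implementation, the following toy sketch re-ranks the base model's top candidate tokens by adding a reward term to their log-probabilities. The function names, the `alpha` trade-off parameter, the `top_k` cutoff, and the toy reward are all assumptions made for this example.

```python
import math
from typing import Callable, Dict, List


def controlled_decode_step(
    base_logprobs: Dict[str, float],
    reward_fn: Callable[[List[str], str], float],
    prefix: List[str],
    alpha: float = 1.0,
    top_k: int = 3,
) -> str:
    """Choose the next token from base log-probabilities plus a safety reward.

    Hypothetical helper: alpha and top_k are illustrative knobs, not values
    taken from the paper.
    """
    # Keep only the top_k most likely tokens under the base model.
    candidates = sorted(base_logprobs, key=base_logprobs.get, reverse=True)[:top_k]
    # Score each candidate: log p(token | prefix) + reward / alpha.
    scores = {
        tok: base_logprobs[tok] + reward_fn(prefix, tok) / alpha
        for tok in candidates
    }
    # Greedy choice over the combined score.
    return max(scores, key=scores.get)


def toy_safety_reward(prefix: List[str], token: str) -> float:
    """Stand-in for a learned safety reward model (assumption)."""
    return -5.0 if token == "harmful_token" else 0.0


# Toy next-token distribution from a hypothetical base model.
base = {
    "harmful_token": math.log(0.6),
    "refusal": math.log(0.3),
    "hello": math.log(0.1),
}
choice = controlled_decode_step(base, toy_safety_reward, prefix=["how", "to"])
print(choice)
```

In this toy setting the base model prefers the unsafe token, and the reward penalty shifts the greedy choice to the next most likely candidate. A smaller `alpha` weights the reward more heavily.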

large language model, machine learning, natural language

arXiv:2411.18688

Country: North America > United States (0.46)

Genre:

Overview (1.00)
Research Report > New Finding (0.65)

Industry:

Information Technology > Security & Privacy (0.93)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments

Elnoor, Mohamed, Weerakoon, Kasun, Seneviratne, Gershom, Xian, Ruiqi, Guan, Tianrui, Jaffar, Mohamed Khalid M, Rajagopal, Vignesh, Manocha, Dinesh

arXiv.org Artificial Intelligence, Sep-30-2024

We present a novel autonomous robot navigation algorithm for outdoor environments that is capable of handling diverse terrain traversability conditions. Our approach, VLM-GroNav, uses vision-language models (VLMs) and integrates them with physical grounding that is used to assess intrinsic terrain properties such as deformability and slipperiness. We use proprioception-based sensing, which provides direct measurements of these physical properties and enhances the overall semantic understanding of the terrains. Our formulation uses in-context learning to ground the VLM's semantic understanding with proprioceptive data to allow dynamic updates of traversability estimates based on the robot's real-time physical interactions with the environment. We use the updated traversability estimations to inform both the local and global planners for real-time trajectory replanning. We validate our method on a legged robot (Ghost Vision 60) and a wheeled robot (Clearpath Husky) in diverse real-world outdoor environments with different deformable and slippery terrains. In practice, we observe significant improvements over state-of-the-art methods, with up to a 50% increase in navigation success rate.

artificial intelligence, large language model, natural language

arXiv:2409.20445

Country: North America > United States > Maryland (0.14)

Genre: Research Report > Promising Solution (0.48)

Industry: Energy (0.93)

Technology:

Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)

SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining

Xian, Ruiqi, Wu, Xiyang, Guan, Tianrui, Wang, Xijun, Gong, Boqing, Manocha, Dinesh

arXiv.org Artificial Intelligence, Sep-26-2024

We introduce SOAR, a novel self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs). We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance. This is in contrast to prior works that primarily incorporate object information during the fine-tuning stage. Specifically, we first propose a novel object-aware masking strategy designed to retain the visibility of certain patches related to objects throughout the pretraining phase. Second, we introduce an object-aware loss function that utilizes object information to adjust the reconstruction loss, preventing bias towards less informative background patches. In practice, SOAR with a vanilla ViT backbone outperforms the best UAV action recognition models, recording a 9.7% and 21.4% boost in top-1 accuracy on the NEC-Drone and UAV-Human datasets, while delivering an inference speed of 18.7ms per video, making it 2x to 5x faster. Additionally, SOAR obtains comparable accuracy to prior self-supervised learning (SSL) methods while requiring 87.5% less pretraining time and 25% less memory usage.
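
The object-aware masking idea in the abstract above (keeping some object-related patches visible during masked pretraining) can be illustrated with a small sketch. This is not the SOAR implementation: the patch grid, the bounding-box input, and the `mask_ratio` and `keep_object_frac` parameters are hypothetical choices for this example, and the boxes are assumed to come from some external detector.

```python
import random
from typing import List, Set, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, max edge exclusive


def object_patches(grid: int, patch: int, boxes: List[Box]) -> Set[int]:
    """Return indices of patches (row-major) that overlap any object box."""
    hits: Set[int] = set()
    for idx in range(grid * grid):
        row, col = divmod(idx, grid)
        px0, py0 = col * patch, row * patch
        px1, py1 = px0 + patch, py0 + patch
        for x0, y0, x1, y1 in boxes:
            # Axis-aligned rectangle overlap test.
            if px0 < x1 and x0 < px1 and py0 < y1 and y0 < py1:
                hits.add(idx)
                break
    return hits


def object_aware_mask(
    grid: int,
    patch: int,
    boxes: List[Box],
    mask_ratio: float = 0.75,
    keep_object_frac: float = 0.5,
    seed: int = 0,
) -> Tuple[Set[int], Set[int]]:
    """Sample masked patches while protecting a fraction of object patches."""
    rng = random.Random(seed)
    total = grid * grid
    n_mask = int(total * mask_ratio)
    obj = sorted(object_patches(grid, patch, boxes))
    # Never protect so many patches that the mask ratio becomes unreachable.
    n_protect = min(int(len(obj) * keep_object_frac), total - n_mask)
    protected = set(rng.sample(obj, n_protect))
    # Mask uniformly at random among the unprotected patches.
    maskable = [i for i in range(total) if i not in protected]
    masked = set(rng.sample(maskable, n_mask))
    return masked, protected


# 4x4 grid of 16-pixel patches with one object box covering the top-left 2x2 patches.
masked, protected = object_aware_mask(grid=4, patch=16, boxes=[(0, 0, 32, 32)])
print(sorted(masked), sorted(protected))
```

The protected patches always stay visible to the encoder, while the overall mask ratio is unchanged. The object-aware loss mentioned in the abstract is not covered by this sketch.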

artificial intelligence, machine learning, recognition

arXiv:2409.18300

Country: North America > United States > Maryland (0.28)

Genre: Research Report (1.00)

Industry: Information Technology (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

Wu, Xiyang, Chakraborty, Souradip, Xian, Ruiqi, Liang, Jing, Guan, Tianrui, Liu, Fuxiao, Sadler, Brian M., Manocha, Dinesh, Bedi, Amrit Singh

arXiv.org Artificial Intelligence, Jun-16-2024

In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplored yet extremely critical. LLMs and VLMs are highly susceptible to adversarial inputs, prompting a significant inquiry into the safety of robotic systems. This concern is important because robots operate in the physical world, where erroneous actions can result in severe consequences. This paper explores this issue thoroughly, presenting a mathematical formulation of potential attacks on LLM/VLM-based robotic systems and offering experimental evidence of the safety challenges. Our empirical findings highlight a significant vulnerability: simple modifications to the input can drastically reduce system effectiveness. Specifically, our results demonstrate an average performance deterioration of 19.4% under minor input prompt modifications and a more alarming 29.1% under slight perceptual changes. These findings underscore the urgent need for robust countermeasures to ensure the safe and reliable deployment of advanced LLM/VLM-based robotic systems.

artificial intelligence, large language model, natural language

arXiv:2402.10340

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > Florida > Orange County > Orlando (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

Wu, Xiyang, Guan, Tianrui, Li, Dianqi, Huang, Shuaiyi, Liu, Xiaoyu, Wang, Xijun, Xian, Ruiqi, Shrivastava, Abhinav, Huang, Furong, Boyd-Graber, Jordan Lee, Zhou, Tianyi, Manocha, Dinesh

arXiv.org Artificial Intelligence, Jun-16-2024

Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose failure patterns may not generalize well, and finetuning on them could undermine their validity. This motivates us to develop the first automatic benchmark generation approach, AUTOHALLUSION, that harnesses a few principal strategies to create diverse hallucination examples. It probes the language modules in LVLMs for context cues and uses them to synthesize images by: (1) adding objects abnormal to the context cues; (2) for two co-occurring objects, keeping one and excluding the other; or (3) removing objects closely tied to the context cues. It then generates image-based questions whose ground-truth answers contradict the language module's prior. A model has to overcome contextual biases and distractions to reach correct answers, while incorrect or inconsistent answers indicate hallucinations. AUTOHALLUSION enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AUTOHALLUSION, paving the way for a long battle against hallucinations.

large language model, machine learning, natural language

arXiv:2406.10900

Country:

North America > United States > Maryland (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Guan, Tianrui, Yang, Yurou, Cheng, Harry, Lin, Muyuan, Kim, Richard, Madhivanan, Rajasimman, Sen, Arnie, Manocha, Dinesh

arXiv.org Artificial Intelligence, May-8-2024

In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38% to 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, with 5% and 16.67% improvement in terms of navigation success rate, respectively.

artificial intelligence, large language model, natural language

arXiv:2405.05363

Country: North America > United States > Maryland > Prince George's County > College Park (0.14)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments

Elnoor, Mohamed, Weerakoon, Kasun, Sathyamoorthy, Adarsh Jagan, Guan, Tianrui, Rajagopal, Vignesh, Manocha, Dinesh

arXiv.org Artificial Intelligence, Mar-19-2024

We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps (a general knowledge map, a traversability history map, and a current proprioception map), which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for navigation. The general knowledge map encodes terrains semantically segmented from visual sensing, and represents a terrain's typically expected traversability. The traversability history map encodes the robot's recent proprioceptive measurements on a terrain and its semantic segmentation as a cost map. Further, the robot's present proprioceptive measurement is encoded as a cost map in the current proprioception map. As the general knowledge map and traversability history map rely on semantic segmentation, we evaluate the reliability of the visual sensory data by estimating the brightness and motion blur of input RGB images and accordingly combine the three cost maps to obtain the coupled traversability cost map used for navigation. Leveraging this adaptive coupling, the robot can depend on the most reliable input modality available. Finally, we present a novel planner that selects appropriate gaits and velocities for traversing challenging outdoor environments using the coupled traversability cost map. We demonstrate AMCO's navigation performance in different real-world outdoor environments and observe a 10.8%-34.9% reduction w.r.t. two stability metrics, and up to 50% improvement in terms of success rate compared to current navigation methods.
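
To make the adaptive coupling described above concrete, here is a minimal sketch of one way three cost maps could be blended according to visual reliability. The weighting scheme, the thresholds `min_brightness` and `max_blur`, and both function names are assumptions for illustration; the abstract does not specify the actual coupling rule used by AMCO.

```python
from typing import List

Grid = List[List[float]]


def visual_reliability(
    brightness: float,
    blur: float,
    min_brightness: float = 0.2,
    max_blur: float = 0.5,
) -> float:
    """Map brightness and motion blur (both in [0, 1]) to a weight in [0, 1].

    Thresholds are hypothetical: dark or blurry images lower the weight.
    """
    bright_score = min(1.0, brightness / min_brightness)
    blur_score = max(0.0, 1.0 - blur / max_blur)
    return bright_score * blur_score


def couple_cost_maps(
    general: Grid, history: Grid, proprio: Grid, reliability: float
) -> Grid:
    """Blend two vision-dependent maps with the current proprioception map.

    The vision-dependent maps are weighted by `reliability`; the
    proprioception map always has weight 1 (an assumed scheme).
    """
    norm = 2.0 * reliability + 1.0
    return [
        [
            (reliability * g + reliability * h + p) / norm
            for g, h, p in zip(g_row, h_row, p_row)
        ]
        for g_row, h_row, p_row in zip(general, history, proprio)
    ]


# One-cell example maps with costs in [0, 1].
general = [[0.2]]
history = [[0.4]]
proprio = [[0.9]]

clear_view = couple_cost_maps(
    general, history, proprio, visual_reliability(brightness=0.8, blur=0.0)
)
dark_view = couple_cost_maps(
    general, history, proprio, visual_reliability(brightness=0.0, blur=0.0)
)
print(clear_view, dark_view)
```

With a well-lit, sharp image, all three maps contribute equally. With an unusable image, the coupled map falls back to the proprioception map alone, which mirrors the idea of depending on the most reliable modality.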

artificial intelligence, robot, terrain

arXiv:2403.13235

Country: Asia > India (0.14)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)

Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey

Liu, Xiaoyu, Xu, Paiheng, Wu, Junda, Yuan, Jiaxin, Yang, Yifan, Zhou, Yuhang, Liu, Fuxiao, Guan, Tianrui, Wang, Haoliang, Yu, Tong, McAuley, Julian, Ai, Wei, Huang, Furong

arXiv.org Artificial Intelligence, Mar-14-2024

Recently, Large Language Models (LLMs) have showcased remarkable versatility across a spectrum of critical tasks. LLMs are adept at tasks such as copywriting, enhancing original sentences with their distinct style and voice, responding to knowledge base queries, generating code, solving mathematical problems, and performing classification or generation tasks tailored to user requirements. Moreover, there has been a recent expansion into multi-modal variants, such as Large Visual Language Models (LVLMs) or Large Multi-modal Language Models, which broaden their input/output capabilities to encompass various modalities. This evolution has significantly enhanced both the potential and range of applications of these models. In this survey, our primary focus is on Transformer-based LLMs. The capability of LLMs is fundamentally rooted in their inference abilities, which dictate their proficiency in comprehending, processing, and providing solutions to various inquiries, as well as their adaptability to societally impactful domains.

arxiv preprint arxiv, large language model, machine learning

arXiv:2403.09606

Country: North America > United States > California (0.28)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.68)

Industry:

Information Technology > Security & Privacy (0.46)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.46)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models

Guan, Tianrui, Liu, Fuxiao, Wu, Xiyang, Xian, Ruiqi, Li, Zongxia, Liu, Xiaoyu, Wang, Xijun, Chen, Lichang, Huang, Furong, Yacoob, Yaser, Manocha, Dinesh, Zhou, Tianyi

arXiv.org Artificial Intelligence, Nov-28-2023

We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(ision) and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions, all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies, logical consistency, and various failure modes. In our evaluation on HallusionBench, we benchmarked 13 different models, highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably, all other evaluated models achieve accuracy below 16%. Moreover, our analysis not only highlights the observed failure modes, including language hallucination and visual illusion, but also deepens our understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. Based on these insights, we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyi-lab/HallusionBench.

large language model, llava-1, machine learning

arXiv:2310.14566

Country:

Asia (1.00)
North America > United States > Maryland (0.14)
Europe > United Kingdom > England (0.14)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (0.92)
Leisure & Entertainment > Sports (0.70)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning

Wu, Xiyang, Chandra, Rohan, Guan, Tianrui, Bedi, Amrit Singh, Manocha, Dinesh

arXiv.org Artificial Intelligence, Aug-21-2023

Navigating safely and efficiently in dense and heterogeneous traffic scenarios is challenging for autonomous vehicles (AVs) due to their inability to infer the behaviors or intentions of nearby drivers. In this work, we introduce a distributed multi-agent reinforcement learning (MARL) algorithm that can predict trajectories and intents in dense and heterogeneous traffic scenarios. Our approach for intent-aware planning, iPLAN, allows agents to infer nearby drivers' intents solely from their local observations. We model two distinct incentives for agents' strategies: Behavioral Incentive for high-level decision-making based on their driving behavior or personality and Instant Incentive for motion planning for collision avoidance based on the current traffic state. Our approach enables agents to infer their opponents' behavior incentives and integrate this inferred information into their decision-making and motion-planning processes. We perform experiments on two simulation environments, Non-Cooperative Navigation and Heterogeneous Highway. In Heterogeneous Highway, results show that, compared with centralized training decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our method yields a 4.3% and 38.4% higher episodic reward in mild and chaotic traffic, with 48.1% higher success rate and 80.6% longer survival time in chaotic traffic. We also compare with a decentralized training decentralized execution (DTDE) baseline IPPO and demonstrate a higher episodic reward of 12.7% and 6.3% in mild traffic and chaotic traffic, 25.3% higher success rate, and 13.7% longer survival time.

artificial intelligence, machine learning, reinforcement learning

arXiv:2306.06236

Country:

North America > United States > Maryland (0.14)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.48)

Industry:

Leisure & Entertainment > Games > Computer Games (0.66)
Transportation > Ground > Road (0.47)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)