Kakodkar, Nikhil
Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs
Lotfi, Faraz, Faraji, Farnoosh, Kakodkar, Nikhil, Manderson, Travis, Meger, David, Dudek, Gregory
This paper explores leveraging large language models for map-free off-road navigation using generative AI, reducing the need for traditional data collection and annotation. We propose a method in which a robot receives verbal instructions, converted to text through Whisper, and a large language model (LLM) extracts landmarks, preferred terrains, and crucial adverbs that are translated into speed settings for constrained navigation. A language-driven semantic segmentation model generates text-based masks for identifying landmarks and terrain types in images. By translating 2D image points to the vehicle's motion plane using camera parameters, an MPC controller guides the vehicle towards the desired terrain. This approach enhances adaptation to diverse environments and facilitates the use of high-level instructions for navigating complex and challenging terrains.
Keywords: Constrained map-free navigation, large language models, language-driven semantic segmentation, preferred terrains, speech instruction, adverbs.
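A minimal sketch of the instruction-parsing stage the abstract describes (speech transcript in, landmark/terrain/speed out). The adverb-to-speed table, the NavigationGoal fields, and the rule-based extraction are illustrative assumptions; in the paper an LLM performs this extraction.

```python
from dataclasses import dataclass

# Hypothetical mapping from adverbs in the spoken command to speed limits (m/s).
ADVERB_SPEEDS = {"slowly": 0.5, "carefully": 0.8, "quickly": 2.0}

@dataclass
class NavigationGoal:
    landmark: str       # e.g. "the large rock"
    terrain: str        # e.g. "gravel"
    speed_limit: float  # m/s, derived from the adverb in the instruction

def parse_instruction(text: str) -> NavigationGoal:
    """Stand-in for the LLM extraction step: pull a landmark, a preferred
    terrain, and an adverb-derived speed limit out of the transcript."""
    words = text.lower().split()
    speed = next((v for w, v in ADVERB_SPEEDS.items() if w in words), 1.0)
    # Fixed slots here purely to show the downstream interface; the paper's
    # LLM would extract these from the free-form instruction.
    return NavigationGoal(landmark="the large rock", terrain="gravel", speed_limit=speed)

if __name__ == "__main__":
    goal = parse_instruction("drive slowly to the large rock over the gravel")
    print(goal)  # NavigationGoal(landmark='the large rock', terrain='gravel', speed_limit=0.5)
```

Downstream, the segmentation masks for the extracted terrain and the projected ground-plane points would feed the MPC cost, with speed_limit as a constraint.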
CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots
Rivkin, Dmitriy, Kakodkar, Nikhil, Hogan, Francois, Baghi, Bobak H., Dudek, Gregory
Abstract-- This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than the explicit procedural directives typically seen in robotics. Unlike most prior work, where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions. We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot using our method CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots) can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario. This paper explores the extent to which natural interaction is possible between human and robot in the context of a navigation task. We seek to answer the question: "Can a robot infer its task in a navigational context without receiving an explicit command?" Household robotic tasks are often formulated using imperative commands with a template structure that can be abstracted as "go-do" commands (go ...).
Figure 1 caption: CARTIER prompts an LLM with knowledge about a robot's environment in order to parse user intent from implicit, conversational queries. It then informs the robot where to navigate in order to help the user.
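A hedged sketch of the prompting pattern the abstract describes: give an LLM the objects present in the scene plus the user's conversational remark, and ask which object the robot should navigate to. The prompt wording, function names, and the injected llm callable are assumptions for illustration, not CARTIER's exact prompts.

```python
from typing import Callable, List

def build_prompt(scene_objects: List[str], user_query: str) -> str:
    """Combine environment knowledge and the user's implicit query into one prompt."""
    objects = ", ".join(scene_objects)
    return (
        "You are assisting a household robot.\n"
        f"Objects in the environment: {objects}.\n"
        f'The user said: "{user_query}"\n'
        "Reply with the single object the robot should navigate to."
    )

def infer_navigation_target(scene_objects: List[str],
                            user_query: str,
                            llm: Callable[[str], str]) -> str:
    """llm is any text-completion function (e.g. a wrapper around a chat model)."""
    answer = llm(build_prompt(scene_objects, user_query)).strip().lower()
    # Constrain the answer to a known object where possible.
    return next((o for o in scene_objects if o in answer), answer)

if __name__ == "__main__":
    # Stubbed model: an implicit query like "I'm out of milk" should map to the fridge.
    stub = lambda prompt: "fridge"
    print(infer_navigation_target(["fridge", "sofa", "sink"], "I'm out of milk", stub))
```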
ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence
Rivkin, Dmitriy, Dudek, Gregory, Kakodkar, Nikhil, Meger, David, Limoyo, Oliver, Liu, Xue, Hogan, Francois
Our work examines how large language models can be used for robotic planning and sampling, specifically in the context of automated photographic documentation. We illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general-purpose language (LM) and vision-language (VLM) models. Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. The photo portfolios generated by our method are consistently rated as more appropriate to the event by human evaluators than those generated by existing methods.
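A sketch of the description-to-frame matching step: for each LM-generated shot description, keep the video frame whose vision-language embedding is most similar. The embed_text and embed_image callables stand in for a VLM such as CLIP; they, and the cosine-similarity selection, are assumptions rather than the paper's exact pipeline.

```python
from typing import Callable, Dict, List, Sequence
import numpy as np

def best_frames(descriptions: List[str],
                frames: Sequence[np.ndarray],
                embed_text: Callable[[str], np.ndarray],
                embed_image: Callable[[np.ndarray], np.ndarray]) -> Dict[str, int]:
    """Return, for each description, the index of the highest-scoring frame."""
    # Embed and L2-normalize all candidate frames once.
    frame_vecs = np.stack([embed_image(f) for f in frames])
    frame_vecs /= np.linalg.norm(frame_vecs, axis=1, keepdims=True)
    portfolio = {}
    for desc in descriptions:
        t = embed_text(desc)
        t = t / np.linalg.norm(t)
        scores = frame_vecs @ t              # cosine similarity per frame
        portfolio[desc] = int(np.argmax(scores))
    return portfolio
```

In practice the descriptions would come from the LM given the event description, and frames would be sampled from the robot's video stream as it roams the event.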