Goto

Collaborating Authors

 Rivkin, Dmitriy


CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots

arXiv.org Artificial Intelligence

Abstract-- This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than the explicit procedural directives typically seen in robotics. Unlike most prior work, where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions. We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot using our method, CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots), can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario. This paper explores the extent to which natural interaction is possible between human and robot in the context of a navigation task. We seek to answer the question: "Can a robot infer its task in a navigational context without receiving an explicit command?" Household robotic tasks are often formulated using imperative commands with a template structure that can be abstracted as "go-do" commands (go ...

[Figure 1: CARTIER prompts an LLM with knowledge about a robot's environment in order to parse user intent from implicit, conversational queries. It then informs the robot where to navigate in order to help the user.]
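
A minimal sketch of the core idea, assuming a generic text-completion callable (`llm`) rather than any particular LLM API: the robot's known scene objects are placed in the prompt alongside the user's conversational query, and the reply is mapped back to a navigation target. The function names and prompt wording here are illustrative stand-ins, not CARTIER's actual prompts.

```python
# Illustrative sketch: resolve an implicit, conversational request to a scene object.
from typing import Callable, List

def build_navigation_prompt(scene_objects: List[str], user_query: str) -> str:
    """Format the scene inventory and the user's conversational query for the LLM."""
    objects = ", ".join(scene_objects)
    return (
        "You control a household robot. The scene contains the following objects: "
        f"{objects}.\n"
        f'The user says: "{user_query}"\n'
        "Reply with the single object the robot should navigate to in order to help."
    )

def infer_navigation_target(llm: Callable[[str], str],
                            scene_objects: List[str],
                            user_query: str) -> str:
    """Query the LLM and normalize its answer to one of the known scene objects."""
    answer = llm(build_navigation_prompt(scene_objects, user_query)).strip().lower()
    for obj in scene_objects:
        if obj.lower() in answer:
            return obj
    return answer  # fall back to the raw reply if no known object is mentioned

if __name__ == "__main__":
    # Stand-in for a real LLM call, just to show the expected interface.
    fake_llm = lambda prompt: "Fridge"
    print(infer_navigation_target(fake_llm, ["Fridge", "Sofa", "Television"],
                                  "I'm thirsty and could use something cold."))  # -> Fridge
```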


PhotoBot: Reference-Guided Interactive Photography via Natural Language

arXiv.org Artificial Intelligence

We introduce PhotoBot, a framework for automated photo acquisition based on an interplay between high-level human language guidance and a robot photographer. We propose to communicate photography suggestions to the user via a reference picture that is retrieved from a curated gallery. We exploit a visual language model (VLM) and an object detector to characterize reference pictures via textual descriptions and use a large language model (LLM) to retrieve relevant reference pictures based on a user's language query through text-based reasoning. To establish correspondences between the reference picture and the observed scene, we exploit pre-trained features from a vision transformer capable of capturing semantic similarity across significantly varying images. Using these features, we compute pose adjustments for an RGB-D camera by solving a Perspective-n-Point (PnP) problem. We demonstrate our approach on a real-world manipulator equipped with a wrist camera. Our user studies show that photos taken by PhotoBot are often more aesthetically pleasing than those taken by users themselves, as measured by human feedback.
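
The pose-adjustment step can be illustrated with a small sketch: assuming keypoint matches between the reference picture and the observed RGB-D frame are already available (e.g., from vision-transformer features), the observed keypoints are lifted to 3D using the depth image and OpenCV's `solvePnP` recovers the camera pose that reproduces the reference view. The helper names and matching pipeline here are assumptions, not the PhotoBot code.

```python
# Illustrative sketch: recover a target camera pose from matched keypoints via PnP.
import numpy as np
import cv2

def backproject(pts_2d: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift pixel coordinates in the observed frame to 3D camera coordinates."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = pts_2d[:, 0], pts_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def reference_camera_pose(obs_pts_2d, ref_pts_2d, depth, K):
    """Pose (rvec, tvec) the camera should reach to reproduce the reference view."""
    pts_3d = backproject(np.asarray(obs_pts_2d, dtype=np.float64),
                         depth, np.asarray(K, dtype=np.float64))
    ok, rvec, tvec = cv2.solvePnP(pts_3d,
                                  np.asarray(ref_pts_2d, dtype=np.float64),
                                  np.asarray(K, dtype=np.float64),
                                  distCoeffs=None,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed; not enough reliable correspondences")
    return rvec, tvec
```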


SAGE: Smart home Agent with Grounded Execution

arXiv.org Artificial Intelligence

The common sense reasoning abilities and vast general knowledge of Large Language Models (LLMs) make them a natural fit for interpreting user requests in a Smart Home assistant context. However, LLMs' lack of specific knowledge about the user and their home limits their potential impact. SAGE (Smart Home Agent with Grounded Execution) overcomes these and other limitations by using a scheme in which a user request triggers an LLM-controlled sequence of discrete actions. These actions can be used to retrieve information, interact with the user, or manipulate device states. SAGE controls this process through a dynamically constructed tree of LLM prompts, which help it decide which action to take next, whether an action was successful, and when to terminate the process. The SAGE action set augments an LLM's capabilities to support some of the most critical requirements for a Smart Home assistant. These include: flexible and scalable user preference management ("is my team playing tonight?"), access to any smart device's full functionality without device-specific code via API reading ("turn down the screen brightness on my dryer"), persistent device state monitoring ("remind me to throw out the milk when I open the fridge"), natural device references using only a photo of the room ("turn on the light on the dresser"), and more. We introduce a benchmark of 50 new and challenging smart home tasks where SAGE achieves a 75% success rate, significantly outperforming existing LLM-enabled baselines (30% success rate).
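
A minimal sketch of an LLM-controlled loop over discrete actions in the spirit described above, with a generic `llm` callable and a hypothetical action dictionary standing in for SAGE's prompt tree and tool set; the names and prompt formats are illustrative only.

```python
# Illustrative sketch: the LLM repeatedly picks an action, sees its result, and stops itself.
from typing import Callable, Dict

def run_agent(llm: Callable[[str], str],
              actions: Dict[str, Callable[[str], str]],
              user_request: str,
              max_steps: int = 8) -> str:
    transcript = f"User request: {user_request}\n"
    for _ in range(max_steps):
        prompt = (
            transcript
            + "Available actions: " + ", ".join(actions) + ", done.\n"
            + "Reply as '<action>: <argument>' or 'done: <final answer>'."
        )
        name, _, arg = llm(prompt).partition(":")
        name, arg = name.strip().lower(), arg.strip()
        if name == "done":
            return arg
        result = actions.get(name, lambda a: f"unknown action '{name}'")(arg)
        transcript += f"{name}({arg}) -> {result}\n"
    return "stopped: step budget exhausted"

if __name__ == "__main__":
    # Scripted stand-in for an LLM, just to exercise the loop.
    scripted = iter(["get_state: dryer.brightness", "done: brightness lowered to 20%"])
    fake_llm = lambda prompt: next(scripted)
    print(run_agent(fake_llm, {"get_state": lambda arg: "brightness=80%"},
                    "turn down the screen brightness on my dryer"))
```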


ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence

arXiv.org Artificial Intelligence

Our work examines the way in which large language models can be used for robotic planning and sampling, specifically in the context of automated photographic documentation. We illustrate how to produce a photo-taking robot with an exceptional level of semantic awareness by leveraging recent advances in general-purpose language models (LMs) and vision-language models (VLMs). Given a high-level description of an event, we use an LM to generate a natural-language list of photo descriptions that one would expect a photographer to capture at the event. We then use a VLM to identify the best matches to these descriptions in the robot's video stream. The photo portfolios generated by our method are consistently rated as more appropriate to the event by human evaluators than those generated by existing methods.
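
A minimal sketch of the selection step, assuming the LM has already produced a list of shot descriptions and that a VLM similarity score is available as a callable (`vlm_score`); both interfaces are hypothetical stand-ins rather than the ANSEL implementation.

```python
# Illustrative sketch: pick the best-matching video frame for each LM-proposed shot description.
from typing import Callable, Dict, List, Sequence

def best_frames(descriptions: List[str],
                frames: Sequence,  # e.g. decoded frames from the robot's video stream
                vlm_score: Callable[[object, str], float]) -> Dict[str, int]:
    """Return, for each shot description, the index of the highest-scoring frame."""
    portfolio = {}
    for desc in descriptions:
        scores = [vlm_score(frame, desc) for frame in frames]
        portfolio[desc] = int(max(range(len(frames)), key=scores.__getitem__))
    return portfolio

if __name__ == "__main__":
    # Toy stand-ins: frames are strings, and the "VLM" scores word overlap.
    frames = ["speaker at podium", "audience clapping", "empty hallway"]
    score = lambda frame, text: len(set(frame.split()) & set(text.lower().split()))
    print(best_frames(["A keynote speaker at the podium"], frames, score))
```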


Self-Supervised Transformer Architecture for Change Detection in Radio Access Networks

arXiv.org Artificial Intelligence

Radio Access Networks (RANs) for telecommunications represent large agglomerations of interconnected hardware consisting of hundreds of thousands of transmitting devices (cells). Such networks undergo frequent and often heterogeneous changes caused by network operators, who are seeking to tune their system parameters for optimal performance. The effects of such changes are challenging to predict and will become even more so with the adoption of 5G/6G networks. Therefore, RAN monitoring is vital for network operators. We propose a self-supervised learning framework that leverages self-attention and self-distillation for this task. It works by detecting changes in Performance Measurement data, a collection of time-varying metrics which reflect a set of diverse measurements of network performance at the cell level. Experimental results show that our approach outperforms the state of the art by 4% on a dataset based on real-world data, consisting of about one hundred thousand time series. It also has the merits of being scalable and generalizable, which allows it to provide deep insight into the specifics of mode-of-operation changes while relying minimally on expert knowledge.
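
A minimal sketch of embedding-based change scoring on a single Performance Measurement series, with a toy hand-crafted encoder standing in for the paper's self-supervised transformer; windows before and after each time step are embedded and compared, and a large distance flags a likely configuration change.

```python
# Illustrative sketch: score change points by comparing pre- and post-window embeddings.
import numpy as np

def encode(window: np.ndarray) -> np.ndarray:
    """Placeholder encoder; the paper's framework learns this with self-attention and self-distillation."""
    return np.array([window.mean(), window.std(), np.abs(np.diff(window)).mean()])

def change_scores(series: np.ndarray, window: int = 24) -> np.ndarray:
    """Cosine distance between pre- and post-window embeddings at each time step."""
    scores = np.zeros(len(series))
    for t in range(window, len(series) - window):
        a, b = encode(series[t - window:t]), encode(series[t:t + window])
        denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
        scores[t] = 1.0 - float(a @ b) / denom
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ts = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])  # shift at t=200
    print(int(change_scores(ts).argmax()))  # peaks near the simulated change
```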