Goto

Collaborating Authors

 Heck, Larry


SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

arXiv.org Artificial Intelligence

Natural language interaction with sensing systems is crucial for enabling all users to comprehend sensor data and its impact on their everyday lives. However, existing systems, which typically operate in a Question Answering (QA) manner, are significantly limited in terms of the duration and complexity of sensor data they can handle. In this work, we introduce SensorChat, the first end-to-end QA system designed for long-term sensor monitoring with multimodal and high-dimensional data including time series. SensorChat effectively answers both qualitative (requiring high-level reasoning) and quantitative (requiring accurate responses derived from sensor data) questions in real-world scenarios. To achieve this, SensorChat uses an innovative three-stage pipeline that includes question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) for intuitive human interactions and to guide the sensor data query process. Unlike existing multimodal LLMs, SensorChat incorporates an explicit query stage to precisely extract factual information from long-duration sensor data. We implement SensorChat and demonstrate its capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves up to 26% higher answer accuracy than state-of-the-art systems on quantitative questions. Additionally, a user study with eight volunteers highlights SensorChat's effectiveness in handling qualitative and open-ended questions.


SensorQA: A Question Answering Benchmark for Daily-Life Monitoring

arXiv.org Artificial Intelligence

With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: \url{https://github.com/benjamin-reichman/SensorQA}.


Reading with Intent -- Neutralizing Intent

arXiv.org Artificial Intelligence

Queries to large language models (LLMs) can be divided into two parts: the instruction/question and the accompanying context. The context for retrieval-augmented generation (RAG) systems in most benchmarks comes from Wikipedia or Wikipedia-like texts which are written in a neutral and factual tone. However, when RAG systems retrieve internet-based content, they encounter text with diverse tones and linguistic styles, introducing challenges for downstream tasks. The Reading with Intent task addresses this issue by evaluating how varying tones in context passages affect model performance. Building on prior work that focused on sarcasm, we extend this paradigm by constructing a dataset where context passages are transformed to $11$ distinct emotions using a better synthetic data generation approach. Using this dataset, we train an emotion translation model to systematically adapt passages to specified emotional tones. The human evaluation shows that the LLM fine-tuned to become the emotion-translator benefited from the synthetically generated data. Finally, the emotion-translator is used in the Reading with Intent task to transform the passages to a neutral tone. By neutralizing the passages, it mitigates the challenges posed by sarcastic passages and improves overall results on this task by about $3\%$.


LEGO: Language Model Building Blocks

arXiv.org Artificial Intelligence

Large language models (LLMs) are essential in natural language processing (NLP) but are costly in data collection, pre-training, fine-tuning, and inference. Task-specific small language models (SLMs) offer a cheaper alternative but lack robustness and generalization. This paper proposes LEGO, a novel technique to extract SLMs from an LLM and recombine them. Using state-of-the-art LLM pruning strategies, we can create task- and user-specific SLM building blocks that are efficient for fine-tuning and inference while also preserving user data privacy. LEGO utilizes Federated Learning and a novel aggregation scheme for the LLM reconstruction, maintaining robustness without high costs and preserving user data privacy. We experimentally demonstrate the versatility of LEGO, showing its ability to enable model heterogeneity and mitigate the effects of data heterogeneity while maintaining LLM robustness.


Allo-AVA: A Large-Scale Multimodal Conversational AI Dataset for Allocentric Avatar Gesture Animation

arXiv.org Artificial Intelligence

The scarcity of high-quality, multimodal training data severely hinders the creation of lifelike avatar animations for conversational AI in virtual environments. Existing datasets often lack the intricate synchronization between speech, facial expressions, and body movements that characterize natural human communication. To address this critical gap, we introduce Allo-AVA, a large-scale dataset specifically designed for text and audio-driven avatar gesture animation in an allocentric (third person point-of-view) context. Allo-AVA consists of $\sim$1,250 hours of diverse video content, complete with audio, transcripts, and extracted keypoints. Allo-AVA uniquely maps these keypoints to precise timestamps, enabling accurate replication of human movements (body and facial gestures) in synchronization with speech. This comprehensive resource enables the development and evaluation of more natural, context-aware avatar animation models, potentially transforming applications ranging from virtual reality to digital assistants.


Large Body Language Models

arXiv.org Artificial Intelligence

As virtual agents become increasingly prevalent in human-computer interaction, generating realistic and contextually appropriate gestures in real-time remains a significant challenge. While neural rendering techniques have made substantial progress with static scripts, their applicability to human-computer interactions remains limited. To address this, we introduce Large Body Language Models (LBLMs) and present LBLM-AVA, a novel LBLM architecture that combines a Transformer-XL large language model with a parallelized diffusion model to generate human-like gestures from multimodal inputs (text, audio, and video). LBLM-AVA incorporates several key components enhancing its gesture generation capabilities, such as multimodal-to-pose embeddings, enhanced sequence-to-sequence mapping with redefined attention mechanisms, a temporal smoothing module for gesture sequence coherence, and an attention-based refinement module for enhanced realism. The model is trained on our large-scale proprietary open-source dataset Allo-AVA. LBLM-AVA achieves state-of-the-art performance in generating lifelike and contextually appropriate gestures with a 30% reduction in Fr\'echet Gesture Distance (FGD), and a 25% improvement in Fr\'echet Inception Distance compared to existing approaches.


Reinforcement Learning via Auxiliary Task Distillation

arXiv.org Artificial Intelligence

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3 \times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.


cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

arXiv.org Artificial Intelligence

An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.


iTBLS: A Dataset of Interactive Conversations Over Tabular Information

arXiv.org Artificial Intelligence

This paper introduces Interactive Tables (iTBLS), a dataset of interactive conversations situated in tables from scientific articles. This dataset is designed to facilitate human-AI collaborative problem-solving through AI-powered multi-task tabular capabilities. In contrast to prior work that models interactions as factoid QA or procedure synthesis, iTBLS broadens the scope of interactions to include mathematical reasoning, natural language manipulation, and expansion of existing tables from natural language conversation by delineating interactions into one of three tasks: interpretation, modification, or generation. Additionally, the paper presents a suite of baseline approaches to iTBLS, utilizing zero-shot prompting and parameter-efficient fine-tuning for different computing situations. We also introduce a novel multi-step approach and show how it can be leveraged in conjunction with parameter-efficient fine-tuning to achieve the state-of-the-art on iTBLS; outperforming standard parameter-efficient fine-tuning by up to 15% on interpretation, 18% on modification, and 38% on generation.


Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving?

arXiv.org Artificial Intelligence

Dense passage retrieval (DPR) is the first step in the retrieval augmented generation (RAG) paradigm for improving the performance of large language models (LLM). DPR fine-tunes pre-trained networks to enhance the alignment of the embeddings between queries and relevant textual data. A deeper understanding of DPR fine-tuning will be required to fundamentally unlock the full potential of this approach. In this work, we explore DPR-trained models mechanistically by using a combination of probing, layer activation analysis, and model editing. Our experiments show that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways to the same information. We also uncover a limitation in this training style: the internal knowledge of the pre-trained model bounds what the retrieval model can retrieve. These findings suggest a few possible directions for dense retrieval: (1) expose the DPR training process to more knowledge so more can be decentralized, (2) inject facts as decentralized representations, (3) model and incorporate knowledge uncertainty in the retrieval process, and (4) directly map internal model knowledge to a knowledge base.