Safavi, Tara


Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models

arXiv.org Artificial Intelligence

Accurate and interpretable user satisfaction estimation (USE) is critical for understanding, evaluating, and continuously improving conversational systems. Users express their satisfaction or dissatisfaction with diverse conversational patterns in both general-purpose (ChatGPT and Bing Copilot) and task-oriented (customer service chatbot) conversational systems. Existing approaches based on featurized ML models or text embeddings fall short in extracting generalizable patterns and are hard to interpret. In this work, we show that LLMs can extract interpretable signals of user satisfaction from their natural language utterances more effectively than embedding-based approaches. Moreover, an LLM can be tailored for USE via an iterative prompting framework using supervision from labeled examples. The resulting method, Supervised Prompting for User satisfaction Rubrics (SPUR), not only has higher accuracy but is more interpretable as it scores user satisfaction via learned rubrics with a detailed breakdown.


The Use of Generative Search Engines for Knowledge Work and Complex Tasks

arXiv.org Artificial Intelligence

Until recently, search engines were the predominant method for people to access online information. The recent emergence of large language models (LLMs) has given machines new capabilities, such as the ability to generate new digital artifacts like text, images, and code, resulting in a new tool, the generative search engine, which combines the capabilities of LLMs with a traditional search engine. Through an empirical analysis of Bing Copilot (Bing Chat), one of the first publicly available generative search engines, we analyze the types and complexity of tasks that people use Bing Copilot for compared to Bing Search. Findings indicate that people use the generative search engine for more knowledge work tasks that are higher in cognitive complexity than were commonly done with a traditional search engine.


TnT-LLM: Text Mining at Scale with Large Language Models

arXiv.org Artificial Intelligence

Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.


Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies

arXiv.org Artificial Intelligence

Log data can reveal valuable information about how users interact with Web search services, what they want, and how satisfied they are. However, analyzing user intents in log data is not easy, especially for emerging forms of Web search such as AI-driven chat. To understand user intents from log data, we need a way to label them with meaningful categories that capture their diversity and dynamics. Existing methods rely on manual or machine-learned labeling, which are either expensive or inflexible for large and dynamic datasets. We propose a novel solution using large language models (LLMs), which can generate rich and relevant concepts, descriptions, and examples for user intents. However, using LLMs to generate a user intent taxonomy and apply it for log analysis can be problematic for two main reasons: (1) such a taxonomy is not externally validated; and (2) there may be an undesirable feedback loop. To address this, we propose a new methodology with human experts and assessors to verify the quality of the LLM-generated taxonomy. We also present an end-to-end pipeline that uses an LLM with human-in-the-loop to produce, refine, and apply labels for user intent analysis in log data. We demonstrate its effectiveness by uncovering new insights into user intents from search and chat logs from the Microsoft Bing commercial search engine. The proposed work's novelty stems from the method for generating purpose-driven user intent taxonomies with strong validation. This method not only helps remove methodological and practical bottlenecks from intent-focused research, but also provides a new framework for generating, validating, and applying other kinds of taxonomies in a scalable and adaptable way with minimal human effort.


PEARL: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers

arXiv.org Artificial Intelligence

Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author's communication style and specialized knowledge. In this paper, we address this challenge by proposing PEARL, a retrieval-augmented LLM writing assistant personalized with a generation-calibrated retriever. Our retriever is trained to select historic user-authored documents for prompt augmentation, such that they are likely to best personalize LLM generations for a user request. We propose two key novelties for training our retriever: 1) A training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and 2) A scale-calibrating KL-divergence objective that ensures that our retriever closely tracks the benefit of a document for personalized generation. We demonstrate the effectiveness of PEARL in generating personalized workplace social media posts and Reddit comments. Finally, we showcase the potential of a generation-calibrated retriever to double as a performance predictor and further improve low-quality generations via LLM chaining.


S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs

arXiv.org Artificial Intelligence

The traditional Dialogue State Tracking (DST) problem aims to track user preferences and intents in user-agent conversations. While sufficient for task-oriented dialogue systems supporting narrow domain applications, the advent of Large Language Model (LLM)-based chat systems has introduced many real-world intricacies in open-domain dialogues. These intricacies manifest in the form of increased complexity in contextual interactions, extended dialogue sessions encompassing a diverse array of topics, and more frequent contextual shifts. To handle these intricacies arising from evolving LLM-based chat systems, we propose joint dialogue segmentation and state tracking per segment in open-domain dialogue systems. Assuming a zero-shot setting appropriate to a true open-domain dialogue system, we propose S3-DST, a structured prompting technique that harnesses Pre-Analytical Recollection, a novel grounding mechanism we designed for improving long context tracking. To demonstrate the efficacy of our proposed approach in joint segmentation and state tracking, we evaluate S3-DST on a proprietary anonymized open-domain dialogue dataset, as well as publicly available DST and segmentation datasets. Across all datasets and settings, S3-DST consistently outperforms the state-of-the-art, demonstrating its potency and robustness for the next generation of LLM-based chat systems.


Relational world knowledge representation in contextual language models: A review

arXiv.org Artificial Intelligence

Relational knowledge bases (KBs) are established tools for world knowledge representation in machines. While they are advantageous for their precision and interpretability, they usually sacrifice some data modeling flexibility for these advantages because they adhere to a manually engineered schema. In this review, we take a natural language processing perspective to the limitations of KBs, examining how they may be addressed in part by training neural contextual language models (LMs) to internalize and express relational knowledge in free-text form. We propose a novel taxonomy for relational knowledge representation in contextual LMs based on the level of KB supervision provided, considering both works that probe LMs for implicit relational knowledge acquired during self-supervised pretraining on unstructured text alone, and works that explicitly supervise LMs at the level of KB entities and/or relations. We conclude that LMs and KBs are complementary representation tools, as KBs provide a high standard of factual precision which can in turn be flexibly and expressively modeled by LMs, and provide suggestions for future research in this direction.


Generating Negative Commonsense Knowledge

arXiv.org Artificial Intelligence

The acquisition of commonsense knowledge is an important open challenge in artificial intelligence. In this work-in-progress paper, we study the task of automatically augmenting commonsense knowledge bases (KBs) with novel statements. We show empirically that obtaining meaningful negative samples for the completion task is nontrivial, and propose NegatER, a framework for generating negative commonsense knowledge, to address this challenge. In our evaluation we demonstrate the intrinsic value and extrinsic utility of the knowledge generated by NegatER, opening up new avenues for future research in this direction.


CoDEx: A Comprehensive Knowledge Graph Completion Benchmark

arXiv.org Artificial Intelligence

We present CoDEx, a set of knowledge graph Completion Datasets Extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from a popular link prediction benchmark by showing that CoDEx covers more diverse and interpretable content, and contains fewer relation patterns that can be covered by trivial frequency-based rules. Data, code, and pretrained models are available at https://github.com/tsafavi/codex.


Graph Summarization Methods and Applications: A Survey

arXiv.org Artificial Intelligence

While advances in computing resources have made processing enormous amounts of data possible, human ability to identify patterns in such data has not scaled accordingly. Efficient computational methods for condensing and simplifying data are thus becoming vital for extracting actionable insights. In particular, while data summarization techniques have been studied extensively, only recently has summarizing interconnected data, or graphs, become popular. This survey is a structured, comprehensive overview of the state-of-the-art methods for summarizing graph data. We first broach the motivation behind, and the challenges of, graph summarization. We then categorize summarization approaches by the type of graphs taken as input and further organize each category by core methodology. Finally, we discuss applications of summarization on real-world graphs and conclude by describing some open problems in the field.