Shah, Chirag
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets
Dammu, Preetam Prabhu Srikar, Naidu, Himanshu, Shah, Chirag
As question answering (QA) systems advance alongside the rapid evolution of foundation models, the need for robust, adaptable, and large-scale evaluation benchmarks becomes increasingly critical. Traditional QA benchmarks are often static and publicly available, making them susceptible to data contamination and memorization by large language models (LLMs). Consequently, static benchmarks may overestimate model generalization and hinder a reliable assessment of real-world performance. In this work, we introduce Dynamic-KGQA, a scalable framework for generating adaptive QA datasets from knowledge graphs (KGs), designed to mitigate memorization risks while maintaining statistical consistency across iterations. Unlike fixed benchmarks, Dynamic-KGQA generates a new dataset variant on every run while preserving the underlying distribution, enabling fair and reproducible evaluations. Furthermore, our framework provides fine-grained control over dataset characteristics, supporting domain-specific and topic-focused QA dataset generation. Additionally, Dynamic-KGQA produces compact, semantically coherent subgraphs that facilitate both training and evaluation of KGQA models, enhancing their ability to leverage structured knowledge effectively. To align with existing evaluation protocols, we also provide static large-scale train/test/validation splits, ensuring comparability with prior methods. By introducing a dynamic, customizable benchmarking paradigm, Dynamic-KGQA enables a more rigorous and adaptable evaluation of QA systems.
Feedback-Aware Monte Carlo Tree Search for Efficient Information Seeking in Goal-Oriented Conversations
Chopra, Harshita, Shah, Chirag
The ability to identify and acquire missing information is a critical component of effective decision making and problem solving. With the rise of conversational artificial intelligence (AI) systems, strategically formulating information-seeking questions becomes crucial and demands efficient methods to guide the search process. We introduce a novel approach to adaptive question-asking through a combination of Large Language Models (LLM) for generating questions that maximize information gain, Monte Carlo Tree Search (MCTS) for constructing and leveraging a decision tree across multiple samples, and a hierarchical feedback mechanism to learn from past interactions. We present two key innovations: (1) an adaptive MCTS algorithm that balances exploration and exploitation for efficient search over potential questions; and (2) a clustering-based feedback algorithm that leverages prior experience to guide future interactions. Each incoming sample is assigned to a cluster based on its semantic similarity with previously observed samples. Our UCT (Upper Confidence bound for Trees) formulation selects optimal questions by combining expected rewards, a function of information gain, with a cluster-specific bonus that decays with depth, to emphasize the importance of early-stage questions that have proven effective for narrowing the solution space in similar samples. Experiments across three domains, including medical diagnosis and troubleshooting, demonstrate that our method leads to an average of 12% improvement in success rates and a 10x reduction in the average number of LLM calls made per conversation for the search process, in comparison to the state of the art.
Efficient and Responsible Adaptation of Large Language Models for Robust and Equitable Top-k Recommendations
Kaur, Kirandeep, Chadha, Manya, Gupta, Vinayak, Shah, Chirag
Conventional recommendation systems (RSs) are typically optimized to enhance performance metrics uniformly across all training samples, inadvertently overlooking the needs of diverse user populations. The performance disparity among various populations can harm the model's robustness to sub-populations due to the varying user properties. While large language models (LLMs) show promise in enhancing RS performance, their practical applicability is hindered by high costs, inference latency, and degraded performance on long user queries. To address these challenges, we propose a hybrid task allocation framework designed to promote social good by equitably serving all user groups. By adopting a two-phase approach, we promote a strategic assignment of tasks for efficient and responsible adaptation of LLMs. Our strategy works by first identifying the weak and inactive users that receive a suboptimal ranking performance by RSs. Next, we use an in-context learning approach for such users, wherein each user interaction history is contextualized as a distinct ranking task. We evaluate our hybrid framework by incorporating eight different recommendation algorithms and three different LLMs -- both open and close-sourced. Our results on three real-world datasets show a significant reduction in weak users and improved robustness to subpopulations without disproportionately escalating costs.
Agents Are Not Enough
Shah, Chirag, White, Ryen W.
In the midst of the growing integration of Artificial Intelligence (AI) into various aspects of our lives, agents are experiencing a resurgence. These autonomous programs that act on behalf of humans are neither new nor exclusive to the mainstream AI movement. By exploring past incarnations of agents, we can understand what has been done previously, what worked, and more importantly, what did not pan out and why. This understanding lets us to examine what distinguishes the current focus on agents. While generative AI is appealing, this technology alone is insufficient to make new generations of agents more successful. To make the current wave of agents effective and sustainable, we envision an ecosystem that includes not only agents but also Sims, which represent user preferences and behaviors, as well as Assistants, which directly interact with the user and coordinate the execution of user tasks with the help of the agents.
How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
Kharchenko, Julia, Roosta, Tanya, Chadha, Aman, Shah, Chirag
Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in a way that pleases them, including by adhering to their values. However, humans come from diverse cultures with different values. It is critical to understand whether LLMs showcase different values to the user based on the stereotypical values of a user's known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions -- a quantifiable way of representing the values of a country. Throughout each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country to analyze the consistency in the LLMs' cultural understanding. Through our analysis of the responses, we found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values, but will not always uphold the values when giving advice, and fail to understand the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and the framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.
Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Amirizaniani, Maryam, Martin, Elias, Sivachenko, Maryna, Mashhadi, Afra, Shah, Chirag
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts, which is vital for guiding one's own thought processes. Although large language models (LLMs) excel in tasks such as summarization, question answering, and translation, they still face challenges with ToM reasoning, especially in open-ended questions. Despite advancements, the extent to which LLMs truly understand ToM reasoning and how closely it aligns with human ToM reasoning remains inadequately explored in open-ended scenarios. Motivated by this gap, we assess the abilities of LLMs to perceive and integrate human intentions and emotions into their ToM reasoning processes within open-ended questions. Our study utilizes posts from Reddit's ChangeMyView platform, which demands nuanced social reasoning to craft persuasive responses. Our analysis, comparing semantic similarity and lexical overlap metrics between responses generated by humans and LLMs, reveals clear disparities in ToM reasoning capabilities in open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt tuning method that incorporates human intentions and emotions, resulting in improvements in ToM reasoning performance. However, despite these improvements, the enhancement still falls short of fully achieving human-like reasoning. This research highlights the deficiencies in LLMs' social reasoning and demonstrates how integrating human intentions and emotions can boost their effectiveness.
Panmodal Information Interaction
Shah, Chirag, White, Ryen W.
The emergence of generative artificial intelligence (GenAI) is transforming information interaction. For decades, search engines such as Google and Bing have been the primary means of locating relevant information for the general population. They have provided search results in the same standard format (the so-called "10 blue links"). The recent ability to chat via natural language with AI-based agents and have GenAI automatically synthesize answers in real-time (grounded in top-ranked results) is changing how people interact with and consume information at massive scale. These two information interaction modalities (traditional search and AI-powered chat) coexist in current search engines, either loosely coupled (e.g., as separate options/tabs) or tightly coupled (e.g., integrated as a chat answer embedded directly within a traditional search result page). We believe that the existence of these two different modalities, and potentially many others, is creating an opportunity to re-imagine the search experience, capitalize on the strengths of many modalities, and develop systems and strategies to support seamless flow between them. We refer to these as panmodal experiences. Unlike monomodal experiences, where only one modality is available and/or used for the task at hand, panmodal experiences make multiple modalities available to users (multimodal), directly support transitions between modalities (crossmodal), and seamlessly combine modalities to tailor task assistance (transmodal). While our focus is search and chat, with learnings from insights from a survey of over 100 individuals who have recently performed common tasks on these two modalities, we also present a more general vision for the future of information interaction using multiple modalities and the emergent capabilities of GenAI.
The Use of Generative Search Engines for Knowledge Work and Complex Tasks
Suri, Siddharth, Counts, Scott, Wang, Leijie, Chen, Chacha, Wan, Mengting, Safavi, Tara, Neville, Jennifer, Shah, Chirag, White, Ryen W., Andersen, Reid, Buscher, Georg, Manivannan, Sathish, Rangan, Nagu, Yang, Longqi
Until recently, search engines were the predominant method for people to access online information. The recent emergence of large language models (LLMs) has given machines new capabilities such as the ability to generate new digital artifacts like text, images, code etc., resulting in a new tool, a generative search engine, which combines the capabilities of LLMs with a traditional search engine. Through the empirical analysis of Bing Copilot (Bing Chat), one of the first publicly available generative search engines, we analyze the types and complexity of tasks that people use Bing Copilot for compared to Bing Search. Findings indicate that people use the generative search engine for more knowledge work tasks that are higher in cognitive complexity than were commonly done with a traditional search engine.
TnT-LLM: Text Mining at Scale with Large Language Models
Wan, Mengting, Safavi, Tara, Jauhar, Sujay Kumar, Kim, Yujin, Counts, Scott, Neville, Jennifer, Suri, Siddharth, Shah, Chirag, White, Ryen W, Yang, Longqi, Andersen, Reid, Buscher, Georg, Joshi, Dhruv, Rangan, Nagu
Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
Developing a Framework for Auditing Large Language Models Using Human-in-the-Loop
Amirizaniani, Maryam, Yao, Jihan, Lavergne, Adrian, Okada, Elizabeth Snell, Chadha, Aman, Roosta, Tanya, Shah, Chirag
As LLMs become more pervasive across various users and scenarios, identifying potential issues when using these models becomes essential. Examples include bias, inconsistencies, and hallucination. Although auditing the LLM for these problems is desirable, it is far from being easy or solved. An effective method is to probe the LLM using different versions of the same question. This could expose inconsistencies in its knowledge or operation, indicating potential for bias or hallucination. However, to operationalize this auditing method at scale, we need an approach to create those probes reliably and automatically. In this paper we propose an automatic and scalable solution, where one uses a different LLM along with human-in-the-loop. This approach offers verifiability and transparency, while avoiding circular reliance on the same LLMs, and increasing scientific rigor and generalizability. Specifically, we present a novel methodology with two phases of verification using humans: standardized evaluation criteria to verify responses, and a structured prompt template to generate desired probes. Experiments on a set of questions from TruthfulQA dataset show that we can generate a reliable set of probes from one LLM that can be used to audit inconsistencies in a different LLM. The criteria for generating and applying auditing probes is generalizable to various LLMs regardless of the underlying structure or training mechanism.