P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records

Kashani, Naama, Cohen, Mira, Shaham, Uri

arXiv.org Artificial Intelligence

Electronic Health Records (EHRs) serve as comprehensive digital repositories of patient health information, encompassing both structured and unstructured data (Bates et al., 2014). A thorough understanding of EHR data can significantly enhance various aspects of patient care, including disease prediction, healthcare quality improvement, and resource allocation (Shickel et al., 2018; Kim et al., 2019). However, EHR data presents unique challenges: it is often high-dimensional, multimodal, sparse, and temporal (Wu et al., 2010; Menachemi and Collum, 2011; Xiao et al., 2018). Records typically include a diverse array of modalities, such as demographics, diagnoses, procedures, medications, prescriptions, radiological images, clinical notes, and laboratory results. The data is inherently sparse, as medical events occur irregularly, and sequential, as patient histories accumulate over time. To address these complexities, many approaches employ feature selection (FS) -- the process of identifying the most informative variables from high-dimensional input to improve model performance, interpretability, and robustness (Remeseiro and Bolon-Canedo, 2019; Chandrashekar and Sahin, 2014). Yet, to the best of our knowledge, existing FS methods applied to EHRs either ignore multimodality or fail to capture temporal dynamics.
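The cost-aware side of this problem can be illustrated with a minimal greedy sketch. This is not the authors' P-CAFE algorithm: the `score` function (e.g. validation accuracy of a model restricted to the selected features), the feature costs, and the budget are all hypothetical placeholders, chosen only to show the score-gain-per-cost trade-off.

```python
# Minimal greedy cost-aware feature selection sketch (illustrative only;
# not P-CAFE). `score` and `costs` are hypothetical placeholders.

def greedy_cost_aware_fs(features, costs, score, budget):
    """Select features greedily by score gain per unit cost, under a budget."""
    selected, spent = [], 0.0
    base = score(selected)
    while True:
        best, best_ratio = None, 0.0
        for f in features:
            if f in selected or spent + costs[f] > budget:
                continue
            gain = score(selected + [f]) - base
            ratio = gain / costs[f]
            if ratio > best_ratio:
                best, best_ratio = f, ratio
        if best is None:  # nothing affordable improves the score
            break
        selected.append(best)
        spent += costs[best]
        base = score(selected)
    return selected

# Toy example: feature utilities are additive and known; costs mimic the
# idea that an MRI is far more expensive to acquire than a demographic field.
utility = {"age": 0.3, "lab_panel": 0.5, "mri": 0.6}
costs = {"age": 1.0, "lab_panel": 5.0, "mri": 50.0}
picked = greedy_cost_aware_fs(list(utility), costs,
                              lambda S: sum(utility[f] for f in S),
                              budget=10.0)
```

Under this budget the expensive MRI is never affordable, so the sketch selects the cheap demographic field first (best gain per cost) and then the lab panel.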


GuessingGame: Measuring the Informativeness of Open-Ended Questions in Large Language Models

Hutson, Dylan, Vennemeyer, Daniel, Deshmukh, Aneesh, Zhan, Justin, Jiang, Tianyu

arXiv.org Artificial Intelligence

We introduce GuessingGame, a protocol for evaluating large language models (LLMs) as strategic question-askers in open-ended, open-domain settings. A Guesser LLM identifies a hidden object by posing free-form questions to an Oracle without predefined choices or candidate lists. To measure question quality, we propose two information gain (IG) metrics: a Bayesian method that tracks belief updates over semantic concepts using LLM-scored relevance, and an entropy-based method that filters candidates via ConceptNet. Both metrics are model-agnostic and support post hoc analysis. Across 858 games with multiple models and prompting strategies, higher IG strongly predicts efficiency: a one-standard-deviation IG increase reduces expected game length by 43%. Prompting constraints guided by IG, such as enforcing question diversity, enable weaker models to significantly improve performance. These results show that question-asking in LLMs is both measurable and improvable, and crucial for interactive reasoning.
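The entropy-based IG idea can be sketched in a few lines. The candidate set and the yes/no split below are invented for illustration (the paper filters candidates via ConceptNet rather than a hand-written set): the information gain of a question is the expected reduction in Shannon entropy over the remaining candidates, assuming a uniform prior.

```python
import math

def entropy(n):
    """Shannon entropy (bits) of a uniform distribution over n candidates."""
    return math.log2(n) if n > 0 else 0.0

def info_gain(candidates, yes_set):
    """Expected entropy reduction from a yes/no question that splits
    `candidates` into `yes_set` and its complement, under a uniform prior."""
    n = len(candidates)
    n_yes = len(yes_set & candidates)
    n_no = n - n_yes
    expected = (n_yes / n) * entropy(n_yes) + (n_no / n) * entropy(n_no)
    return entropy(n) - expected

candidates = {"dog", "cat", "car", "bus"}
# "Is it an animal?" splits the four candidates evenly: a full bit of IG.
ig = info_gain(candidates, {"dog", "cat"})
```

A question that splits the candidates evenly earns the maximum one bit; a question whose answer is already predictable earns close to zero, which is exactly why high-IG questions shorten games.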


The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities

Lalai, Harsh Nishant, Shah, Raj Sanjay, Pei, Jiaxin, Varma, Sashank, Wang, Yi-Chia, Emami, Ali

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have been extensively tuned to mitigate explicit biases, yet they often exhibit subtle implicit biases rooted in their pre-training data. Rather than directly probing LLMs with human-crafted questions that may trigger guardrails, we propose studying how models behave when they proactively ask questions themselves. The 20 Questions game, a multi-turn deduction task, serves as an ideal testbed for this purpose. We systematically evaluate geographic performance disparities in entity deduction using a new dataset, Geo20Q+, consisting of both notable people and culturally significant objects (e.g., foods, landmarks, animals) from diverse regions. We test popular LLMs across two gameplay configurations (canonical 20-question and unlimited turns) and in seven languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish). Our results reveal geographic disparities: LLMs are substantially more successful at deducing entities from the Global North than the Global South, and the Global West than the Global East. While Wikipedia pageviews and pre-training corpus frequency correlate mildly with performance, they fail to fully explain these disparities. Notably, the language in which the game is played has minimal impact on performance gaps. These findings demonstrate the value of creative, free-form evaluation frameworks for uncovering subtle biases in LLMs that remain hidden in standard prompting setups. By analyzing how models initiate and pursue reasoning goals over multiple turns, we find geographic and cultural disparities embedded in their reasoning processes. We release the dataset (Geo20Q+) and code at https://sites.google.com/view/llmbias20q/home.


How Well Can Differential Privacy Be Audited in One Run?

Keinan, Amit, Shenfeld, Moshe, Ligett, Katrina

arXiv.org Artificial Intelligence

Recent methods for auditing the privacy of machine learning algorithms have improved computational efficiency by simultaneously intervening on multiple training examples in a single training run. Steinke et al. (2024) prove that one-run auditing indeed lower bounds the true privacy parameter of the audited algorithm, and give impressive empirical results. Their work leaves open the question of how precisely one-run auditing can uncover the true privacy parameter of an algorithm, and how that precision depends on the audited algorithm. In this work, we characterize the maximum achievable efficacy of one-run auditing and show that one-run auditing can only perfectly uncover the true privacy parameters of algorithms whose structure allows the effects of individual data elements to be isolated. Our characterization helps reveal how and when one-run auditing is still a promising technique for auditing real machine learning algorithms, despite these fundamental gaps.


Vision-Language Model Dialog Games for Self-Improvement

Konyushkova, Ksenia, Kaplanis, Christos, Cabi, Serkan, Denil, Misha

arXiv.org Artificial Intelligence

The increasing demand for high-quality, diverse training data poses a significant bottleneck in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a novel and scalable self-improvement framework for VLMs. Our approach leverages self-play between two agents engaged in a goal-oriented play centered around image identification. By filtering for successful game interactions, we automatically curate a high-quality dataset of interleaved images and text. We demonstrate that fine-tuning on this synthetic data leads to performance gains on downstream tasks and generalises across datasets. Moreover, as the improvements in the model lead to better game play, this procedure can be applied iteratively. This work paves the way for self-improving VLMs, with potential applications in various real-world scenarios especially when the high-quality multimodal data is scarce.


Improving Cooperation in Language Games with Bayesian Inference and the Cognitive Hierarchy

Bills, Joseph, Archibald, Christopher, Blaylock, Diego

arXiv.org Artificial Intelligence

In two-player cooperative games, agents can play together effectively when they have accurate assumptions about how their teammate will behave, but may perform poorly when these assumptions are inaccurate. In language games, failure may be due to disagreement in the understanding of either the semantics or pragmatics of an utterance. We model coarse uncertainty in semantics using a prior distribution of language models and uncertainty in pragmatics using the cognitive hierarchy, combining the two aspects into a single prior distribution over possible partner types. Fine-grained uncertainty in semantics is modeled using noise that is added to the embeddings of words in the language. To handle all forms of uncertainty we construct agents that learn the behavior of their partner using Bayesian inference and use this information to maximize the expected value of a heuristic function. We test this approach by constructing Bayesian agents for the game of Codenames, and show that they perform better in experiments where semantics is uncertain.
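The core loop of Bayesian partner modeling can be sketched minimally. The two partner types and their likelihood functions below are invented toy stand-ins (the paper's type space combines a prior over language models with cognitive-hierarchy levels): observe the partner's move, update a posterior over types by Bayes' rule, then pick the action maximizing the heuristic's expected value under that posterior.

```python
def update_posterior(prior, likelihoods, observation):
    """Bayes' rule over partner types: P(type | obs) ∝ P(obs | type) P(type)."""
    post = {t: prior[t] * likelihoods[t](observation) for t in prior}
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

def expected_value(posterior, heuristic, action):
    """Expected heuristic value of an action under partner-type uncertainty."""
    return sum(p * heuristic(action, t) for t, p in posterior.items())

# Toy: two partner types differing in how literally they interpret clues.
prior = {"literal": 0.5, "pragmatic": 0.5}
likelihoods = {
    "literal":   lambda obs: 0.9 if obs == "literal_move" else 0.1,
    "pragmatic": lambda obs: 0.2 if obs == "literal_move" else 0.8,
}
# After seeing one literal-looking move, belief shifts toward "literal".
posterior = update_posterior(prior, likelihoods, "literal_move")
```

Repeating the update over successive turns concentrates the posterior on the type that best explains the partner's play, which is what lets the agent adapt its clue-giving or guessing.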


Codenames as a Benchmark for Large Language Models

Stephenson, Matthew, Sidji, Matthew, Ronval, Benoît

arXiv.org Artificial Intelligence

In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still suffer in lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our results indicate that while certain LLMs perform better than others overall, different models exhibit varying emergent behaviours during gameplay and excel at specific roles. We also evaluate the performance of different combinations of LLMs when playing cooperatively together, demonstrating that LLM agents are more generalisable to a wider range of teammates than prior techniques.


PIANIST: Learning Partially Observable World Models with LLMs for Multi-Agent Decision Making

Light, Jonathan, Xing, Sixue, Liu, Yuanzhe, Chen, Weiqin, Cai, Min, Chen, Xiusi, Wang, Guanzhi, Cheng, Wei, Yue, Yisong, Hu, Ziniu

arXiv.org Artificial Intelligence

Effective extraction of the world knowledge in LLMs for complex decision-making tasks remains a challenge. We propose a framework PIANIST for decomposing the world model into seven intuitive components conducive to zero-shot LLM generation. Given only the natural language description of the game and how input observations are formatted, our method can generate a working world model for fast and efficient MCTS simulation. We show that our method works well on two different games that challenge the planning and decision making skills of the agent for both language and non-language based action taking, without any training on domain-specific training data or explicitly defined world model.


Communicate to Play: Pragmatic Reasoning for Efficient Cross-Cultural Communication in Codenames

White, Isadora, Pandey, Sashrika, Pan, Michelle

arXiv.org Artificial Intelligence

Cultural differences in common ground may result in pragmatic failure and misunderstandings during communication. We develop our method Rational Speech Acts for Cross-Cultural Communication (RSA+C3) to resolve cross-cultural differences in common ground. To measure the success of our method, we study RSA+C3 in the collaborative referential game of Codenames Duet and show that our method successfully improves collaboration between simulated players of different cultures. Our contributions are threefold: (1) creating Codenames players using contrastive learning of an embedding space and LLM prompting that are aligned with human patterns of play, (2) studying culturally induced differences in common ground reflected in our trained models, and (3) demonstrating that our method RSA+C3 can ease cross-cultural communication in gameplay by inferring sociocultural context from interaction. Our code is publicly available at github.com/icwhite/codenames.
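The Rational Speech Acts backbone of RSA+C3 can be illustrated with a standard minimal RSA sketch. The lexicon below is a toy (the paper's players use contrastive embeddings and LLM prompting, and the C3 extension additionally infers sociocultural context from interaction): a pragmatic speaker scores each true utterance by how sharply a literal listener would recover the intended meaning from it.

```python
def _normalize(d):
    """Normalize a dict of non-negative scores into a distribution."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

def rsa_speaker(lexicon, meaning, alpha=1.0):
    """S1(utterance | meaning) ∝ L0(meaning | utterance) ** alpha, where the
    literal listener L0 normalizes the lexicon row for each utterance."""
    l0 = {u: _normalize(ms) for u, ms in lexicon.items()}
    scores = {u: l0[u].get(meaning, 0.0) ** alpha
              for u in lexicon if l0[u].get(meaning, 0.0) > 0}
    return _normalize(scores)

# Toy lexicon: which meanings each utterance is literally true of.
lexicon = {"hat":    {"rare_hat": 1.0, "common_hat": 1.0},
           "fedora": {"rare_hat": 1.0}}
# Referring to the rare hat, a pragmatic speaker prefers the specific word,
# since "hat" leaves the literal listener split between two referents.
dist = rsa_speaker(lexicon, "rare_hat")
```

Cross-cultural failure in this framing corresponds to the two players holding different lexicons (different common ground), which is the mismatch RSA+C3 is designed to detect and repair from interaction.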


Evaluating Dialect Robustness of Language Models via Conversation Understanding

Srirag, Dipankar, Joshi, Aditya

arXiv.org Artificial Intelligence

With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably for different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English-language conversations (in US English or Indian English) between humans who play the word-guessing game of `taboo'. We formulate two evaluative tasks: target word prediction (TWP) (i.e., predict the masked target word in a conversation) and target word selection (TWS) (i.e., select the most likely masked target word in a conversation, from among a set of candidate words). Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with the USEng and IndEng subsets. We add two subsets: AITrans (where dialectal information is removed from IndEng) and AIGen (where LLMs are prompted to generate conversations). Our evaluation uses pre-trained and fine-tuned versions of two closed-source (GPT-4/3.5) and two open-source LLMs (Mistral and Gemma). LLMs perform significantly better for US English than Indian English for both TWP and TWS, for all settings. While GPT-based models perform the best, the comparatively smaller models work more equitably for short conversations (<8 turns). Our results on AIGen and AITrans (the best- and worst-performing subsets, respectively) show that LLMs may learn a dialect of their own based on the composition of the training data, and that dialect robustness is indeed a challenging task. Our evaluation methodology exhibits a novel way to examine attributes of language models using pre-existing dialogue datasets.