Liu, Anqi
Conformal Linguistic Calibration: Trading-off between Factuality and Specificity
Jiang, Zhengping, Liu, Anqi, Van Durme, Benjamin
Language model outputs are not always reliable; this prompts research into methods for adapting model responses based on uncertainty. Common approaches include: \emph{abstention}, where models refrain from generating responses when uncertain; and \emph{linguistic calibration}, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unifying view of both approaches, Conformal Linguistic Calibration (CLC), reinterpreting linguistic calibration as answer set prediction. We begin by presenting a unified framework that connects abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation that allows for controlling the level of imprecision in model responses. Experimental results show that our method produces calibrated outputs with conformal guarantees on factual accuracy. Furthermore, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
On the challenges of detecting MCI using EEG in the wild
Mishra, Aayush, Joffe, David, Telidevara, Sankara Surendra, Oakley, David S, Liu, Anqi
Recent studies have shown promising results in the detection of Mild Cognitive Impairment (MCI) using easily accessible Electroencephalogram (EEG) data which would help administer early and effective treatment for dementia patients. However, the reliability and practicality of such systems remains unclear. In this work, we investigate the potential limitations and challenges in developing a robust MCI detection method using two contrasting datasets: 1) CAUEEG, collected and annotated by expert neurologists in controlled settings and 2) GENEEG, a new dataset collected and annotated in general practice clinics, a setting where routine MCI diagnoses are typically made. We find that training on small datasets, as is done by most previous works, tends to produce high variance models that make overconfident predictions, and are unreliable in practice. Additionally, distribution shifts between datasets make cross-domain generalization challenging. Finally, we show that MCI detection using EEG may suffer from fundamental limitations because of the overlapping nature of feature distributions with control groups. We call for more effort in high-quality data collection in actionable settings (like general practice clinics) to make progress towards this salient goal of non-invasive MCI detection.
Training-Aware Risk Control for Intensity Modulated Radiation Therapies Quality Assurance with Conformal Prediction
He, Kevin, Adam, David, Han-Oh, Sarah, Liu, Anqi
Measurement quality assurance (QA) practices play a key role in the safe use of Intensity Modulated Radiation Therapies (IMRT) for cancer treatment. These practices have reduced measurement-based IMRT QA failure below 1%. However, these practices are time and labor intensive which can lead to delays in patient care. In this study, we examine how conformal prediction methodologies can be used to robustly triage plans. We propose a new training-aware conformal risk control method by combining the benefit of conformal risk control and conformal training. We incorporate the decision making thresholds based on the gamma passing rate, along with the risk functions used in clinical evaluation, into the design of the risk control framework. Our method achieves high sensitivity and specificity and significantly reduces the number of plans needing measurement without generating a huge confidence interval. Our results demonstrate the validity and applicability of conformal prediction methods for improving efficiency and reducing the workload of the IMRT QA process.
Interruption Handling for Conversational Robots
Cao, Shiye, Moon, Jiwon, Mahmood, Amama, Antony, Victor Nikhil, Xiao, Ziang, Liu, Anqi, Huang, Chien-Ming
Interruptions, a fundamental component of human communication, can enhance the dynamism and effectiveness of conversations, but only when effectively managed by all parties involved. Despite advancements in robotic systems, state-of-the-art systems still have limited capabilities in handling user-initiated interruptions in real-time. Prior research has primarily focused on post hoc analysis of interruptions. To address this gap, we present a system that detects user-initiated interruptions and manages them in real-time based on the interrupter's intent (i.e., cooperative agreement, cooperative assistance, cooperative clarification, or disruptive interruption). The system was designed based on interaction patterns identified from human-human interaction data. We integrated our system into an LLM-powered social robot and validated its effectiveness through a timed decision-making task and a contentious discussion task with 21 participants. Our system successfully handled 93.69% (n=104/111) of user-initiated interruptions. We discuss our learnings and their implications for designing interruption-handling behaviors in conversational robots.
Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation
Guo, Yihong, Wang, Yixuan, Shi, Yuanyuan, Xu, Pan, Liu, Anqi
Training a policy in a source domain for deployment in the target domain under a dynamics shift can be challenging, often resulting in performance degradation. Previous work tackles this challenge by training on the source domain with modified rewards derived by matching distributions between the source and the target optimal trajectories. However, pure modified rewards only ensure the behavior of the learned policy in the source domain resembles trajectories produced by the target optimal policies, which does not guarantee optimal performance when the learned policy is actually deployed to the target domain. In this work, we propose to utilize imitation learning to transfer the policy learned from the reward modification to the target domain so that the new policy can generate the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO) by applying a reward augmented estimator for the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption regarding the dynamics shift to justify the motivation of our method. Empirically, our method outperforms the pure modified reward method without imitation learning and also outperforms other baselines in benchmark off-dynamics environments.
Variance-Aware Linear UCB with Deep Representation for Neural Contextual Bandits
Bui, Ha Manh, Mallada, Enrique, Liu, Anqi
By leveraging the representation power of deep neural networks, neural upper confidence bound (UCB) algorithms have shown success in contextual bandits. To further balance the exploration and exploitation, we propose Neural-$\sigma^2$-LinearUCB, a variance-aware algorithm that utilizes $\sigma^2_t$, i.e., an upper bound of the reward noise variance at round $t$, to enhance the uncertainty quantification quality of the UCB, resulting in a regret performance improvement. We provide an oracle version for our algorithm characterized by an oracle variance upper bound $\sigma^2_t$ and a practical version with a novel estimation for this variance bound. Theoretically, we provide rigorous regret analysis for both versions and prove that our oracle algorithm achieves a better regret guarantee than other neural-UCB algorithms in the neural contextual bandits setting. Empirically, our practical method enjoys a similar computational efficiency, while outperforming state-of-the-art techniques by having a better calibration and lower regret across multiple standard settings, including on the synthetic, UCI, MNIST, and CIFAR-10 datasets.
Core: Robust Factual Precision Scoring with Informative Sub-Claim Identification
Jiang, Zhengping, Zhang, Jingyu, Weir, Nathaniel, Ebner, Seth, Wanner, Miriam, Sanders, Kate, Khashabi, Daniel, Liu, Anqi, Van Durme, Benjamin
Hallucinations -- the generation of untrue claims -- pose a challenge to the application of large language models (LLMs) [1] thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore [2], can be manipulated by adding obvious or repetitive claims to artificially inflate scores. We expand the FActScore dataset to design and analyze factual precision metrics, demonstrating that models can be trained to achieve high scores under existing metrics through exploiting the issues we identify. This motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. Metrics augmented by Core are substantially more robust as shown in head-to-head comparisons. We release an evaluation framework supporting the modular use of Core (https://github.com/zipJiang/Core) and various decomposition strategies, and we suggest its adoption by the LLM community. [1] Hong et al., "The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models", arXiv:2404.05904v2 [cs.CL]. [2] Min et al., "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation", arXiv:2305.14251v2 [cs.CL].
SurgicAI: A Fine-grained Platform for Data Collection and Benchmarking in Surgical Policy Learning
Wu, Jin, Zhou, Haoying, Kazanzides, Peter, Munawar, Adnan, Liu, Anqi
Despite advancements in robotic-assisted surgery, automating complex tasks like suturing remain challenging due to the need for adaptability and precision. Learning-based approaches, particularly reinforcement learning (RL) and imitation learning (IL), require realistic simulation environments for efficient data collection. However, current platforms often include only relatively simple, non-dexterous manipulations and lack the flexibility required for effective learning and generalization. We introduce SurgicAI, a novel platform for development and benchmarking addressing these challenges by providing the flexibility to accommodate both modular subtasks and more importantly task decomposition in RL-based surgical robotics. Compatible with the da Vinci Surgical System, SurgicAI offers a standardized pipeline for collecting and utilizing expert demonstrations. It supports deployment of multiple RL and IL approaches, and the training of both singular and compositional subtasks in suturing scenarios, featuring high dexterity and modularization. Meanwhile, SurgicAI sets clear metrics and benchmarks for the assessment of learned policies. We implemented and evaluated multiple RL and IL algorithms on SurgicAI. Our detailed benchmark analysis underscores SurgicAI's potential to advance policy learning in surgical robotics. Details: \url{https://github.com/surgical-robotics-ai/SurgicAI
RORA: Robust Free-Text Rationale Evaluation
Jiang, Zhengping, Lu, Yining, Chen, Hanjie, Khashabi, Daniel, Van Durme, Benjamin, Liu, Anqi
Free-text rationales play a pivotal role in explainable NLP, bridging the knowledge and reasoning gaps behind a model's decision-making. However, due to the diversity of potential reasoning paths and a corresponding lack of definitive ground truth, their evaluation remains a challenge. Existing evaluation metrics rely on the degree to which a rationale supports a target label, but we find these fall short in evaluating rationales that inadvertently leak the labels. To address this problem, we propose RORA, a Robust free-text Rationale evaluation against label leakage. RORA quantifies the new information supplied by a rationale to justify the label. This is achieved by assessing the conditional V-information \citep{hewitt-etal-2021-conditional} with a predictive family robust against leaky features that can be exploited by a small model. RORA consistently outperforms existing approaches in evaluating human-written, synthetic, or model-generated rationales, particularly demonstrating robustness against label leakage. We also show that RORA aligns well with human judgment, providing a more reliable and accurate measurement across diverse free-text rationales.
Conformal Validity Guarantees Exist for Any Data Distribution (and How to Find Them)
Prinster, Drew, Stanton, Samuel, Liu, Anqi, Saria, Suchi
As artificial intelligence (AI) / machine learning (ML) gain widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when such systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction is a promising approach to uncertainty and risk quantification, but prior variants' validity guarantees have assumed some form of ``quasi-exchangeability'' on the data distribution, thereby excluding many types of sequential shifts. In this paper we prove that conformal prediction can theoretically be extended to \textit{any} joint data distribution, not just exchangeable or quasi-exchangeable ones. Although the most general case is exceedingly impractical to compute, for concrete practical applications we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of AI/ML-agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.