Plotting

 Cheng, Lu


Conformal Prediction: A Data Perspective

arXiv.org Artificial Intelligence

The recent rapid development of well-designed and powerful machine learning (ML) models has significantly transformed our lives. However, the success of these models is often evaluated based on the accuracy of their predictions, which, while important, is not sufficient in many real-world scenarios. In high-stakes applications, it is equally critical to assess the uncertainty of model outputs. Uncertainty quantification (UQ) has long been a central problem in fields like statistics and ML. Several well-established methods, such as Bayesian inference and resampling techniques, have been widely adopted to address UQ. However, Bayesian posterior intervals are only valid if the parametric assumptions of the model are correctly specified, which may not always be the case in practical applications.


DemoShapley: Valuation of Demonstrations for In-Context Learning

arXiv.org Artificial Intelligence

Large language models (LLMs) leveraging in-context learning (ICL) have set new benchmarks in few-shot learning across various tasks without needing task-specific fine-tuning. However, extensive research has demonstrated that the effectiveness of ICL is significantly influenced by the selection and ordering of demonstrations. Considering the critical role of demonstration selection in ICL, we introduce DemoShapley which is inspired by the Data Shapley valuation theorem. This approach assesses the influence of individual demonstration instances, distinguishing between those that contribute positively and those that may hinder performance. Our findings reveal that DemoShapley not only enhances model performance in terms of accuracy and fairness but also generalizes queries from domains distinct from those of the in-context demonstrations, highlighting its versatility and effectiveness in optimizing ICL demonstration selection. Last but not least, DemoShapley demonstrates its ability to aid in identifying noisy data within the demonstration set.


LLM Uncertainty Quantification through Directional Entailment Graph and Claim Level Response Augmentation

arXiv.org Artificial Intelligence

The Large language models (LLMs) have showcased superior capabilities in sophisticated tasks across various domains, stemming from basic question-answer (QA), they are nowadays used as decision assistants or explainers for unfamiliar content. However, they are not always correct due to the data sparsity in specific domain corpus, or the model's hallucination problems. Given this, how much should we trust the responses from LLMs? This paper presents a novel way to evaluate the uncertainty that captures the directional instability, by constructing a directional graph from entailment probabilities, and we innovatively conduct Random Walk Laplacian given the asymmetric property of a constructed directed graph, then the uncertainty is aggregated by the derived eigenvalues from the Laplacian process. We also provide a way to incorporate the existing work's semantics uncertainty with our proposed layer. Besides, this paper identifies the vagueness issues in the raw response set and proposes an augmentation approach to mitigate such a problem, we conducted extensive empirical experiments and demonstrated the superiority of our proposed solutions.


ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees

arXiv.org Artificial Intelligence

Uncertainty quantification (UQ) in natural language generation (NLG) tasks remains an open challenge, exacerbated by the intricate nature of the recent large language models (LLMs). This study investigates adapting conformal prediction (CP), which can convert any heuristic measure of uncertainty into rigorous theoretical guarantees by constructing prediction sets, for black-box LLMs in open-ended NLG tasks. We propose a sampling-based uncertainty measure leveraging self-consistency and develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the design of the CP algorithm. Experimental results indicate that our uncertainty measure generally surpasses prior state-of-the-art methods. Furthermore, we calibrate the prediction sets within the model's unfixed answer distribution and achieve strict control over the correctness coverage rate across 6 LLMs on 4 free-form NLG datasets, spanning general-purpose and medical domains, while the small average set size further highlights the efficiency of our method in providing trustworthy guarantees for practical open-ended NLG applications.


Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation

arXiv.org Artificial Intelligence

Mainstream LLM research has primarily focused on enhancing their generative capabilities. However, even the most advanced LLMs experience uncertainty in their outputs, often producing varied results on different runs or when faced with minor changes in input, despite no substantial change in content. Given multiple responses from the same LLM to the same input, we advocate leveraging the LLMs' discriminative capability to reduce this generative uncertainty, aiding in identifying the correct answers. Specifically, we propose and analyze three discriminative prompts: direct, inverse, and hybrid, to explore the potential of both closed-source and open-source LLMs in self-improving their generative performance on two benchmark datasets. Our insights reveal which discriminative prompt is most promising and when to use it. To our knowledge, this is the first work to systematically analyze LLMs' discriminative capacity to address generative uncertainty.


LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing

arXiv.org Artificial Intelligence

This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload? This study focuses on the topic of LLMs assist NLP Researchers, particularly examining the effectiveness of LLM in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with "deficiency" labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers", how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers", how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.


Large Language Models for Data Annotation: A Survey

arXiv.org Artificial Intelligence

Data annotation generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.


Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey of Notions, Methods, and Challenges

arXiv.org Artificial Intelligence

While dynamic influential roles in high-stake domains traditionally steered fairness aligns with this concept by considering by human judgments, an extensive body of research has evolving dynamics over time (Li et al. 2023), long-term fairness brought attention to the challenges of bias and discrimination has a much broader scope. This umbrella term has different against marginalized groups (Mehrabi et al. 2021; facets, including sequential fairness (where sequential Cheng, Varshney, and Liu 2021). These issues are pervasive decisions impact fairness) and fairness over multiple time and manifest in different settings, including finance, steps, among others (as depicted in Fig:1). In this work, we legal (e.g., pretrial bail decisions), aviation, and healthcare aim to unify the different strands of literature on long-term practices, among others (Gohar et al. 2024; Barocas, Hardt, fairness under a common framework.


Deconstructing The Ethics of Large Language Models from Long-standing Issues to New-emerging Dilemmas

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved unparalleled success across diverse language modeling tasks in recent years. However, this progress has also intensified ethical concerns, impacting the deployment of LLMs in everyday contexts. This paper provides a comprehensive survey of ethical challenges associated with LLMs, from longstanding issues such as copyright infringement, systematic bias, and data privacy, to emerging problems like truthfulness and social norms. We critically analyze existing research aimed at understanding, examining, and mitigating these ethical risks. Our survey underscores integrating ethical standards and societal values into the development of LLMs, thereby guiding the development of responsible and ethically aligned language models.


Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions

arXiv.org Artificial Intelligence

The integration of Large Language Models (LLMs) into the healthcare domain has the potential to significantly enhance patient care and support through the development of empathetic, patient-facing chatbots. This study investigates an intriguing question Can ChatGPT respond with a greater degree of empathy than those typically offered by physicians? To answer this question, we collect a de-identified dataset of patient messages and physician responses from Mayo Clinic and generate alternative replies using ChatGPT. Our analyses incorporate novel empathy ranking evaluation (EMRank) involving both automated metrics and human assessments to gauge the empathy level of responses. Our findings indicate that LLM-powered chatbots have the potential to surpass human physicians in delivering empathetic communication, suggesting a promising avenue for enhancing patient care and reducing professional burnout. The study not only highlights the importance of empathy in patient interactions but also proposes a set of effective automatic empathy ranking metrics, paving the way for the broader adoption of LLMs in healthcare.