Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Gupta, Raavi, Panicker, Pranav Hari, Bhatia, Sumit, Ramakrishnan, Ganesh
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors, where there is no access to model weights and no option to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or resources are constrained, typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets covering both factual text generation and open-ended generation shows that CONFACTCHECK detects hallucinated facts efficiently, using fewer resources and achieving higher accuracy scores than existing baselines that operate under similar conditions. Our code is available here.
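As a rough illustration of the consistency intuition described above (not the authors' exact pipeline), the sketch below re-asks the model about each factual probe drawn from its own output and flags probes whose fresh answers disagree with the originally asserted answer. `query_llm` and the lexical agreement check are assumed stand-ins.

```python
# Hypothetical sketch of consistency-based hallucination flagging.
# `query_llm` is a stub for any vendor chat-completion API; the real
# CONFACTCHECK method may build and compare probes differently.
from difflib import SequenceMatcher

def query_llm(prompt: str, model: str = "model-a") -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError("wire up an LLM client here")

def agree(a: str, b: str, threshold: float = 0.8) -> bool:
    # Cheap lexical agreement; an NLI or embedding model could replace this.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def check_facts(probes: dict[str, str]) -> dict[str, bool]:
    """probes maps a probe question to the answer asserted in the generated text.
    Returns question -> True when the fact looks hallucinated (inconsistent)."""
    flags = {}
    for question, original_answer in probes.items():
        same_model = query_llm(question)                    # re-probe the same LLM
        cross_model = query_llm(question, model="model-b")  # optional second LLM
        flags[question] = not (agree(original_answer, same_model)
                               and agree(original_answer, cross_model))
    return flags
```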
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- South America > Argentina (0.05)
- (9 more...)
- Research Report (1.00)
- Overview > Fact Book (0.43)
Adobe Summit Concierge Evaluation with Human in the Loop
Chen, Yiru, Fang, Sally, Harsha, Sai Sree, Luo, Dan, Muppala, Vaishnavi, Wu, Fei, Jiang, Shun, Qian, Kun, Li, Yunyao
Generative AI assistants offer significant potential to enhance productivity, streamline information access, and improve user experience in enterprise contexts. In this work, we present Summit Concierge, a domain-specific AI assistant developed for Adobe Summit. The assistant handles a wide range of event-related queries and operates under real-world constraints such as data sparsity, quality assurance, and rapid deployment. To address these challenges, we adopt a human-in-the-loop development workflow that combines prompt engineering, retrieval grounding, and lightweight human validation. We describe the system architecture, development process, and real-world deployment outcomes. Our experience shows that agile, feedback-driven development enables scalable and reliable AI assistants, even in cold-start scenarios.
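A minimal, purely hypothetical sketch (not Adobe's implementation) of the workflow the abstract describes: answers are grounded in retrieved event documents, and weakly grounded responses are routed through a lightweight human-validation gate. `retrieve`, `query_llm`, and the grounding heuristic are assumed stand-ins.

```python
# Hypothetical retrieval-grounded assistant with a human-review fallback.
def answer(query: str, retrieve, query_llm, review_queue: list) -> str:
    docs = retrieve(query)  # e.g., top-k snippets from event FAQs and schedules
    prompt = (
        "Answer the attendee's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(docs) + f"\n\nQuestion: {query}"
    )
    reply = query_llm(prompt)
    if not docs:                          # weak grounding -> route to a human
        review_queue.append((query, reply))
        return "A team member will follow up with you shortly."
    return reply
```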
- Overview (0.48)
- Research Report (0.40)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
Yuan, Weikang, Song, Kaisong, Jiang, Zhuoren, Cao, Junjie, Zhang, Yujie, Lin, Jun, Kuang, Kun, Zhang, Ji, Liu, Xiaozhong
Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. Rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across these two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and a 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
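To make the clarification dimension concrete, here is a toy recall metric in the spirit of the figure reported above (39.8% recall for GPT-4); the field names and keyword matching are illustrative assumptions, not the benchmark's actual 12-metric definitions.

```python
# Toy clarification recall: fraction of expert-annotated clarification points
# that the model actually asked about during the dialogue.
def clarification_recall(gold_points: list[str], model_questions: list[str]) -> float:
    if not gold_points:
        return 1.0
    asked = sum(
        any(point.lower() in q.lower() for q in model_questions)
        for point in gold_points
    )
    return asked / len(gold_points)

# Example: 2 of 3 annotated points are covered -> recall ~= 0.67
print(clarification_recall(
    ["written agreement", "termination notice", "employer"],
    ["Do you have a written agreement?", "Who is your employer?"],
))
```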
- Asia > China (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States (0.04)
- (4 more...)
- Law > Criminal Law (0.67)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
Grolleau, François, Alsentzer, Emily, Keyes, Timothy, Chung, Philip, Swaminathan, Akshay, Aali, Asad, Hom, Jason, Huynh, Tridu, Lew, Thomas, Liang, April S., Chu, Weihan, Steele, Natasha Z., Lin, Christina F., Yang, Jingkun, Black, Kameron C., Ma, Stephen P., Haredasht, Fateme N., Shah, Nigam H., Schulman, Kevin, Chen, Jonathan H.
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
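A minimal sketch of the LLM-jury idea: several judge models each decide whether a clinician-defined key fact appears in the generated summary, and a strict majority decides. `ask_judge` is a hypothetical stub, and the aggregation shown is not necessarily the paper's exact protocol.

```python
# Hypothetical LLM jury: each judge answers "yes"/"no" on key-fact inclusion,
# and a strict majority decides.
def ask_judge(judge_model: str, key_fact: str, summary: str) -> str:
    """Placeholder: prompt a judge LLM to reply 'yes' or 'no'."""
    raise NotImplementedError

def jury_says_covered(key_fact: str, summary: str,
                      judges=("judge-a", "judge-b", "judge-c")) -> bool:
    yes_votes = sum(ask_judge(j, key_fact, summary) == "yes" for j in judges)
    return yes_votes > len(judges) / 2

def key_fact_coverage(key_facts: list[str], summary: str) -> float:
    """Fraction of clinician-defined key facts the jury deems included."""
    return sum(jury_says_covered(f, summary) for f in key_facts) / len(key_facts)
```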
- North America > United States > California > Santa Clara County > Palo Alto (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.05)
- Workflow (1.00)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.67)
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Yun, Taewon, Oh, Jihwan, Min, Hyangsuk, Lee, Yuho, Bang, Jihwan, Cai, Jason, Song, Hwanjun
Summarization refinement faces challenges when extended to multiple dimensions. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting that reflective reasoning and addressing multiple types of feedback simultaneously are crucial to mitigating trade-offs between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our findings emphasize that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.
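A rough sketch of simultaneous multi-dimensional refinement, i.e., one reflective prompt covering all feedback at once rather than fixing dimensions sequentially. The prompt wording and `query_llm` are assumptions, not the released ReFeed pipeline.

```python
# Hypothetical single-pass refinement over multi-dimensional feedback.
def refine(summary: str, feedback: dict[str, str], query_llm) -> str:
    fb = "\n".join(f"- {dim}: {note}" for dim, note in feedback.items())
    prompt = (
        "You are revising a summary.\n\n"
        f"Current summary:\n{summary}\n\n"
        f"Feedback:\n{fb}\n\n"
        "First reflect on how to satisfy all feedback at once without hurting "
        "any single dimension, then output only the revised summary."
    )
    return query_llm(prompt)

# Example feedback dict:
# {"faithfulness": "the Q3 revenue claim is unsupported",
#  "completeness": "the CEO transition is missing",
#  "conciseness": "the second paragraph repeats the first"}
```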
- North America > United States (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization
Thulke, David, Gao, Yingbo, Jalota, Rricha, Dugast, Christian, Ney, Hermann
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset utilizing stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case-specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge-based evaluation techniques to ensure the quality and relevance of the summaries. Our results show that the fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness, and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.
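An illustrative length-controlled summarization prompt plus a simple length-tolerance check; the actual prompt templates, synthetic-data pipeline, and fine-tuning setup from the paper are not reproduced here.

```python
# Hypothetical length-controlled summarization prompt and a length check
# that could be used to filter or reject generated outputs.
def build_prompt(transcript: str, target_words: int) -> str:
    return (
        f"Summarize the following telephone conversation in about {target_words} words. "
        "Cover the caller's request, any decisions made, and the next steps.\n\n"
        + transcript
    )

def within_tolerance(summary: str, target_words: int, tol: float = 0.2) -> bool:
    n_words = len(summary.split())
    return abs(n_words - target_words) <= tol * target_words
```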
- Asia > Thailand > Bangkok > Bangkok (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Media > Film (0.93)
- Leisure & Entertainment (0.68)
Learning to Summarize from LLM-generated Feedback
Song, Hwanjun, Yun, Taewon, Lee, Yuho, Lee, Gihun, Cai, Jason, Su, Hang
Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8B, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset will be released soon. The SummLlama3-8B model is now available at https://huggingface.co/DISLab/SummLlama3-8B.
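A small sketch of how multi-dimensional feedback scores could be turned into preference pairs for direct preference optimization, one of the two methods compared above; the field names are hypothetical and may not match the FeedSum schema.

```python
# Hypothetical conversion of scored summaries into DPO-style preference pairs.
def to_preference_pairs(records: list[dict]) -> list[dict]:
    """Each record: {"document": str,
                     "summaries": [{"text": str, "scores": {dim: int, ...}}, ...]}"""
    pairs = []
    for rec in records:
        ranked = sorted(rec["summaries"],
                        key=lambda s: sum(s["scores"].values()),
                        reverse=True)
        if len(ranked) >= 2:
            pairs.append({"prompt": rec["document"],
                          "chosen": ranked[0]["text"],     # best-scored summary
                          "rejected": ranked[-1]["text"]}) # worst-scored summary
    return pairs
```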
- North America > United States (0.28)
- Europe > Finland (0.04)
- Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.04)
- (2 more...)
STRUX: An LLM for Decision-Making with Structured Explanations
Lu, Yiming, Hu, Yebowen, Foroosh, Hassan, Jin, Wei, Liu, Fei
Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.
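One plausible way to represent STRUX-style structured explanations as data: each key fact carries a polarity (favorable or adverse for the decision) and a strength, and the signed sum gives a crude decision signal. The schema and aggregation below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical schema for structured explanations: key facts with polarity
# and strength, aggregated into a crude decision signal.
from dataclasses import dataclass
from typing import Literal

@dataclass
class KeyFact:
    text: str
    polarity: Literal["favorable", "adverse"]
    strength: float  # 0..1, how strongly the fact supports or opposes the decision

def decision_signal(facts: list[KeyFact]) -> float:
    return sum(f.strength if f.polarity == "favorable" else -f.strength for f in facts)

facts = [
    KeyFact("Revenue beat guidance by 8%", "favorable", 0.9),
    KeyFact("Management lowered next-quarter outlook", "adverse", 0.7),
]
print(decision_signal(facts))  # ~0.2 -> mildly favorable overall
```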
- North America > United States (0.28)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (3 more...)
- Financial News (1.00)
- Overview > Fact Book (0.55)
UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs
Lee, Yuho, Yun, Taewon, Cai, Jason, Su, Hang, Song, Hwanjun
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and struggle with subjective and coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input context (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation to identify potentially hallucinogenic input texts and to help human annotators reduce the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine of the latest language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.
CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
Gong, Ziwei, Ai, Lin, Deshpande, Harshsaiprasad, Johnson, Alexander, Phung, Emmy, Wu, Zehui, Emami, Ahmad, Hirschberg, Julia
The rapid advancement of Large Language Models (LLMs) has significantly influenced the field of automatic evaluation for text summarization. LLMs offer the potential to streamline the evaluation process, making it faster and more cost-effective compared to traditional human evaluation (Liu et al., 2023; Wang et al., 2023). However, despite the progress in automatic evaluation techniques, existing methods primarily target general-purpose summarization tasks, which typically involve shorter, more straightforward text inputs, which may not [...] In this paper, we address this gap by developing a new evaluation framework tailored specifically for meeting summarization. We propose CREAM (Comparison-based Reference-free Elo-ranked Automatic evaluation for Meeting summarization), a novel system designed to fill the gaps in specialized and customizable evaluation for meeting summaries, as illustrated in Figure 1. Our research addresses the following key questions: 1. Do current LLM-based automatic evaluators work effectively for meeting summarization?
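To illustrate the "Elo-ranked" part of the framework, here is a standard Elo update applied to pairwise, reference-free comparisons between two summarization systems; the judge call, K-factor, and initial ratings are assumptions rather than CREAM's exact configuration.

```python
# Standard Elo update driven by pairwise LLM-judge preferences (illustrative).
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: float, k: float = 32.0):
    """a_wins: 1.0 if A's summary is preferred, 0.0 if B's, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (a_wins - e_a), r_b + k * ((1.0 - a_wins) - (1.0 - e_a))

ratings = {"system-A": 1000.0, "system-B": 1000.0}
# Suppose an LLM judge prefers system-A's summary for one meeting:
ratings["system-A"], ratings["system-B"] = elo_update(
    ratings["system-A"], ratings["system-B"], a_wins=1.0)
print(ratings)  # system-A gains 16 points, system-B loses 16
```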
- Asia > Singapore (0.05)
- North America > United States (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (4 more...)