Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Gupta, Raavi, Panicker, Pranav Hari, Bhatia, Sumit, Ramakrishnan, Ganesh
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via the APIs provided by LLM vendors, where there is no access to model weights and no option to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or resources are constrained, typically require making multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not leverage any external knowledge base and works on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets covering both factual text generation and open-ended generation shows that CONFACTCHECK detects hallucinated facts efficiently, using fewer resources and achieving higher accuracy scores than existing baselines that operate under similar conditions. Our code is available here.
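As a rough illustration of the consistency intuition described above (not the authors' exact pipeline), the sketch below re-asks the model about each factual probe drawn from its own output and flags probes whose fresh answers disagree with the originally asserted answer. `query_llm` and the lexical agreement check are assumed stand-ins.

```python
# Hypothetical sketch of consistency-based hallucination flagging.
# `query_llm` is a stub for any vendor chat-completion API; the real
# CONFACTCHECK method may build and compare probes differently.
from difflib import SequenceMatcher

def query_llm(prompt: str, model: str = "model-a") -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError("wire up an LLM client here")

def agree(a: str, b: str, threshold: float = 0.8) -> bool:
    # Cheap lexical agreement; an NLI or embedding model could replace this.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def check_facts(probes: dict[str, str]) -> dict[str, bool]:
    """probes maps a probe question to the answer asserted in the generated text.
    Returns question -> True when the fact looks hallucinated (inconsistent)."""
    flags = {}
    for question, original_answer in probes.items():
        same_model = query_llm(question)                    # re-probe the same LLM
        cross_model = query_llm(question, model="model-b")  # optional second LLM
        flags[question] = not (agree(original_answer, same_model)
                               and agree(original_answer, cross_model))
    return flags
```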
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Austria > Vienna (0.14)
- South America > Argentina (0.05)
- (9 more...)
- Research Report (1.00)
- Overview > Fact Book (0.43)
Adobe Summit Concierge Evaluation with Human in the Loop
Chen, Yiru, Fang, Sally, Harsha, Sai Sree, Luo, Dan, Muppala, Vaishnavi, Wu, Fei, Jiang, Shun, Qian, Kun, Li, Yunyao
Generative AI assistants offer significant potential to enhance productivity, streamline information access, and improve user experience in enterprise contexts. In this work, we present Summit Concierge, a domain-specific AI assistant developed for Adobe Summit. The assistant handles a wide range of event-related queries and operates under real-world constraints such as data sparsity, quality assurance, and rapid deployment. To address these challenges, we adopt a human-in-the-loop development workflow that combines prompt engineering, retrieval grounding, and lightweight human validation. We describe the system architecture, development process, and real-world deployment outcomes. Our experience shows that agile, feedback-driven development enables scalable and reliable AI assistants, even in cold-start scenarios.
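A minimal, purely hypothetical sketch (not Adobe's implementation) of the workflow the abstract describes: answers are grounded in retrieved event documents, and weakly grounded responses are routed through a lightweight human-validation gate. `retrieve`, `query_llm`, and the grounding heuristic are assumed stand-ins.

```python
# Hypothetical retrieval-grounded assistant with a human-review fallback.
def answer(query: str, retrieve, query_llm, review_queue: list) -> str:
    docs = retrieve(query)  # e.g., top-k snippets from event FAQs and schedules
    prompt = (
        "Answer the attendee's question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(docs) + f"\n\nQuestion: {query}"
    )
    reply = query_llm(prompt)
    if not docs:                          # weak grounding -> route to a human
        review_queue.append((query, reply))
        return "A team member will follow up with you shortly."
    return reply
```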
- Overview (0.48)
- Research Report (0.40)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.90)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)
LeCoDe: A Benchmark Dataset for Interactive Legal Consultation Dialogue Evaluation
Yuan, Weikang, Song, Kaisong, Jiang, Zhuoren, Cao, Junjie, Zhang, Yujie, Lin, Jun, Kuang, Kun, Zhang, Ji, Liu, Xiaozhong
Legal consultation is essential for safeguarding individual rights and ensuring access to justice, yet remains costly and inaccessible to many individuals due to the shortage of professionals. While recent advances in Large Language Models (LLMs) offer a promising path toward scalable, low-cost legal assistance, current systems fall short in handling the interactive and knowledge-intensive nature of real-world consultations. To address these challenges, we introduce LeCoDe, a real-world multi-turn benchmark dataset comprising 3,696 legal consultation dialogues with 110,008 dialogue turns, designed to evaluate and improve LLMs' legal consultation capability. With LeCoDe, we innovatively collect live-streamed consultations from short-video platforms, providing authentic multi-turn legal consultation dialogues. Rigorous annotation by legal experts further enhances the dataset with professional insights and expertise. Furthermore, we propose a comprehensive evaluation framework that assesses LLMs' consultation capabilities in terms of (1) clarification capability and (2) professional advice quality. This unified framework incorporates 12 metrics across these two dimensions. Through extensive experiments on various general and domain-specific LLMs, our results reveal significant challenges in this task, with even state-of-the-art models like GPT-4 achieving only 39.8% recall for clarification and a 59% overall score for advice quality, highlighting the complexity of professional consultation scenarios. Based on these findings, we further explore several strategies to enhance LLMs' legal consultation abilities. Our benchmark contributes to advancing research in legal domain dialogue systems, particularly in simulating more real-world user-expert interactions.
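To make the clarification dimension concrete, here is a toy recall metric in the spirit of the figure reported above (39.8% recall for GPT-4); the field names and keyword matching are illustrative assumptions, not the benchmark's actual 12-metric definitions.

```python
# Toy clarification recall: fraction of expert-annotated clarification points
# that the model actually asked about during the dialogue.
def clarification_recall(gold_points: list[str], model_questions: list[str]) -> float:
    if not gold_points:
        return 1.0
    asked = sum(
        any(point.lower() in q.lower() for q in model_questions)
        for point in gold_points
    )
    return asked / len(gold_points)

# Example: 2 of 3 annotated points are covered -> recall ~= 0.67
print(clarification_recall(
    ["written agreement", "termination notice", "employer"],
    ["Do you have a written agreement?", "Who is your employer?"],
))
```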
- Asia > China (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States (0.04)
- (4 more...)
- Law > Criminal Law (0.67)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
MedFactEval and MedAgentBrief: A Framework and Workflow for Generating and Evaluating Factual Clinical Summaries
Grolleau, François, Alsentzer, Emily, Keyes, Timothy, Chung, Philip, Swaminathan, Akshay, Aali, Asad, Hom, Jason, Huynh, Tridu, Lew, Thomas, Liang, April S., Chu, Weihan, Steele, Natasha Z., Lin, Christina F., Yang, Jingkun, Black, Kameron C., Ma, Stephen P., Haredasht, Fateme N., Shah, Nigam H., Schulman, Kevin, Chen, Jonathan H.
Evaluating factual accuracy in Large Language Model (LLM)-generated clinical text is a critical barrier to adoption, as expert review is unscalable for the continuous quality assurance these systems require. We address this challenge with two complementary contributions. First, we introduce MedFactEval, a framework for scalable, fact-grounded evaluation where clinicians define high-salience key facts and an "LLM Jury"--a multi-LLM majority vote--assesses their inclusion in generated summaries. Second, we present MedAgentBrief, a model-agnostic, multi-step workflow designed to generate high-quality, factual discharge summaries. To validate our evaluation framework, we established a gold-standard reference using a seven-physician majority vote on clinician-defined key facts from inpatient cases. The MedFactEval LLM Jury achieved almost perfect agreement with this panel (Cohen's kappa=81%), a performance statistically non-inferior to that of a single human expert (kappa=67%, P < 0.001). Our work provides both a robust evaluation framework (MedFactEval) and a high-performing generation workflow (MedAgentBrief), offering a comprehensive approach to advance the responsible deployment of generative AI in clinical workflows.
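A minimal sketch of the LLM-jury idea: several judge models each decide whether a clinician-defined key fact appears in the generated summary, and a strict majority decides. `ask_judge` is a hypothetical stub, and the aggregation shown is not necessarily the paper's exact protocol.

```python
# Hypothetical LLM jury: each judge answers "yes"/"no" on key-fact inclusion,
# and a strict majority decides.
def ask_judge(judge_model: str, key_fact: str, summary: str) -> str:
    """Placeholder: prompt a judge LLM to reply 'yes' or 'no'."""
    raise NotImplementedError

def jury_says_covered(key_fact: str, summary: str,
                      judges=("judge-a", "judge-b", "judge-c")) -> bool:
    yes_votes = sum(ask_judge(j, key_fact, summary) == "yes" for j in judges)
    return yes_votes > len(judges) / 2

def key_fact_coverage(key_facts: list[str], summary: str) -> float:
    """Fraction of clinician-defined key facts the jury deems included."""
    return sum(jury_says_covered(f, summary) for f in key_facts) / len(key_facts)
```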
- North America > United States > California > Santa Clara County > Palo Alto (0.14)
- North America > United States > California > Santa Clara County > Stanford (0.05)
- Workflow (1.00)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.67)
ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
Yun, Taewon, Oh, Jihwan, Min, Hyangsuk, Lee, Yuho, Bang, Jihwan, Cai, Jason, Song, Hwanjun
Summarization refinement faces challenges when extended to multiple dimensions. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting that reflective reasoning and addressing multiple types of feedback simultaneously are crucial to mitigating trade-offs between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our findings emphasize that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.
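A rough sketch of simultaneous multi-dimensional refinement, i.e., one reflective prompt covering all feedback at once rather than fixing dimensions sequentially. The prompt wording and `query_llm` are assumptions, not the released ReFeed pipeline.

```python
# Hypothetical single-pass refinement over multi-dimensional feedback.
def refine(summary: str, feedback: dict[str, str], query_llm) -> str:
    fb = "\n".join(f"- {dim}: {note}" for dim, note in feedback.items())
    prompt = (
        "You are revising a summary.\n\n"
        f"Current summary:\n{summary}\n\n"
        f"Feedback:\n{fb}\n\n"
        "First reflect on how to satisfy all feedback at once without hurting "
        "any single dimension, then output only the revised summary."
    )
    return query_llm(prompt)

# Example feedback dict:
# {"faithfulness": "the Q3 revenue claim is unsupported",
#  "completeness": "the CEO transition is missing",
#  "conciseness": "the second paragraph repeats the first"}
```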
- North America > United States (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Prompting and Fine-Tuning of Small LLMs for Length-Controllable Telephone Call Summarization
Thulke, David, Gao, Yingbo, Jalota, Rricha, Dugast, Christian, Ney, Hermann
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset utilizing stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case-specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge-based evaluation techniques to ensure the quality and relevance of the summaries. Our results show that the fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness, and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.
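An illustrative length-controlled summarization prompt plus a simple length-tolerance check; the actual prompt templates, synthetic-data pipeline, and fine-tuning setup from the paper are not reproduced here.

```python
# Hypothetical length-controlled summarization prompt and a length check
# that could be used to filter or reject generated outputs.
def build_prompt(transcript: str, target_words: int) -> str:
    return (
        f"Summarize the following telephone conversation in about {target_words} words. "
        "Cover the caller's request, any decisions made, and the next steps.\n\n"
        + transcript
    )

def within_tolerance(summary: str, target_words: int, tol: float = 0.2) -> bool:
    n_words = len(summary.split())
    return abs(n_words - target_words) <= tol * target_words
```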
- Asia > Thailand > Bangkok > Bangkok (0.05)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (8 more...)
- Media > Film (0.93)
- Leisure & Entertainment (0.68)
Learning to Summarize from LLM-generated Feedback
Song, Hwanjun, Yun, Taewon, Lee, Yuho, Lee, Gihun, Cai, Jason, Su, Hang
Developing effective text summarizers remains a challenge due to issues like hallucinations, key information omissions, and verbosity in LLM-generated summaries. This work explores using LLM-generated feedback to improve summary quality by aligning the summaries with human preferences for faithfulness, completeness, and conciseness. We introduce FeedSum, a large-scale dataset containing multi-dimensional LLM feedback on summaries of varying quality across diverse domains. Our experiments show how feedback quality, dimensionality, and granularity influence preference learning, revealing that high-quality, multi-dimensional, fine-grained feedback significantly improves summary generation. We also compare two methods for using this feedback: supervised fine-tuning and direct preference optimization. Finally, we introduce SummLlama3-8B, a model that outperforms the nearly 10x larger Llama3-70b-instruct in generating human-preferred summaries, demonstrating that smaller models can achieve superior performance with appropriate training. The full dataset will be released soon. The SummLlama3-8B model is now available at https://huggingface.co/DISLab/SummLlama3-8B.
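A small sketch of how multi-dimensional feedback scores could be turned into preference pairs for direct preference optimization, one of the two methods compared above; the field names are hypothetical and may not match the FeedSum schema.

```python
# Hypothetical conversion of scored summaries into DPO-style preference pairs.
def to_preference_pairs(records: list[dict]) -> list[dict]:
    """Each record: {"document": str,
                     "summaries": [{"text": str, "scores": {dim: int, ...}}, ...]}"""
    pairs = []
    for rec in records:
        ranked = sorted(rec["summaries"],
                        key=lambda s: sum(s["scores"].values()),
                        reverse=True)
        if len(ranked) >= 2:
            pairs.append({"prompt": rec["document"],
                          "chosen": ranked[0]["text"],     # best-scored summary
                          "rejected": ranked[-1]["text"]}) # worst-scored summary
    return pairs
```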
- North America > United States (0.28)
- Europe > Finland (0.04)
- Asia > Pakistan > Islamabad Capital Territory > Islamabad (0.04)
- (2 more...)
STRUX: An LLM for Decision-Making with Structured Explanations
Lu, Yiming, Hu, Yebowen, Foroosh, Hassan, Jin, Wei, Liu, Fei
Countless decisions shape our daily lives, and it is paramount to understand the how and why behind these choices. In this paper, we introduce a new LLM decision-making framework called STRUX, which enhances LLM decision-making by providing structured explanations. These include favorable and adverse facts related to the decision, along with their respective strengths. STRUX begins by distilling lengthy information into a concise table of key facts. It then employs a series of self-reflection steps to determine which of these facts are pivotal, categorizing them as either favorable or adverse in relation to a specific decision. Lastly, we fine-tune an LLM to identify and prioritize these key facts to optimize decision-making. STRUX has been evaluated on the challenging task of forecasting stock investment decisions based on earnings call transcripts and demonstrated superior performance against strong baselines. It enhances decision transparency by allowing users to understand the impact of different factors, representing a meaningful step towards practical decision-making with LLMs.
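One plausible way to represent STRUX-style structured explanations as data: each key fact carries a polarity (favorable or adverse for the decision) and a strength, and the signed sum gives a crude decision signal. The schema and aggregation below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical schema for structured explanations: key facts with polarity
# and strength, aggregated into a crude decision signal.
from dataclasses import dataclass
from typing import Literal

@dataclass
class KeyFact:
    text: str
    polarity: Literal["favorable", "adverse"]
    strength: float  # 0..1, how strongly the fact supports or opposes the decision

def decision_signal(facts: list[KeyFact]) -> float:
    return sum(f.strength if f.polarity == "favorable" else -f.strength for f in facts)

facts = [
    KeyFact("Revenue beat guidance by 8%", "favorable", 0.9),
    KeyFact("Management lowered next-quarter outlook", "adverse", 0.7),
]
print(decision_signal(facts))  # ~0.2 -> mildly favorable overall
```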
- North America > United States (0.28)
- North America > Canada > Ontario > Toronto (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (3 more...)
- Financial News (1.00)
- Overview > Fact Book (0.55)
UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs
Lee, Yuho, Yun, Taewon, Cai, Jason, Su, Hang, Song, Hwanjun
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and struggle with subjective and coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input context (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation to identify potentially hallucinogenic input texts and to help human annotators reduce the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine of the latest language models as summarizers, offering insights into their performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.
CREAM: Comparison-Based Reference-Free ELO-Ranked Automatic Evaluation for Meeting Summarization
Gong, Ziwei, Ai, Lin, Deshpande, Harshsaiprasad, Johnson, Alexander, Phung, Emmy, Wu, Zehui, Emami, Ahmad, Hirschberg, Julia
The rapid advancement of Large Language Models (LLMs) has significantly influenced the field of automatic evaluation for text summarization. LLMs offer the potential to streamline the evaluation process, making it faster and more cost-effective compared to traditional human evaluation (Liu et al., 2023; Wang et al., 2023). However, despite the progress in automatic evaluation techniques, existing methods primarily target general-purpose summarization tasks, which typically involve shorter, more straightforward text inputs, which may not [...] In this paper, we address this gap by developing a new evaluation framework tailored specifically for meeting summarization. We propose CREAM (Comparison-based Reference-free Elo-ranked Automatic evaluation for Meeting summarization), a novel system designed to fill the gaps in specialized and customizable evaluation for meeting summaries, as illustrated in Figure 1. Our research addresses the following key questions: 1. Do current LLM-based automatic evaluators work effectively for meeting summarization?
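To illustrate the "Elo-ranked" part of the framework, here is a standard Elo update applied to pairwise, reference-free comparisons between two summarization systems; the judge call, K-factor, and initial ratings are assumptions rather than CREAM's exact configuration.

```python
# Standard Elo update driven by pairwise LLM-judge preferences (illustrative).
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: float, k: float = 32.0):
    """a_wins: 1.0 if A's summary is preferred, 0.0 if B's, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (a_wins - e_a), r_b + k * ((1.0 - a_wins) - (1.0 - e_a))

ratings = {"system-A": 1000.0, "system-B": 1000.0}
# Suppose an LLM judge prefers system-A's summary for one meeting:
ratings["system-A"], ratings["system-B"] = elo_update(
    ratings["system-A"], ratings["system-B"], a_wins=1.0)
print(ratings)  # system-A gains 16 points, system-B loses 16
```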
- Asia > Singapore (0.05)
- North America > United States (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (4 more...)