AITopics | Tang, Liyan

Collaborating Authors

Tang, Liyan

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Tang, Liyan, Laban, Philippe, Durrett, Greg

arXiv.org Artificial IntelligenceApr-16-2024

Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of "fact-checking" are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to LLMs to check a single response. In this work, we show how to build small models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify pre-existing datasets into a benchmark LLM-AggreFact, collected from recent work on fact-checking and grounding LLM generations. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2404.10774

Country:

Europe (1.00)
North America > United States > Louisiana (0.14)
North America > United States > Texas (0.14)

Genre: Research Report > New Finding (0.92)

Industry:

Leisure & Entertainment (1.00)
Government (1.00)
Banking & Finance > Economy (1.00)
Media > Film (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Tang, Liyan, Shalyminov, Igor, Wong, Amy Wing-mei, Burnsky, Jon, Vincent, Jake W., Yang, Yu'an, Singh, Siffi, Feng, Song, Song, Hwanjun, Su, Hang, Sun, Lijia, Zhang, Yi, Mansour, Saab, McKeown, Kathleen

arXiv.org Artificial IntelligenceMar-31-2024

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

explanation, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2402.13249

Country:

North America > United States > Texas (0.14)
North America > United States > Massachusetts (0.14)

Genre: Research Report (1.00)

Industry:

Government (1.00)
Transportation > Air (0.46)
Automobiles & Trucks > Manufacturer (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)

Add feedback

Less Likely Brainstorming: Using Language Models to Generate Alternative Hypotheses

Tang, Liyan, Peng, Yifan, Wang, Yanshan, Ding, Ying, Durrett, Greg, Rousseau, Justin F.

arXiv.org Artificial IntelligenceMay-30-2023

A human decision-maker benefits the most from an AI assistant that corrects for their biases. For problems such as generating interpretation of a radiology report given findings, a system predicting only highly likely outcomes may be less useful, where such outcomes are already obvious to the user. To alleviate biases in human decision-making, it is worth considering a broad differential diagnosis, going beyond the most likely options. We introduce a new task, "less likely brainstorming," that asks a model to generate outputs that humans think are relevant but less likely to happen. We explore the task in two settings: a brain MRI interpretation generation setting and an everyday commonsense reasoning setting. We found that a baseline approach of training with less likely hypotheses as targets generates outputs that humans evaluate as either likely or irrelevant nearly half of the time; standard MLE training is not effective. To tackle this problem, we propose a controlled text generation method that uses a novel contrastive learning strategy to encourage models to differentiate between generating likely and less likely outputs according to humans. We compare our method with several state-of-the-art controlled text generation models via automatic and human evaluations and show that our models' capability of generating less likely outputs is improved.

artificial intelligence, expert system, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.19339

Country: North America > United States > Texas (0.14)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.88)
Health & Medicine > Diagnostic Medicine > Imaging (0.88)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.34)

Add feedback

Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Tang, Liyan, Goyal, Tanya, Fabbri, Alexander R., Laban, Philippe, Xu, Jiacheng, Yavuz, Semih, Kryściński, Wojciech, Rousseau, Justin F., Durrett, Greg

arXiv.org Artificial IntelligenceMay-25-2023

The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems' outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models instead of more relevant recent summarization models. We further perform a finer-grained analysis per error-type and find similar performance variance across error types for different factuality metrics. Our results show that no one metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.

computational linguistic, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2205.12854

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Making Document-Level Information Extraction Right for the Right Reasons

Tang, Liyan, Rajan, Dhruv, Mohan, Suyash, Pradhan, Abhijeet, Bryan, R. Nick, Durrett, Greg

arXiv.org Artificial IntelligenceOct-14-2021

Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in radiology a report may not be explicitly stated, but nevertheless can be inferred from the report's text. However, document-level neural models can easily learn spurious correlations from irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. While this basic approach can extract reasonable evidence, it can be regularized with small amounts of evidence supervision during training, which substantially improves the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019) and show that models' plausibility can be improved with no loss in accuracy.

data mining, diagnostic medicine, machine learning, (24 more...)

arXiv.org Artificial Intelligence

2110.07686

Country:

Asia (0.68)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.64)

Industry:

Health & Medicine > Nuclear Medicine (0.89)
Health & Medicine > Diagnostic Medicine > Imaging (0.89)
Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining > Text Mining (0.61)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback