BRIDO: Bringing Democratic Order to Abstractive Summarization
Lee, Junhyun, Goka, Harshith, Ko, Hyeonmok
Hallucination refers to inaccurate, irrelevant, or inconsistent text generated by large language models (LLMs). While LLMs have shown great promise in a variety of tasks, hallucination remains a major challenge for many practical uses. In this paper, we tackle hallucination in abstractive text summarization by mitigating exposure bias. Existing models targeting exposure bias mitigation, namely BRIO, aim for better summarization quality as measured by the ROUGE score. We propose a model that uses a similar exposure bias mitigation strategy but with a goal aligned with reducing hallucination. We conjecture that, within a group of candidate outputs, those with hallucinations will form a minority of the group; that is, candidates with less similarity to the others have a higher chance of containing hallucinated content. Our method exploits this property through contrastive learning, incentivizing candidates with high inter-candidate ROUGE scores. We performed experiments on the XSum and CNN/DM summarization datasets, and our method improved the consistency G-Eval score over BRIO by 6.25% and 3.82%, respectively.
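The core heuristic is easy to sketch. The snippet below is a minimal illustration, not the authors' code; the candidate strings and the function name are invented. It scores each candidate summary by its average ROUGE-L F1 against the other candidates, so that an outlier (the conjectured hallucinator) ranks low; a BRIO-style contrastive objective would then reward the high-consensus candidates.

```python
# Minimal sketch of the inter-candidate ROUGE heuristic described above.
# Not the authors' implementation; candidates below are toy placeholders.
from rouge_score import rouge_scorer

def inter_candidate_scores(candidates: list[str]) -> list[float]:
    """Mean ROUGE-L F1 of each candidate against all other candidates."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = []
    for i, cand in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        f1s = [scorer.score(ref, cand)["rougeL"].fmeasure for ref in others]
        scores.append(sum(f1s) / len(f1s))
    return scores

candidates = [
    "The mayor opened the new bridge on Monday.",
    "A new bridge was opened by the mayor this week.",
    "The mayor resigned after the bridge collapsed.",  # outlier: likely hallucinated
]
for score, cand in zip(inter_candidate_scores(candidates), candidates):
    print(f"{score:.3f}  {cand}")
```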
Evaluating AI fairness in credit scoring with the BRIO tool
Coraglia, Greta, Genco, Francesco A., Piantadosi, Pellegrino, Bagli, Enrico, Giuffrida, Pietro, Posillipo, Davide, Primiero, Giuseppe
We present a method for quantitative, in-depth analyses of fairness issues in AI systems, with an application to credit scoring. To this aim, we use BRIO, a tool for evaluating AI systems with respect to social unfairness and, more generally, ethically undesirable behaviours. It features a model-agnostic bias detection module, presented in earlier work (Coraglia et al., 2023), to which a full-fledged unfairness risk evaluation module is added. As a case study, we focus on credit scoring, analysing the UCI German Credit Dataset. We apply the BRIO fairness metrics to several socially sensitive attributes featured in the German Credit Dataset, quantifying fairness across various demographic segments with the aim of identifying potential sources of bias and discrimination in a credit scoring model. We conclude by combining our results with a revenue analysis.
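As a generic illustration of the kind of per-attribute quantity such an audit inspects, the sketch below computes a standard demographic-parity gap on toy values; it is not BRIO's own metric, and the data and function name are invented.

```python
# Illustrative only: a standard demographic-parity gap, i.e. the absolute
# difference in approval rates between two groups under a sensitive
# attribute. BRIO's actual metrics are not reproduced here.
def demographic_parity_gap(predictions, groups, group_a, group_b):
    """Absolute difference in positive-prediction rate between two groups."""
    def rate(g):
        members = [p for p, grp in zip(predictions, groups) if grp == g]
        return sum(members) / len(members)
    return abs(rate(group_a) - rate(group_b))

# Toy data: 1 = credit approved, 0 = denied.
preds  = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_gap(preds, groups, "A", "B"))  # 0.5
```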
On Learning to Summarize with Large Language Models as References
Liu, Yixin, Shi, Kejian, He, Katherine S, Ye, Longtian, Fabbri, Alexander R., Liu, Pengfei, Radev, Dragomir, Cohan, Arman
Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. We therefore investigate a new learning setting for text summarization models that treats LLMs as the reference, or gold-standard oracle, on these datasets. To examine standard practices aligned with this new learning setting, we investigate two LLM-based summary quality evaluation methods for model training and adopt a contrastive learning training method to leverage the LLM-guided learning signals. Our experiments on the CNN/DailyMail and XSum datasets show that smaller summarization models can achieve performance similar to that of LLMs under LLM-based evaluation. However, we found that the smaller models cannot yet reach LLM-level performance under human evaluation, despite the promising improvements brought by our proposed training methods. Meanwhile, a meta-analysis of this new learning setting reveals a discrepancy between human and LLM-based evaluation, highlighting both the benefits and the risks of the LLM-as-reference setting we investigated.
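The contrastive training method mentioned here belongs to the BRIO family of ranking objectives. Below is a hedged sketch of such a pairwise margin loss, assuming candidate scores arrive already sorted best-first by an LLM-based quality evaluation; the function name, margin value, and toy scores are assumptions, not the paper's exact objective.

```python
# Sketch of a BRIO-style pairwise ranking loss: given length-normalized
# candidate log-probabilities sorted best-first by an external quality
# signal (here, hypothetically, an LLM-based evaluator), penalize pairs
# where a lower-ranked candidate is not beaten by a rank-scaled margin.
import torch

def ranking_loss(log_probs: torch.Tensor, margin: float = 0.01) -> torch.Tensor:
    """log_probs: (num_candidates,) scores in best-first order."""
    loss = log_probs.new_zeros(())
    n = log_probs.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # Candidate i should outscore candidate j by at least (j - i) * margin.
            loss = loss + torch.clamp(
                margin * (j - i) - (log_probs[i] - log_probs[j]), min=0.0
            )
    return loss

scores = torch.tensor([-0.8, -1.1, -0.9], requires_grad=True)
print(ranking_loss(scores))  # nonzero: ranks 2 and 3 are inverted
```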
GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization
Liu, Yang Janet, Zeldes, Amir
Automatic summarization with pre-trained language models has led to impressively fluent results, but is prone to 'hallucinations', low performance on non-news genres, and outputs which are not exactly summaries. Targeting ACL 2023's 'Reality Check' theme, we present GUMSum, a small but carefully crafted dataset of English summaries in 12 written and spoken genres for evaluation of abstractive summarization. Summaries are highly constrained, focusing on substitutive potential, factuality, and faithfulness. We present guidelines and evaluate human agreement as well as subjective judgments on recent system outputs, comparing general-domain untuned approaches, a fine-tuned one, and a prompt-based approach to human performance. Results show that while GPT-3 achieves impressive scores, it still underperforms humans, with varying quality across genres. Human judgments reveal different types of errors in supervised, prompted, and human-generated summaries, shedding light on the challenges of producing a good summary.
News Summarization and Evaluation in the Era of GPT-3
Goyal, Tanya, Li, Junyi Jessy, Durrett, Greg
The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these summaries also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold-standard test sets. Our experiments show that neither reference-based nor reference-free automatic metrics can reliably evaluate GPT-3 summaries. Finally, we evaluate models in a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across four standard summarization benchmarks, and (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.
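A toy illustration (invented texts, not the paper's experiments) of one failure mode behind this finding: a faithful abstractive paraphrase can score poorly on a reference-based metric simply because its wording diverges from the single gold reference.

```python
# Toy demonstration: a faithful paraphrase gets low lexical-overlap scores
# against the dataset reference. Texts are invented for illustration.
from rouge_score import rouge_scorer

reference = "Officials confirmed the factory will close next month, costing 200 jobs."
paraphrase = "Two hundred workers will lose their positions when the plant shuts down in a month."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, paraphrase).items():
    print(name, f"F1 = {score.fmeasure:.2f}")
```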