Collaborating Authors

Radev, Dragomir


P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

arXiv.org Artificial Intelligence

Existing methods for understanding the logical reasoning capabilities of LLMs rely on binary entailment classification or synthetically derived rationales, which are not sufficient for a proper investigation of a model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories, also written by humans. P-FOLIO is collected with an annotation protocol that enables humans to annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO spans from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning at a fine granularity via single-step inference-rule classification, with a more diverse set of inference rules at higher levels of complexity than previous work. Given that a single model-generated reasoning chain could take a completely different path from the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics to evaluate the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves model performance by 10% or more on three other out-of-domain logical reasoning datasets. We also conduct a detailed analysis to show where the most powerful LLMs fall short in reasoning. We will release the dataset and code publicly.
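
Pass@k is a standard estimator; a minimal sketch of the unbiased form (following the formulation popularized for code generation, applied here to sampled reasoning chains; variable names are ours, not the paper's) looks like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: the probability that at least one of k chains
    drawn from n samples, of which c are correct, is correct."""
    if n - c < k:
        return 1.0  # every size-k draw necessarily contains a correct chain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled reasoning chains per problem, 3 judged valid.
print(pass_at_k(n=16, c=3, k=5))  # ~0.71
```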


modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models

arXiv.org Artificial Intelligence

We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles that tests few-shot reasoning in AI systems. Solving these puzzles requires inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open-source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability that cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
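
To make the few-shot setup concrete, a puzzle reduces to a handful of revealed translation pairs plus a held-out query. The prompt format and toy data below are illustrative assumptions, not modeLing's actual contents:

```python
def build_puzzle_prompt(examples, query):
    """Assemble a few-shot prompt from the revealed pairs of a puzzle.

    `examples` is a list of (source_phrase, english_gloss) pairs shown
    to the model; `query` is the held-out phrase to translate.
    """
    lines = ["Translate the following phrases from an unknown language."]
    for src, gloss in examples:
        lines.append(f"{src} = {gloss}")
    lines.append(f"{query} = ")
    return "\n".join(lines)

# Hypothetical puzzle data for illustration only.
prompt = build_puzzle_prompt(
    [("mi moku", "I eat"), ("sina moku", "you eat")],
    "mi kama",
)
```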


Fair Abstractive Summarization of Diverse Perspectives

arXiv.org Artificial Intelligence

People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work on summarization metrics and large language model (LLM) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting the perspectives of any group of people, and we propose four reference-free automatic metrics that measure the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at https://github.com/psunlpgroup/FairSumm.
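
One plausible instantiation of a reference-free fairness metric in this spirit compares the distribution of group perspectives in the source against the summary. The total-variation-style distance below is our assumption for illustration; the paper's four metrics may be defined differently:

```python
def perspective_gap(source_counts: dict, summary_counts: dict) -> float:
    """Gap between group-perspective proportions in the source documents
    and in the generated summary (0 = perfectly proportional coverage)."""
    groups = set(source_counts) | set(summary_counts)
    src_total = sum(source_counts.values()) or 1
    sum_total = sum(summary_counts.values()) or 1
    return 0.5 * sum(
        abs(source_counts.get(g, 0) / src_total
            - summary_counts.get(g, 0) / sum_total)
        for g in groups
    )

# Example: group B's perspective is underrepresented in the summary.
print(perspective_gap({"A": 50, "B": 50}, {"A": 9, "B": 1}))  # 0.4
```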


A Call for Clarity in Beam Search: How It Works and When It Stops

arXiv.org Artificial Intelligence

Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first-come, first-served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation, that generalizes the stopping criterion and provides flexibility in the depth of search. Empirical results demonstrate that adjusting this patience factor improves the decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach modifies only one line of code and can thus be readily incorporated into any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.
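
In code, the described change amounts to generalizing the stopping test over the set of finished hypotheses; a minimal sketch (variable names are ours, not those of Transformers or fairseq):

```python
def should_stop(finished, beam_size: int, patience: float = 1.0) -> bool:
    """First-come-first-served stopping rule with a patience factor.

    patience = 1.0 reproduces the common implementation: stop once the
    number of completed sequences reaches the beam size. patience > 1
    keeps searching deeper; patience < 1 stops earlier.
    """
    return len(finished) >= patience * beam_size
```

With patience = 1.0 this reduces exactly to the overlooked heuristic the paper describes; larger values trade a small amount of decoding time for a deeper search.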


HPE: Answering Complex Questions over Text by Hybrid Question Parsing and Execution

arXiv.org Artificial Intelligence

The dominant paradigm for textual question answering systems is based on end-to-end neural networks, which excel at answering natural language questions but fall short on complex ones. This stands in contrast to the broad adoption of semantic parsing approaches over structured data sources (e.g., relational databases, knowledge graphs), which convert natural language questions to logical forms and execute them with query engines. Toward combining the strengths of neural and symbolic methods, we propose a framework of question parsing and execution for textual QA. It comprises two central pillars: (1) we parse questions of varying complexity into an intermediate representation, named H-expression, which is composed of simple questions as primitives and symbolic operations representing the relationships among them; (2) to execute the resulting H-expressions, we design a hybrid executor, which integrates deterministic rules to translate the symbolic operations with a drop-in neural reader network to answer each decomposed simple question. Hence, the proposed framework can be viewed as top-down question parsing followed by bottom-up answer backtracking. The resulting H-expressions closely guide the execution process, offering higher precision and better interpretability while still preserving the advantages of neural readers for resolving the primitive elements. Our extensive experiments on MuSiQue, 2WikiQA, HotpotQA, and NQ show that the proposed parsing and hybrid execution framework outperforms existing approaches in supervised, few-shot, and zero-shot settings, while also effectively exposing its underlying reasoning process.
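
To make the parse-then-execute idea concrete, here is a toy rendering of an H-expression tree and a bottom-up executor. The operation set and the `neural_reader` interface are illustrative assumptions, not the paper's exact design:

```python
from dataclasses import dataclass

@dataclass
class HExpr:
    """Either a primitive simple question (op == "ask") or a symbolic
    operation over sub-expressions, e.g. composing two reasoning hops."""
    op: str                 # "ask", "compose", ...
    question: str = ""      # question template; "#1" marks a sub-answer slot
    children: tuple = ()

def execute(expr: HExpr, context: str, neural_reader) -> str:
    """Bottom-up answer backtracking over the H-expression tree."""
    if expr.op == "ask":
        return neural_reader(expr.question, context)
    if expr.op == "compose":  # answer the child, substitute, then re-ask
        sub_answer = execute(expr.children[0], context, neural_reader)
        filled = expr.question.replace("#1", sub_answer)
        return neural_reader(filled, context)
    raise ValueError(f"unknown operation: {expr.op}")
```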


Ascle: A Python Natural Language Processing Toolkit for Medical Text Generation

arXiv.org Artificial Intelligence

This study introduces Ascle, a pioneering natural language processing (NLP) toolkit designed for medical text generation. Ascle is tailored for biomedical researchers and healthcare professionals, offering an easy-to-use, all-in-one solution that requires minimal programming expertise. For the first time, Ascle evaluates and provides interfaces for the latest pre-trained language models, encompassing four advanced and challenging generative functions: question answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. The toolkit, its models, and associated data are publicly available via https://github.com/Yale-LILY/MedGen.
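
A usage sketch of such a toolkit might look as follows. Note that the import path and method names below are hypothetical placeholders; we have not verified Ascle's actual public API, so consult the repository linked above for the real entry points:

```python
# Hypothetical usage sketch; real Ascle entry points may differ.
# See https://github.com/Yale-LILY/MedGen for the actual API.
from ascle import Ascle  # assumed import path

med = Ascle()
summary = med.summarize(  # assumed method name
    "The patient presented with acute chest pain radiating to the left arm..."
)
simple = med.simplify(    # assumed method name
    "Myocardial infarction was ruled out by serial troponins."
)
print(summary, simple, sep="\n")
```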


Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation

arXiv.org Artificial Intelligence

Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics. In this work, we develop strong-performing automatic metrics for reference-based summarization evaluation, based on a two-stage evaluation pipeline that first extracts basic information units from one text sequence and then checks the extracted units against another sequence. The metrics we develop include two-stage metrics that provide high interpretability at both the fine-grained unit level and the summary level, and one-stage metrics that strike a balance between efficiency and interpretability. We make the developed tools publicly available at https://github.com/Yale-LILY/AutoACU.
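
The two-stage pipeline described above can be sketched as follows; `extract_units` and `check_entailment` stand in for the learned extractor and checker models, which are assumptions here rather than the toolkit's actual interfaces:

```python
def two_stage_score(reference: str, candidate: str,
                    extract_units, check_entailment) -> float:
    """Stage 1: extract basic information units from the reference.
    Stage 2: check each unit against the candidate summary.

    Returns the fraction of reference units supported by the candidate;
    the per-unit decisions give fine-grained interpretability.
    """
    units = extract_units(reference)  # -> list[str]
    if not units:
        return 0.0
    supported = sum(check_entailment(unit, candidate) for unit in units)
    return supported / len(units)
```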


On Learning to Summarize with Large Language Models as References

arXiv.org Artificial Intelligence

Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we investigate a new learning setting for text summarization models that treats the LLMs as the reference or gold-standard oracle on these datasets. To examine standard practices aligned with this new learning setting, we investigate two LLM-based summary quality evaluation methods for model training and adopt a contrastive learning training method to leverage the LLM-guided learning signals. Our experiments on the CNN/DailyMail and XSum datasets demonstrate that smaller summarization models can achieve performance similar to LLMs under LLM-based evaluation. However, we find that the smaller models cannot yet reach LLM-level performance under human evaluation, despite the promising improvements brought by our proposed training methods. Meanwhile, we perform a meta-analysis of this new learning setting that reveals a discrepancy between human and LLM-based evaluation, highlighting the benefits and risks of the LLM-as-reference setting we investigated.
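
The contrastive training signal mentioned above can be sketched as a margin ranking loss over candidate summaries ordered by LLM-assigned quality scores, a common formulation for this kind of training; the exact loss used in the paper is not assumed here:

```python
import torch

def contrastive_rank_loss(log_probs: torch.Tensor,
                          margin: float = 0.01) -> torch.Tensor:
    """log_probs: model log-probabilities of candidate summaries, already
    sorted from highest to lowest LLM-assigned quality. Penalize the model
    whenever a lower-ranked candidate outscores a higher-ranked one."""
    loss = log_probs.new_zeros(())
    n = log_probs.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            gap = margin * (j - i)  # larger margin for larger rank gaps
            loss = loss + torch.clamp(gap - (log_probs[i] - log_probs[j]),
                                      min=0)
    return loss

# Example: three candidates, best-first by LLM score.
print(contrastive_rank_loss(torch.tensor([-1.0, -1.2, -0.9])))
```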


Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

arXiv.org Artificial Intelligence

While large language models (LLMs) already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement describing the desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluation of 5 LLM-based summarization systems. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods in total. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) none of the LLM-based evaluation methods achieves strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation. We make our collected benchmark, InstruSum, publicly available to facilitate future research in this direction.
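
Measuring "alignment with human annotators" typically reduces to a correlation computation between automatic and human scores. A minimal sketch using Kendall's tau, one common choice (the paper's exact alignment protocol is not assumed here):

```python
from scipy.stats import kendalltau

def evaluator_alignment(llm_scores, human_scores) -> float:
    """Rank correlation between an LLM evaluator's quality scores and
    human judgments over the same set of candidate summaries."""
    tau, _p_value = kendalltau(llm_scores, human_scores)
    return tau

# Toy example: four summaries scored by an LLM judge and by humans.
print(evaluator_alignment([4, 2, 5, 3], [3, 1, 5, 4]))
```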


QTSumm: Query-Focused Summarization over Tabular Data

arXiv.org Artificial Intelligence

People primarily consult tables to conduct data analysis or answer specific questions. Text generation systems that can provide accurate table summaries tailored to users' information needs can facilitate more efficient access to relevant data insights. Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary. We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables covering diverse topics. We investigate a set of strong baselines on QTSumm, including text generation, table-to-text generation, and large language models. Experimental results and manual analysis reveal that the new task presents significant challenges for table-to-text generation, motivating future research. Moreover, we propose a new approach named ReFactor that retrieves and reasons over query-relevant information from tabular data to generate natural language facts. Experimental results demonstrate that ReFactor improves the baselines by concatenating the generated facts to the model input. Our data and code are publicly available at https://github.com/yale-nlp/QTSumm.
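
A simplified rendering of the retrieve-then-concatenate idea behind ReFactor follows; scoring rows by token overlap is our simplification for illustration, as the paper's fact generation and reasoning are more involved:

```python
def retrieve_relevant_rows(table: dict, query: str, top_k: int = 3):
    """Score each table row by token overlap with the query and keep
    the top_k rows as query-relevant evidence.

    `table` is a dict with "header" (list[str]) and "rows" (list[list[str]]).
    """
    q_tokens = set(query.lower().split())

    def overlap(row):
        return len(q_tokens & set(" ".join(row).lower().split()))

    ranked = sorted(table["rows"], key=overlap, reverse=True)
    return ranked[:top_k]

def build_input(table: dict, query: str) -> str:
    """Concatenate retrieved facts with the query as the model input."""
    facts = [" | ".join(row) for row in retrieve_relevant_rows(table, query)]
    return query + "\n" + "\n".join(facts)
```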