AITopics

2406.1799

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(6 more...)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

Sung, Yoo Yeon, Fleisig, Eve, Mondal, Ishani, Boyd-Graber, Jordan Lee

ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks

arXiv.org Artificial IntelligenceJun-24-2024

Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset is and exposes the features that make data adversarial. We then use ADVSCORE to underpin a dataset creation pipeline that incentivizes writing a high-quality adversarial dataset. As a proof of concept, we use ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from our pipeline. The high-quality questions in ADVQA surpasses three adversarial benchmarks across domains at fooling several models but not humans. We validate our result based on difficulty estimates from 9,347 human responses on four datasets and predictions from three models. Moreover, ADVSCORE uncovers which adversarial tactics used by human writers fool models (e.g., GPT-4) but not humans. Through ADVSCORE and its analyses, we offer guidance on revealing language model vulnerabilities and producing reliable adversarial examples.

computational linguistic, dataset, proceedings, (15 more...)

2406.16342

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Maryland (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(17 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Education (0.93)
Government (0.67)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.49)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.48)
(2 more...)

Prabhu, Venktesh V. Deepali, Anand, Avishek

DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs

arXiv.org Artificial IntelligenceJun-24-2024

Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions could stem from questions being compositional, hybrid evidence, or ambiguity in questions. While retrieval performance for classical QA tasks is well explored, their capabilities for heterogeneous complex retrieval tasks, especially in an open-domain setting, and the impact on downstream QA performance, are relatively unexplored. To address this, in this work, we propose a benchmark composing diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on their reasoning capabilities. Through experiments, we observe that much progress is to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at https://github.com/VenkteshV/DEXTER

computational linguistic, dataset, reasoning, (15 more...)

2406.17158

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Jordan (0.04)
(14 more...)

Genre: Research Report > New Finding (0.67)

Industry: Leisure & Entertainment (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Leite, Bernardo, Osório, Tomás Freitas, Cardoso, Henrique Lopes

FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages

arXiv.org Artificial IntelligenceJun-24-2024

Question Answering (QA) datasets are crucial in assessing reading comprehension skills for both machines and humans. While numerous datasets have been developed in English for this purpose, a noticeable void exists in less-resourced languages. To alleviate this gap, our paper introduces machine-translated versions of FairytaleQA, a renowned QA dataset designed to assess and enhance narrative comprehension skills in young children. By employing fine-tuned, modest-scale models, we establish benchmarks for both Question Generation (QG) and QA tasks within the translated datasets. In addition, we present a case study proposing a model for generating question-answer pairs, with an evaluation incorporating quality metrics such as question well-formedness, answerability, relevance, and children suitability. Our evaluation prioritizes quantifying and describing error cases, along with providing directions for future work. This paper contributes to the advancement of QA and QG research in less-resourced languages, promoting accessibility and inclusivity in the development of these models for reading comprehension. The code and data is publicly available at github.com/bernardoleite/fairytaleqa-translated.

dataset, qa pair, translation, (15 more...)

2406.04233

Country:

Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Switzerland (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Education > Assessment & Standards > Student Performance (0.55)
Education > Educational Setting > K-12 Education (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

arXiv.org Artificial IntelligenceJun-21-2024

Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis

Fan, Lin, Gong, Xun, Zheng, Cenyang, Ou, Yafei

The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.

inference, reasoning, tri-vqa, (17 more...)

2406.1505

Country:

Asia > China > Sichuan Province > Chengdu (0.05)
Europe > Switzerland (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
(3 more...)

Genre: Research Report > New Finding (0.66)

Industry:

Health & Medicine > Diagnostic Medicine > Imaging (0.95)
Health & Medicine > Therapeutic Area (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)
(2 more...)

FoRAG: Factuality-optimized Retrieval Augmented Generation for Web-enhanced Long-form Question Answering

Cai, Tianchi, Tan, Zhiwen, Song, Xierui, Sun, Tao, Jiang, Jiyan, Xu, Yunqi, Zhang, Yinger, Gu, Jinjie

Retrieval Augmented Generation (RAG) has become prevalent in question-answering (QA) tasks due to its ability of utilizing search engine to enhance the quality of long-form question-answering (LFQA). Despite the emergence of various open source methods and web-enhanced commercial systems such as Bing Chat, two critical problems remain unsolved, i.e., the lack of factuality and clear logic in the generated long-form answers. In this paper, we remedy these issues via a systematic study on answer generation in web-enhanced LFQA. Specifically, we first propose a novel outline-enhanced generator to achieve clear logic in the generation of multifaceted answers and construct two datasets accordingly. Then we propose a factuality optimization method based on a carefully designed doubly fine-grained RLHF framework, which contains automatic evaluation and reward modeling in different levels of granularity. Our generic framework comprises conventional fine-grained RLHF methods as special cases. Extensive experiments verify the superiority of our proposed \textit{Factuality-optimized RAG (FoRAG)} method on both English and Chinese benchmarks. In particular, when applying our method to Llama2-7B-chat, the derived model FoRAG-L-7B outperforms WebGPT-175B in terms of three commonly used metrics (i.e., coherence, helpfulness, and factuality), while the number of parameters is much smaller (only 1/24 of that of WebGPT-175B). Our datasets and models are made publicly available for better reproducibility: https://huggingface.co/forag.

barcelona, dataset, factuality-optimized retrieval augmented generation, (10 more...)

doi: 10.1145/3637528.3672065

2406.13779

Country:

Europe > United Kingdom (0.14)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.06)
Asia > Middle East > Syria (0.04)
(21 more...)

Genre:

Research Report (1.00)
Personal > Honors (0.68)

Industry:

Education (1.00)
Energy > Power Industry > Utilities > Nuclear (0.46)
Health & Medicine > Health Care Technology > Telehealth (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

An, Kaikai, Yang, Fangkai, Li, Liqun, Lu, Junting, Cheng, Sitao, Wang, Lu, Zhao, Pu, Cao, Lele, Lin, Qingwei, Rajmohan, Saravan, Zhang, Dongmei, Zhang, Qi

Current question answering systems leveraging retrieval augmented generation perform well in answering factoid questions but face challenges with non-factoid questions, particularly how-to queries requiring detailed step-by-step instructions and explanations. In this paper, we introduce Thread, a novel data organization paradigm that transforms documents into logic units based on their inter-connectivity. Extensive experiments across open-domain and industrial scenarios demonstrate that Thread outperforms existing data organization paradigms in RAG-based QA systems, significantly improving the handling of how-to questions.

dataset, how-to question, paradigm, (15 more...)

2406.13372

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(5 more...)

Genre: Workflow (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.85)

Enhancing Distractor Generation for Multiple-Choice Questions with Retrieval Augmented Pretraining and Knowledge Graph Integration

Yu, Han-Cheng, Shih, Yu-An, Law, Kin-Man, Hsieh, Kai-Yu, Cheng, Yu-Chen, Ho, Hsin-Chih, Lin, Zih-An, Hsu, Wen-Chuan, Fan, Yao-Chung

In this paper, we tackle the task of distractor generation (DG) for multiple-choice questions. Our study introduces two key designs. First, we propose \textit{retrieval augmented pretraining}, which involves refining the language model pretraining to align it more closely with the downstream task of DG. Second, we explore the integration of knowledge graphs to enhance the performance of DG. Through experiments with benchmarking datasets, we show that our models significantly outperform the state-of-the-art results. Our best-performing model advances the F1@3 score from 14.80 to 16.47 in MCQ dataset and from 15.92 to 16.50 in Sciq dataset.

dataset, distractor, triplet, (13 more...)

2406.13578

Country: Asia > Taiwan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.85)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.62)

Srivastava, Akchay, Memon, Atif

Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models

Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. Recent advances stem from the confluence of several factors, such as large-scale training datasets, deep learning techniques, and the rise of large language models. High-quality datasets are used to train models on realistic scenarios and enable the evaluation of the system on potentially unseen data. Standardized metrics facilitate comparisons between different ODQA systems, allowing researchers to objectively track advancements in the field. Our study presents a thorough examination of the current landscape of ODQA benchmarking by reviewing 52 datasets and 20 evaluation techniques across textual and multimodal modalities. We introduce a novel taxonomy for ODQA datasets that incorporates both the modality and difficulty of the question types. Additionally, we present a structured organization of ODQA evaluation metrics along with a critical analysis of their inherent trade-offs. Our study aims to empower researchers by providing a framework for the robust evaluation of modern question-answering systems. We conclude by identifying the current challenges and outlining promising avenues for future research and development.

computational linguistic, dataset, linguistic, (14 more...)

2406.13232

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Italy > Tuscany > Florence (0.04)
Europe > France (0.04)
(23 more...)

Genre:

Overview (1.00)
Research Report > New Finding (0.45)

Industry:

Media (1.00)
Leisure & Entertainment (0.67)
Government > Regional Government (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceJun-17-2024

Reference-based Metrics Disprove Themselves in Question Generation

Nguyen, Bang, Yu, Mengxia, Huang, Yun, Jiang, Meng

Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.

complexity, computational linguistic, metric, (15 more...)

2403.12242

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
(11 more...)

Genre: Research Report > New Finding (0.34)

Industry:

Government > Military (0.93)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)