AITopics | comparison question

Collaborating Authors

comparison question

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

d30d0f522a86b3665d8e3a9a91472e28-Supplemental.pdf

Neural Information Processing SystemsFeb-11-2026, 07:30:34 GMT

crowd worker, hypothesis, titan rtx gpus, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

Add feedback

TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

Zhou, Yongxin, Mulhem, Philippe, Schwab, Didier

arXiv.org Artificial IntelligenceDec-2-2025

The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2512.01183

Country:

Europe (1.00)
North America > United States (0.94)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions

Shen, Zhiyu, Liu, Jiyuan, Pang, Yunhe, Rao, Yanghui

arXiv.org Artificial IntelligenceOct-9-2025

Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced QA models, especially in domains with scarce resources.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2505.15087

Country:

Europe (1.00)
North America > United States > California (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

Add feedback

A Proofs

Neural Information Processing SystemsAug-22-2025, 01:13:42 GMT

In this section, we give full proofs of the two main theorems in the paper. 's invertibility and equality 9 follows from Definition 5. Then for By Jensen's inequality, we have: ξ In this section, we give more details of the algorithms we used in the paper. For each i { 1, 2,...,m }, there are n Our code is written with PyTorch. Section 4 and we choose our hyperparameters by the validation performance on the dev sets. The majority of the MNLI corpus is released under the OANC's license, and CMLE method (see Equation 4).

artificial intelligence, hypothesis, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)

Add feedback

STOC-TOT: Stochastic Tree-of-Thought with Constrained Decoding for Complex Reasoning in Multi-Hop Question Answering

Bi, Zhenyu, Hajialigol, Daniel, Sun, Zhongkai, Hao, Jie, Wang, Xuan

arXiv.org Artificial IntelligenceJul-4-2024

Multi-hop question answering (MHQA) requires a model to retrieve and integrate information from multiple passages to answer a complex question. Recent systems leverage the power of large language models and integrate evidence retrieval with reasoning prompts (e.g., chain-of-thought reasoning) for the MHQA task. However, the complexities in the question types (bridge v.s. comparison questions) and the reasoning types (sequential v.s. parallel reasonings) require more novel and fine-grained prompting methods to enhance the performance of MHQA under the zero-shot setting. In this paper, we propose STOC-TOT, a stochastic tree-of-thought reasoning prompting method with constrained decoding for MHQA and conduct a detailed comparison with other reasoning prompts on different question types and reasoning types. Specifically, we construct a tree-like reasoning structure by prompting the model to break down the original question into smaller sub-questions to form different reasoning paths. In addition, we prompt the model to provide a probability estimation for each reasoning path at each reasoning step. At answer time, we conduct constrained decoding on the model to generate more grounded answers and reduce hallucination. Experiments comparing STOC-TOT with two MHQA datasets and five large language models showed that our framework outperforms other reasoning prompts by a significant margin.

dataset, reasoning, reasoning path, (16 more...)

arXiv.org Artificial Intelligence

2407.03687

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
North America > United States > Virginia (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Bayesian Preference Elicitation with Language Models

Handa, Kunal, Gal, Yarin, Pavlick, Ellie, Goodman, Noah, Andreas, Jacob, Tamkin, Alex, Li, Belinda Z.

arXiv.org Artificial IntelligenceMar-8-2024

Aligning AI systems to users' interests requires understanding and incorporating humans' complex values and preferences. Recently, language models (LMs) have been used to gather information about the preferences of human users. This preference data can be used to fine-tune or guide other LMs and/or AI systems. However, LMs have been shown to struggle with crucial aspects of preference learning: quantifying uncertainty, modeling human mental states, and asking informative questions. These challenges have been addressed in other areas of machine learning, such as Bayesian Optimal Experimental Design (BOED), which focus on designing informative queries within a well-defined feature space. But these methods, in turn, are difficult to scale and apply to real-world problems where simply identifying the relevant features can be difficult. We introduce OPEN (Optimal Preference Elicitation with Natural language) a framework that uses BOED to guide the choice of informative questions and an LM to extract features and translate abstract BOED queries into natural language questions. By combining the flexibility of LMs with the rigor of BOED, OPEN can optimize the informativity of queries while remaining adaptable to real-world domains. In user studies, we find that OPEN outperforms existing LM- and BOED-based methods for preference elicitation.

comparison question, human preference, user preference, (15 more...)

arXiv.org Artificial Intelligence

2403.05534

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.28)
North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(6 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Consumer Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

POSQA: Probe the World Models of LLMs with Size Comparisons

Shu, Chang, Han, Jiuzhou, Liu, Fangyu, Shareghi, Ehsan, Collier, Nigel

arXiv.org Artificial IntelligenceOct-20-2023

Embodied language comprehension emphasizes that language understanding is not solely a matter of mental processing in the brain but also involves interactions with the physical and social environment. With the explosive growth of Large Language Models (LLMs) and their already ubiquitous presence in our daily lives, it is becoming increasingly necessary to verify their real-world understanding. Inspired by cognitive theories, we propose POSQA: a Physical Object Size Question Answering dataset with simple size comparison questions to examine the extremity and analyze the potential mechanisms of the embodied comprehension of the latest LLMs. We show that even the largest LLMs today perform poorly under the zero-shot setting. We then push their limits with advanced prompting techniques and external knowledge augmentation. Furthermore, we investigate whether their real-world comprehension primarily derives from contextual information or internal weights and analyse the impact of prompt formats and report bias of different objects. Our results show that real-world understanding that LLMs shaped from textual data can be vulnerable to deception and confusion by the surface form of prompts, which makes it less aligned with human behaviours.

information, knowledge, llm, (14 more...)

arXiv.org Artificial Intelligence

2310.13394

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Europe > Austria (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
Health & Medicine > Therapeutic Area > Immunology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering

Ho, Xanh, Nguyen, Anh-Khoa Duong, Sugawara, Saku, Aizawa, Akiko

arXiv.org Artificial IntelligenceFeb-12-2023

To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question as to how effective UR tasks are for the QA task when training models on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effectiveness of UR tasks (including both sentence-level and entity-level tasks) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. While the previous models have not been explicitly trained on an entity-level reasoning prediction task, we build a multi-task model that performs three tasks together: sentence-level supporting facts prediction, entity-level reasoning prediction, and answer prediction. Experimental results on 2WikiMultiHopQA and HotpotQA-small datasets reveal that (1) UR tasks can improve QA performance. Using four debiased datasets that are newly created, we demonstrate that (2) UR tasks are helpful in preventing reasoning shortcuts in the multi-hop QA task. However, we find that (3) UR tasks do not contribute to improving the robustness of the model on adversarial questions, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of entity-level reasoning in the form of natural language questions (e.g., sub-question forms).

hotpotqa-small, machine learning, question answering, (20 more...)

arXiv.org Artificial Intelligence

2302.05963

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Italy > Tuscany > Florence (0.05)
(14 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media > Film (0.68)
Leisure & Entertainment (0.68)
Transportation > Air (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

How Well Do Multi-hop Reading Comprehension Models Understand Date Information?

Ho, Xanh, Sugawara, Saku, Aizawa, Akiko

arXiv.org Artificial IntelligenceOct-11-2022

Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, \textit{HieraDate}, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and our dataset can be used for data augmentation to improve the robustness of the models.

artificial intelligence, natural language, question answering, (18 more...)

arXiv.org Artificial Intelligence

2210.05208

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Italy > Tuscany > Florence (0.05)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Assessment & Standards > Student Performance (0.61)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.34)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.34)

Add feedback

Multi-hop Reading Comprehension through Question Decomposition and Rescoring

Min, Sewon, Zhong, Victor, Zettlemoyer, Luke, Hajishirzi, Hannaneh

arXiv.org Artificial IntelligenceJun-7-2019

Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast sub-question generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e. the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves the state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.

machine learning, natural language, question answering, (17 more...)

arXiv.org Artificial Intelligence

1906.02916

Country:

North America > United States > Virginia (0.05)
Oceania > Northern Mariana Islands > Saipan > Saipan (0.04)
North America > United States > Ohio (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry:

Education > Assessment & Standards > Student Performance (0.62)
Leisure & Entertainment > Sports > Basketball (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.49)

Add feedback