AITopics | answerability

Collaborating Authors

answerability

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Explainable Semantic Text Relations: A Question-Answering Framework for Comparing Document Content

Aperstein, Yehudit, Gottlib, Alon, Benita, Gal, Apartsin, Alexander

arXiv.org Artificial IntelligenceDec-2-2025

Understanding semantic relations between two texts is crucial for many information and document management tasks, in which one must determine whether the content fully overlaps, is completely superseded by another document, or overlaps only partially, with unique information in each. Beyond establishing this relation, it is equally important to provide explainable outputs that specify which pieces of information are present, missing, or newly added between the text pair. In this study, we formally define semantic relations between two texts through the set-theoretic relation between their respective Answerable Question Sets (AQS), the sets of questions each text can answer. Under this formulation, Semantic Text Relation (STR), such as equivalence, inclusion, and mutual overlap, becomes a well-defined set relation between the corresponding texts' AQSs. The set differences between the AQSs also serve as an explanation or diagnostic tool for identifying how the information in the texts diverges. Using this definition, we construct a synthetic benchmark that captures fine-grained informational relations through controlled paraphrasing and deliberate information removal supported by AQS manipulations. We then use this dataset to evaluate several discriminative and generative models for classifying text pairs into STR categories, assessing how well different model architectures capture semantic relations beyond surface-level similarity. We publicly release both the dataset and the data generation code to support further research.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.08304

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.48)

Industry:

Government (0.69)
Health & Medicine (0.68)
Law Enforcement & Public Safety (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

Gao, Joshua, Pham, Quoc Huy, Varghese, Subin, Saurav, Silwal, Hoskere, Vedhus

arXiv.org Artificial IntelligenceNov-7-2025

Abstract--Retrieval-Augmented Generation (RAG) is a critical technique for grounding Large Language Models (LLMs) in factual evidence, yet evaluating RAG systems in specialized, safety-critical domains remains a significant challenge. Existing evaluation frameworks often rely on heuristic-based metrics that fail to capture domain-specific nuances and other works utilize LLM-as-a-Judge approaches that lack validated alignment with human judgment. This paper introduces RAGalyst, an automated, human-aligned agentic framework designed for the rigorous evaluation of domain-specific RAG systems. RAGalyst features an agentic pipeline that generates high-quality, synthetic question-answering (QA) datasets from source documents, incorporating an agentic filtering step to ensure data fidelity. The framework refines two key LLM-as-a-Judge metrics--Answer Correctness and Answerability--using prompt optimization to achieve a strong correlation with human annotations. Applying this framework to evaluate various RAG components across three distinct domains (military operations, cybersecurity, and bridge engineering), we find that performance is highly context-dependent. No single embedding model, LLM, or hyperparam-eter configuration proves universally optimal. Additionally, we provide an analysis on the most common low Answer Correctness reasons in RAG. These findings highlight the necessity of a systematic evaluation framework like RAGalyst, which empowers practitioners to uncover domain-specific trade-offs and make informed design choices for building reliable and effective RAG systems. RAGalyst is available on our Github. Although modern Large Language Models (LLMs) are great synthesizers of information, they still suffer from hallucinations [1], [2], which refers to the generation of content that appears plausible but is factually incorrect or unsupported by evidence. Mitigating hallucinations is especially important in safety-critical applications (e.g., military operations, cybersecurity, and bridge engineering) where inaccurate information can lead to serious consequences and undermine trust in artificial intelligence (AI) systems [3], [4]. Retrieval-Augmented Generation (RAG) has been widely adopted to mitigate hallucinations by grounding responses in provided context [5], [6]. A key advantage of RAG is its ability to provide models with dynamic, inference-time access to private and domain-relevant documents [5], [7].

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2511.04502

Country: Asia > Middle East (0.46)

Genre: Research Report > New Finding (0.46)

Industry: Government > Military (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

QUDsim: Quantifying Discourse Similarities in LLM-Generated Text

Namuduri, Ramya, Wu, Yating, Zheng, Anshun Asher, Wadhwa, Manya, Durrett, Greg, Li, Junyi Jessy

arXiv.org Artificial IntelligenceAug-12-2025

As large language models become increasingly capable at various writing tasks, their weakness at generating unique and creative content becomes a major liability. Although LLMs have the ability to generate text covering diverse topics, there is an overall sense of repetitiveness across texts that we aim to formalize and quantify via a similarity metric. The familiarity between documents arises from the persistence of underlying discourse structures. However, existing similarity metrics dependent on lexical overlap and syntactic patterns largely capture $\textit{content}$ overlap, thus making them unsuitable for detecting $\textit{structural}$ similarities. We introduce an abstraction based on linguistic theories in Questions Under Discussion (QUD) and question semantics to help quantify differences in discourse progression. We then use this framework to build $\textbf{QUDsim}$, a similarity metric that can detect discursive parallels between documents. Using QUDsim, we find that LLMs often reuse discourse structures (more so than humans) across samples, even when content differs. Furthermore, LLMs are not only repetitive and structurally uniform, but are also divergent from human authors in the types of structures they use.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2504.09373

Country:

North America > United States (1.00)
Asia (1.00)
Europe (0.67)

Genre: Research Report (0.82)

Industry: Government > Regional Government > North America Government > United States Government (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.30)

Add feedback

Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models

Yoon, Eunseop, Yoon, Hee Suk, Hasegawa-Johnson, Mark A., Yoo, Chang D.

arXiv.org Artificial IntelligenceJul-8-2025

In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions-not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.04976

Country: Asia (0.93)

Genre:

Research Report (0.64)
Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Enhancing the Learning Experience: Using Vision-Language Models to Generate Questions for Educational Videos

Stamatakis, Markos, Berger, Joshua, Wartena, Christian, Ewerth, Ralph, Hoppe, Anett

arXiv.org Artificial IntelligenceMay-6-2025

Web-based educational videos offer flexible learning opportunities and are becoming increasingly popular. However, improving user engagement and knowledge retention remains a challenge. Automatically generated questions can activate learners and support their knowledge acquisition. Further, they can help teachers and learners assess their understanding. While large language and vision-language models have been employed in various tasks, their application to question generation for educational videos remains underexplored. In this paper, we investigate the capabilities of current vision-language models for generating learning-oriented questions for educational video content. We assess (1) out-of-the-box models' performance; (2) fine-tuning effects on content-specific question generation; (3) the impact of different video modalities on question quality; and (4) in a qualitative study, question relevance, answerability, and difficulty levels of generated questions. Our findings delineate the capabilities of current vision-language models, highlighting the need for fine-tuning and addressing challenges in question diversity and relevance. We identify requirements for future multimodal datasets and outline promising research directions.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.0179

Country:

Europe > Germany > Lower Saxony (0.28)
North America > United States > California (0.28)
Asia > Middle East > UAE (0.28)

Genre:

Instructional Material (0.94)
Research Report > New Finding (0.48)

Industry:

Education > Educational Technology > Audio & Video (1.00)
Education > Educational Setting (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Do Sparse Autoencoders Generalize? A Case Study of Answerability

Heindrich, Lovis, Torr, Philip, Barez, Fazl, Thost, Veronika

arXiv.org Artificial IntelligenceFeb-27-2025

Sparse autoencoders (SAEs) have emerged as a promising approach in language model interpretability, offering unsupervised extraction of sparse features. For interpretability methods to succeed, they must identify abstract features across domains, and these features can often manifest differently in each context. We examine this through "answerability"-a model's ability to recognize answerable questions. We extensively evaluate SAE feature generalization across diverse answerability datasets for Gemma 2 SAEs. Our analysis reveals that residual stream probes outperform SAE features within domains, but generalization performance differs sharply. SAE features demonstrate inconsistent transfer ability, and residual stream probes similarly show high variance out of distribution. Overall, this demonstrates the need for quantitative methods to predict feature generalization in SAE-based interpretability.

arxiv preprint arxiv, probe, sae feature, (13 more...)

arXiv.org Artificial Intelligence

2502.19964

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
South America > Chile (0.04)
North America > United States (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Behavioral Analysis of Information Salience in Large Language Models

Trienes, Jan, Schlötterer, Jörg, Li, Junyi Jessy, Seifert, Christin

arXiv.org Artificial IntelligenceFeb-20-2025

Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.

information, salience, summarization, (14 more...)

arXiv.org Artificial Intelligence

2502.14613

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

MORTAR: Metamorphic Multi-turn Testing for LLM-based Dialogue Systems

Guo, Guoxiang, Aleti, Aldeida, Neelofar, Neelofar, Tantithamthavorn, Chakkrit

arXiv.org Artificial IntelligenceDec-19-2024

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn scenarios. However, multi-turn dialogue testing remains underexplored, with the Oracle problem in multi-turn testing posing a persistent challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a MetamORphic multi-TuRn diAlogue testing appRoach, which mitigates the test oracle problem in the assessment of LLM-based dialogue systems. MORTAR automates the generation of follow-up question-answer (QA) dialogue test cases with multiple dialogue-level perturbations and metamorphic relations. MORTAR employs a novel knowledge graph-based dialogue information model which effectively generates perturbed dialogue test datasets and detects bugs of multi-turn dialogue systems in a low-cost manner. The proposed approach does not require an LLM as a judge, eliminating potential of any biases in the evaluation step. According to the experiment results on multiple LLM-based dialogue systems and comparisons with single-turn metamorphic testing approaches, MORTAR explores more unique bugs in LLM-based dialogue systems, especially for severe bugs that MORTAR detects up to four times more unique bugs than the most effective existing metamorphic testing approach.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.15557

Country:

Asia (0.46)
North America > Mexico (0.28)
Oceania > Australia > Victoria (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Assessing the Answerability of Queries in Retrieval-Augmented Code Generation

Kim, Geonmin, Kim, Jaeyeon, Park, Hancheol, Shin, Wooksu, Kim, Tae-Ho

arXiv.org Artificial IntelligenceNov-25-2024

Thanks to unprecedented language understanding and generation capabilities of large language model (LLM), Retrieval-augmented Code Generation (RaCG) has recently been widely utilized among software developers. While this has increased productivity, there are still frequent instances of incorrect codes being provided. In particular, there are cases where plausible yet incorrect codes are generated for queries from users that cannot be answered with the given queries and API descriptions. This study proposes a task for evaluating answerability, which assesses whether valid answers can be generated based on users' queries and retrieved APIs in RaCG. Additionally, we build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models performing this task. Experimental results show that this task remains at a very challenging level, with baseline models exhibiting a low performance of 46.7%. Furthermore, this study discusses methods that could significantly improve performance. Figure 1: An example of an LLM generating plausible code even when the request is made outside the functionality provided by the library.

dataset, information, query, (17 more...)

arXiv.org Artificial Intelligence

2411.05547

Country:

Asia > Singapore (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
North America > Dominican Republic (0.04)
(2 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

Add feedback

Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference

Thorne, William, Robinson, Ambrose, Peng, Bohua, Lin, Chenghua, Maynard, Diana

arXiv.org Artificial IntelligenceOct-10-2024

As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it's equally important to assess individual components. We target the final, answering task, which is well-suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) from synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method's effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.08289

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.05)
North America > United States > Pennsylvania (0.04)
North America > United States > New York > New York County > New York City (0.04)
(8 more...)

Genre: Research Report (0.82)

Industry: Education > Assessment & Standards (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback