
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

Liu, Yesheng, Li, Hao, Xu, Haiyu, Pei, Baoqi, Wang, Jiahao, Zhao, Mingxuan, Zheng, Jingshu, He, Zheqi, Yao, JG, Qin, Bowen, Yang, Xi, Zhang, Jiajun

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a different rewriting and verification scheme to each category. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
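
The rewrite-and-verify idea above can be pictured with a minimal Python sketch: route each MCQA item by the type of its gold answer, drop the options to form an open question, and pick a matching verifier. The function names, answer-type categories, and verification rules below are illustrative assumptions, not the authors' released implementation.

import re

def classify_answer_type(answer: str) -> str:
    """Rough answer-type heuristic: numeric, short phrase, or free-form."""
    if re.fullmatch(r"-?\d+(\.\d+)?%?", answer.strip()):
        return "numeric"
    if len(answer.split()) <= 3:
        return "short_phrase"
    return "free_form"

def rewrite_to_open_question(question: str) -> str:
    """Present the question without its options so the model must produce the answer."""
    return question.strip().rstrip("?") + "?"

def llm_judge(prediction: str, gold: str) -> bool:
    """Placeholder for an LLM-based equivalence check (not implemented here)."""
    raise NotImplementedError

def verify(prediction: str, gold: str, answer_type: str) -> bool:
    """Type-specific check: numeric match, normalized string match, or an LLM judge."""
    if answer_type == "numeric":
        return prediction.strip().rstrip("%") == gold.strip().rstrip("%")
    if answer_type == "short_phrase":
        return prediction.strip().lower() == gold.strip().lower()
    return llm_judge(prediction, gold)  # free-form answers: defer to an LLM judge

item = {"question": "What color is the traffic light in the image?", "answer": "green"}
open_q = rewrite_to_open_question(item["question"])
print(classify_answer_type(item["answer"]), "|", open_q)  # short_phrase | What color is ...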


Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Nuha, Ulin, Jatowt, Adam

arXiv.org Artificial Intelligence

The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
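
As a rough illustration of the back-translation filtering step mentioned above, the sketch below keeps only synthetic pairs whose Ladin translation round-trips back to something close to the original Italian sentence. The two translation callables and the surface-similarity threshold are hypothetical stand-ins, not the paper's actual pipeline.

from difflib import SequenceMatcher

def surface_similarity(a: str, b: str) -> float:
    """Cheap character-level similarity as a stand-in for a proper semantic metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_synthetic_pairs(italian_sentences, translate_it_to_lad, translate_lad_to_it,
                           threshold=0.8):
    """Keep (Italian, Ladin) pairs that survive a round-trip translation check."""
    kept = []
    for it_sent in italian_sentences:
        lad_sent = translate_it_to_lad(it_sent)
        back_translated = translate_lad_to_it(lad_sent)
        if surface_similarity(it_sent, back_translated) >= threshold:
            kept.append((it_sent, lad_sent))
    return kept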


Impact of Comments on LLM Comprehension of Legacy Code

Sabetto, Rock, Escamilla, Emily, Agarwal, Devesh, Kandwal, Sujay, Brunelle, Justin F., Rosen, Scott, Naik, Nitin, Thaker, Samruddhi, Scott, Eric O., Zimmer, Jacob, Madan, Amit, Sridharan, Arun, Wendt, Doug, Doyle, Michael, Glasz, Christopher, Phillips, Jasper, Macke, William, Diggs, Colin, Bartholf, Michael, Robin, Zachary, Ursino, Paul

arXiv.org Artificial Intelligence

Large language models (LLMs) have been increasingly integrated into software engineering and maintenance tasks due to their high performance on software engineering tasks and robust understanding of modern programming languages. However, the ability of LLMs to comprehend code written in legacy languages remains a research gap, challenged by real-world legacy systems whose documentation is missing or inaccurate in ways that may impact LLM comprehension. To objectively measure LLM comprehension of legacy languages, we need an efficient, quantitative evaluation method. We leverage multiple-choice question answering (MCQA), an emerging LLM evaluation methodology, to evaluate LLM comprehension of legacy code and the impact of comment prevalence and inaccurate comments. In this work, we present preliminary findings on the impact of documentation on LLM comprehension of legacy code and outline strategic objectives for future work.


Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Na, Injae, Noh, Keonwoong, Jung, Woohwan

arXiv.org Artificial Intelligence

LLM providers typically offer multiple LLM tiers that vary in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address this problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid-response rate across the top-k most similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
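
A minimal sketch of this escalation loop, assuming three hypothetical tier names, a crude query-similarity function, and caller-supplied generate() and judge() callables in place of real API calls; the 0.8 target rate is likewise an illustrative choice, not the paper's setting.

TIERS = ["mini", "standard", "pro"]  # hypothetical tiers, cheapest first

def token_overlap(a: str, b: str) -> float:
    """Crude query similarity used to retrieve comparable past questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def estimate_start_tier(question, history, k=5, target=0.8):
    """Pick the cheapest tier whose valid-response rate over the k most similar
    past queries reaches the target (the accuracy-estimator idea)."""
    neighbors = sorted(history, key=lambda r: -token_overlap(question, r["question"]))[:k]
    for tier in TIERS:
        hits = [r for r in neighbors if r["tier"] == tier]
        if hits and sum(r["valid"] for r in hits) / len(hits) >= target:
            return tier
    return TIERS[-1]  # no tier looks reliable for this query: start at the top

def llm_at(question, generate, judge, history):
    """Generate with the estimated starting tier and escalate while the judge rejects."""
    start = TIERS.index(estimate_start_tier(question, history))
    answer = None
    for tier in TIERS[start:]:
        answer = generate(question, tier)
        if judge(question, answer):
            return answer, tier
    return answer, TIERS[-1]  # best effort from the top tier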


Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Bhattacharya, Debarpan, Kulkarni, Apoorva, Ganapathy, Sriram

arXiv.org Artificial Intelligence

The popular success of text-based large language models (LLMs) has focused the attention of the multimodal community on combining other modalities, such as vision and audio, with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning-related tasks, which differ from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they consistently fall behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated, pointing to the need for holistic evaluation of LALMs in high-stakes applications. The dataset and code are made public.
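
The invariance-based uncertainty idea can be sketched in a few lines: answer the same question under several meaning-preserving rephrasings and score how often the model's answers agree. The paraphrase source and the exact scoring rule below are assumptions for illustration, not the paper's definition.

from collections import Counter

def invariance_uncertainty(answer_fn, question, paraphrases):
    """Return 1 minus the fraction of inputs whose answer matches the modal answer:
    0 means fully consistent under perturbation; values near 1 mean the model flips."""
    answers = [answer_fn(q) for q in [question, *paraphrases]]
    modal_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - modal_count / len(answers)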


Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Balepur, Nishant, Rudinger, Rachel, Boyd-Graber, Jordan Lee

arXiv.org Artificial Intelligence

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, like rubrics to guide MCQ writing, scoring methods to bridle guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.


The use of large language models to enhance cancer clinical trial educational materials

Gao, Mingye, Varshney, Aman, Chen, Shan, Goddla, Vikram, Gallifant, Jack, Doyle, Patrick, Novack, Claire, Dillon-Martin, Maeve, Perkins, Teresia, Correia, Xinrong, Duhaime, Erik, Isenstein, Howard, Sharon, Elad, Lehmann, Lisa Soleymani, Kozono, David, Anthony, Brian, Dligach, Dmitriy, Bitterman, Danielle S.

arXiv.org Artificial Intelligence

Cancer clinical trials often face challenges in recruitment and engagement due to a lack of participant-facing informational and educational resources. This study investigated the potential of Large Language Models (LLMs), specifically GPT4, in generating patient-friendly educational content from clinical trial informed consent forms. Using data from ClinicalTrials.gov, we employed zero-shot learning for creating trial summaries and one-shot learning for developing multiple-choice questions, evaluating their effectiveness through patient surveys and crowdsourced annotation. Results showed that GPT4-generated summaries were both readable and comprehensive, and may improve patients' understanding and interest in clinical trials. The multiple-choice questions demonstrated high accuracy and agreement with crowdsourced annotators. For both resource types, hallucinations were identified that require ongoing human oversight. The findings demonstrate the potential of LLMs "out-of-the-box" to support the generation of clinical trial education materials with minimal trial-specific engineering, but implementation with a human-in-the-loop is still needed to avoid misinformation risks.


Differentiating Choices via Commonality for Multiple-Choice Question Answering

Deng, Wenqing, Wang, Zhe, Wang, Kewen, Pan, Shirui, Zhang, Xiaowang, Feng, Zhiyong

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) becomes particularly challenging when all choices are relevant to the question and are semantically similar. Yet this setting of MCQA can potentially provide valuable clues for choosing the right answer. Existing models often rank each choice separately, overlooking the context provided by other choices. Specifically, they fail to leverage the semantic commonalities and nuances among the choices for reasoning. In this paper, we propose a novel MCQA model by differentiating choices through identifying and eliminating their commonality, called DCQA. Our model captures token-level attention of each choice to the question, and separates tokens of the question attended to by all the choices (i.e., commonalities) from those by individual choices (i.e., nuances). Using the nuances as refined contexts for the choices, our model can effectively differentiate choices with subtle differences and provide justifications for choosing the correct answer. We conduct comprehensive experiments across five commonly used MCQA benchmarks, demonstrating that DCQA consistently outperforms baseline models. Furthermore, our case study illustrates the effectiveness of the approach in directing the attention of the model to more differentiating features.
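
A toy version of the commonality/nuance split described above: given each choice's attention weights over the question tokens, mark the tokens a choice attends to, call tokens attended by every choice the commonality, and keep the per-choice remainder as nuances. Thresholding at the per-choice mean weight is an assumption for illustration; the actual model performs this separation internally.

import numpy as np

def split_commonality_nuance(attn, question_tokens):
    """attn: array of shape (num_choices, num_question_tokens) of attention weights."""
    attn = np.asarray(attn)
    attended = attn > attn.mean(axis=1, keepdims=True)  # per-choice salient tokens
    common_mask = attended.all(axis=0)                   # tokens attended by all choices
    commonality = [t for t, m in zip(question_tokens, common_mask) if m]
    nuances = [[t for t, m in zip(question_tokens, row & ~common_mask) if m]
               for row in attended]
    return commonality, nuances

tokens = ["which", "animal", "sleeps", "standing", "up"]
attn = [[0.1, 0.4, 0.1, 0.3, 0.1],   # choice A attends to "animal", "standing"
        [0.1, 0.4, 0.3, 0.1, 0.1]]   # choice B attends to "animal", "sleeps"
print(split_commonality_nuance(attn, tokens))  # (['animal'], [['standing'], ['sleeps']])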


Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?

Balepur, Nishant, Rudinger, Rachel

arXiv.org Artificial Intelligence

Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards just due to their ability to exploit choices-only shortcuts.


Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

Balepur, Nishant, Ravichander, Abhilasha, Rudinger, Rachel

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
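
The choices-only probe used in this line of work is easy to reproduce in outline: prompt the model with only the lettered options, parse a single-letter answer, and compare accuracy against a majority baseline. The prompt wording and the query_llm() callable below are placeholders rather than the paper's exact setup.

def choices_only_prompt(choices):
    """Build a prompt that shows the options but withholds the question."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return ("Select the most plausible answer using only the options below.\n"
            f"{lettered}\nAnswer with a single letter:")

def choices_only_accuracy(dataset, query_llm):
    """dataset: list of dicts with 'choices' (list of strings) and 'label' (gold index)."""
    correct = 0
    for example in dataset:
        reply = query_llm(choices_only_prompt(example["choices"]))
        predicted_letter = reply.strip()[:1].upper()
        correct += predicted_letter == chr(65 + example["label"])
    return correct / len(dataset)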