
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT

Liu, Yesheng, Li, Hao, Xu, Haiyu, Pei, Baoqi, Wang, Jiahao, Zhao, Mingxuan, Zheng, Jingshu, He, Zheqi, Yao, JG, Qin, Bowen, Yang, Xi, Zhang, Jiajun

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) has been a popular format for evaluating and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows for simplified, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behaviors during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies a different rewriting and verification scheme to each category. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
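
The rewrite-and-verify idea above can be pictured with a minimal Python sketch: route each MCQA item by the type of its gold answer, drop the options to form an open question, and pick a matching verifier. The function names, answer-type categories, and verification rules below are illustrative assumptions, not the authors' released implementation.

import re

def classify_answer_type(answer: str) -> str:
    """Rough answer-type heuristic: numeric, short phrase, or free-form."""
    if re.fullmatch(r"-?\d+(\.\d+)?%?", answer.strip()):
        return "numeric"
    if len(answer.split()) <= 3:
        return "short_phrase"
    return "free_form"

def rewrite_to_open_question(question: str) -> str:
    """Present the question without its options so the model must produce the answer."""
    return question.strip().rstrip("?") + "?"

def llm_judge(prediction: str, gold: str) -> bool:
    """Placeholder for an LLM-based equivalence check (not implemented here)."""
    raise NotImplementedError

def verify(prediction: str, gold: str, answer_type: str) -> bool:
    """Type-specific check: numeric match, normalized string match, or an LLM judge."""
    if answer_type == "numeric":
        return prediction.strip().rstrip("%") == gold.strip().rstrip("%")
    if answer_type == "short_phrase":
        return prediction.strip().lower() == gold.strip().lower()
    return llm_judge(prediction, gold)  # free-form answers: defer to an LLM judge

item = {"question": "What color is the traffic light in the image?", "answer": "green"}
open_q = rewrite_to_open_question(item["question"])
print(classify_answer_type(item["answer"]), "|", open_q)  # short_phrase | What color is ...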


Exploring NLP Benchmarks in an Extremely Low-Resource Setting

Nuha, Ulin, Jatowt, Adam

arXiv.org Artificial Intelligence

The effectiveness of Large Language Models (LLMs) diminishes for extremely low-resource languages, such as indigenous languages, primarily due to the lack of labeled data. Despite growing interest, the availability of high-quality natural language processing (NLP) datasets for these languages remains limited, making it difficult to develop robust language technologies. This paper addresses this gap by focusing on Ladin, an endangered Romance language, specifically targeting the Val Badia variant. Leveraging a small set of parallel Ladin-Italian sentence pairs, we create synthetic datasets for sentiment analysis and multiple-choice question answering (MCQA) by translating monolingual Italian data. To ensure linguistic quality and reliability, we apply rigorous filtering and back-translation procedures. We further demonstrate that incorporating these synthetic datasets into machine translation training leads to substantial improvements over existing Italian-Ladin translation baselines. Our contributions include the first publicly available sentiment analysis and MCQA datasets for Ladin, establishing foundational resources that can support broader NLP research and downstream applications for this underrepresented language.
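
As a rough illustration of the back-translation filtering step mentioned above, the sketch below keeps only synthetic pairs whose Ladin translation round-trips back to something close to the original Italian sentence. The two translation callables and the surface-similarity threshold are hypothetical stand-ins, not the paper's actual pipeline.

from difflib import SequenceMatcher

def surface_similarity(a: str, b: str) -> float:
    """Cheap character-level similarity as a stand-in for a proper semantic metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def filter_synthetic_pairs(italian_sentences, translate_it_to_lad, translate_lad_to_it,
                           threshold=0.8):
    """Keep (Italian, Ladin) pairs that survive a round-trip translation check."""
    kept = []
    for it_sent in italian_sentences:
        lad_sent = translate_it_to_lad(it_sent)
        back_translated = translate_lad_to_it(lad_sent)
        if surface_similarity(it_sent, back_translated) >= threshold:
            kept.append((it_sent, lad_sent))
    return kept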


Impact of Comments on LLM Comprehension of Legacy Code

Sabetto, Rock, Escamilla, Emily, Agarwal, Devesh, Kandwal, Sujay, Brunelle, Justin F., Rosen, Scott, Naik, Nitin, Thaker, Samruddhi, Scott, Eric O., Zimmer, Jacob, Madan, Amit, Sridharan, Arun, Wendt, Doug, Doyle, Michael, Glasz, Christopher, Phillips, Jasper, Macke, William, Diggs, Colin, Bartholf, Michael, Robin, Zachary, Ursino, Paul

arXiv.org Artificial Intelligence

Large language models (LLMs) have been increasingly integrated into software engineering and maintenance tasks due to their high performance on software engineering tasks and robust understanding of modern programming languages. However, the ability of LLMs to comprehend code written in legacy languages remains a research gap, challenged by real-world legacy systems whose documentation is missing or inaccurate in ways that may impact LLM comprehension. To objectively measure LLM comprehension of legacy languages, we need an efficient, quantitative evaluation method. We leverage multiple-choice question answering (MCQA), an emerging LLM evaluation methodology, to evaluate LLM comprehension of legacy code and the impact of comment prevalence and inaccurate comments. In this work, we present preliminary findings on the impact of documentation on LLM comprehension of legacy code and outline strategic objectives for future work.


Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models

Na, Injae, Noh, Keonwoong, Jung, Woohwan

arXiv.org Artificial Intelligence

LLM providers typically offer multiple LLM tiers that vary in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address this problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid-response rate across the top-k most similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
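
A minimal sketch of this escalation loop, assuming three hypothetical tier names, a crude query-similarity function, and caller-supplied generate() and judge() callables in place of real API calls; the 0.8 target rate is likewise an illustrative choice, not the paper's setting.

TIERS = ["mini", "standard", "pro"]  # hypothetical tiers, cheapest first

def token_overlap(a: str, b: str) -> float:
    """Crude query similarity used to retrieve comparable past questions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def estimate_start_tier(question, history, k=5, target=0.8):
    """Pick the cheapest tier whose valid-response rate over the k most similar
    past queries reaches the target (the accuracy-estimator idea)."""
    neighbors = sorted(history, key=lambda r: -token_overlap(question, r["question"]))[:k]
    for tier in TIERS:
        hits = [r for r in neighbors if r["tier"] == tier]
        if hits and sum(r["valid"] for r in hits) / len(hits) >= target:
            return tier
    return TIERS[-1]  # no tier looks reliable for this query: start at the top

def llm_at(question, generate, judge, history):
    """Generate with the estimated starting tier and escalate while the judge rejects."""
    start = TIERS.index(estimate_start_tier(question, history))
    answer = None
    for tier in TIERS[start:]:
        answer = generate(question, tier)
        if judge(question, answer):
            return answer, tier
    return answer, TIERS[-1]  # best effort from the top tier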


Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning

Bhattacharya, Debarpan, Kulkarni, Apoorva, Ganapathy, Sriram

arXiv.org Artificial Intelligence

The popular success of text-based large language models (LLMs) has focused the attention of the multimodal community on combining other modalities, such as vision and audio, with text to achieve similar multimodal capabilities. In this quest, large audio language models (LALMs) have to be evaluated on reasoning-related tasks, which differ from traditional classification or generation tasks. Towards this goal, we propose a novel dataset called temporal reasoning evaluation of audio (TREA). We benchmark open-source LALMs and observe that they consistently fall behind human capabilities on the tasks in the TREA dataset. While evaluating LALMs, we also propose an uncertainty metric, which computes the invariance of the model to semantically identical perturbations of the input. Our analysis shows that the accuracy and uncertainty metrics are not necessarily correlated, pointing to the need for holistic evaluation of LALMs in high-stakes applications. The dataset and code are made public.
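
The invariance-based uncertainty idea can be sketched in a few lines: answer the same question under several meaning-preserving rephrasings and score how often the model's answers agree. The paraphrase source and the exact scoring rule below are assumptions for illustration, not the paper's definition.

from collections import Counter

def invariance_uncertainty(answer_fn, question, paraphrases):
    """Return 1 minus the fraction of inputs whose answer matches the modal answer:
    0 means fully consistent under perturbation; values near 1 mean the model flips."""
    answers = [answer_fn(q) for q in [question, *paraphrases]]
    modal_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - modal_count / len(answers)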


Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Balepur, Nishant, Rudinger, Rachel, Boyd-Graber, Jordan Lee

arXiv.org Artificial Intelligence

Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, like rubrics to guide MCQ writing, scoring methods to bridle guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.


The use of large language models to enhance cancer clinical trial educational materials

Gao, Mingye, Varshney, Aman, Chen, Shan, Goddla, Vikram, Gallifant, Jack, Doyle, Patrick, Novack, Claire, Dillon-Martin, Maeve, Perkins, Teresia, Correia, Xinrong, Duhaime, Erik, Isenstein, Howard, Sharon, Elad, Lehmann, Lisa Soleymani, Kozono, David, Anthony, Brian, Dligach, Dmitriy, Bitterman, Danielle S.

arXiv.org Artificial Intelligence

Cancer clinical trials often face challenges in recruitment and engagement due to a lack of participant-facing informational and educational resources. This study investigated the potential of Large Language Models (LLMs), specifically GPT4, in generating patient-friendly educational content from clinical trial informed consent forms. Using data from ClinicalTrials.gov, we employed zero-shot learning for creating trial summaries and one-shot learning for developing multiple-choice questions, evaluating their effectiveness through patient surveys and crowdsourced annotation. Results showed that GPT4-generated summaries were both readable and comprehensive, and may improve patients' understanding and interest in clinical trials. The multiple-choice questions demonstrated high accuracy and agreement with crowdsourced annotators. For both resource types, hallucinations were identified that require ongoing human oversight. The findings demonstrate the potential of LLMs "out-of-the-box" to support the generation of clinical trial education materials with minimal trial-specific engineering, but implementation with a human-in-the-loop is still needed to avoid misinformation risks.


Differentiating Choices via Commonality for Multiple-Choice Question Answering

Deng, Wenqing, Wang, Zhe, Wang, Kewen, Pan, Shirui, Zhang, Xiaowang, Feng, Zhiyong

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) becomes particularly challenging when all choices are relevant to the question and are semantically similar. Yet this setting of MCQA can potentially provide valuable clues for choosing the right answer. Existing models often rank each choice separately, overlooking the context provided by other choices. Specifically, they fail to leverage the semantic commonalities and nuances among the choices for reasoning. In this paper, we propose a novel MCQA model by differentiating choices through identifying and eliminating their commonality, called DCQA. Our model captures token-level attention of each choice to the question, and separates tokens of the question attended to by all the choices (i.e., commonalities) from those by individual choices (i.e., nuances). Using the nuances as refined contexts for the choices, our model can effectively differentiate choices with subtle differences and provide justifications for choosing the correct answer. We conduct comprehensive experiments across five commonly used MCQA benchmarks, demonstrating that DCQA consistently outperforms baseline models. Furthermore, our case study illustrates the effectiveness of the approach in directing the attention of the model to more differentiating features.
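
A toy version of the commonality/nuance split described above: given each choice's attention weights over the question tokens, mark the tokens a choice attends to, call tokens attended by every choice the commonality, and keep the per-choice remainder as nuances. Thresholding at the per-choice mean weight is an assumption for illustration; the actual model performs this separation internally.

import numpy as np

def split_commonality_nuance(attn, question_tokens):
    """attn: array of shape (num_choices, num_question_tokens) of attention weights."""
    attn = np.asarray(attn)
    attended = attn > attn.mean(axis=1, keepdims=True)  # per-choice salient tokens
    common_mask = attended.all(axis=0)                   # tokens attended by all choices
    commonality = [t for t, m in zip(question_tokens, common_mask) if m]
    nuances = [[t for t, m in zip(question_tokens, row & ~common_mask) if m]
               for row in attended]
    return commonality, nuances

tokens = ["which", "animal", "sleeps", "standing", "up"]
attn = [[0.1, 0.4, 0.1, 0.3, 0.1],   # choice A attends to "animal", "standing"
        [0.1, 0.4, 0.3, 0.1, 0.1]]   # choice B attends to "animal", "sleeps"
print(split_commonality_nuance(attn, tokens))  # (['animal'], [['standing'], ['sleeps']])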


Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?

Balepur, Nishant, Rudinger, Rachel

arXiv.org Artificial Intelligence

Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards just due to their ability to exploit choices-only shortcuts.


Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

Balepur, Nishant, Ravichander, Abhilasha, Rudinger, Rachel

arXiv.org Artificial Intelligence

Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
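
The choices-only probe used in this line of work is easy to reproduce in outline: prompt the model with only the lettered options, parse a single-letter answer, and compare accuracy against a majority baseline. The prompt wording and the query_llm() callable below are placeholders rather than the paper's exact setup.

def choices_only_prompt(choices):
    """Build a prompt that shows the options but withholds the question."""
    lettered = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return ("Select the most plausible answer using only the options below.\n"
            f"{lettered}\nAnswer with a single letter:")

def choices_only_accuracy(dataset, query_llm):
    """dataset: list of dicts with 'choices' (list of strings) and 'label' (gold index)."""
    correct = 0
    for example in dataset:
        reply = query_llm(choices_only_prompt(example["choices"]))
        predicted_letter = reply.strip()[:1].upper()
        correct += predicted_letter == chr(65 + example["label"])
    return correct / len(dataset)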