Our findings reveal that: (I) LLMs with higher accuracy may exhibit lower certainty; (II) larger-scale LLMs may display greater uncertainty than their smaller counterparts; and (III) instruction fine-tuning tends to increase the uncertainty of LLMs. These results underscore the importance of incorporating uncertainty into the evaluation of LLMs.
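One common way to quantify the kind of uncertainty discussed above is the predictive entropy of the model's distribution over an MCQ's answer options. This is a generic illustration of that idea, not necessarily the exact metric used in the paper:

```python
import math

def predictive_entropy(option_logits):
    """Entropy (in nats) of the softmax distribution over answer-option
    logits. Higher entropy means the model is less certain."""
    m = max(option_logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in option_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A model that concentrates mass on one option has low entropy ...
confident = predictive_entropy([8.0, 0.0, 0.0, 0.0])
# ... while a uniform distribution over 4 options gives the maximum, ln 4.
uncertain = predictive_entropy([1.0, 1.0, 1.0, 1.0])
```

Under this lens, a model can be accurate (top option is correct) while still carrying high entropy, which is exactly the accuracy/certainty dissociation the findings describe.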
Reasoning Models Ace the CFA Exams
Patel, Jaisal, Chen, Yunzhe, He, Kaiwen, Wang, Keyi, Li, David, Xiao, Kairong, Liu, Xiao-Yang
Previous research has reported that large language models (LLMs) demonstrate poor performance on the Chartered Financial Analyst (CFA) exams. However, recent reasoning models have achieved strong results on graduate-level academic and professional examinations across various disciplines. In this paper, we evaluate state-of-the-art reasoning models on a set of mock CFA exams consisting of 980 questions across three Level I exams, two Level II exams, and three Level III exams. Using the same pass/fail criteria from prior studies, we find that most models clear all three levels. The models that pass, ordered by overall performance, are Gemini 3.0 Pro, Gemini 2.5 Pro, GPT-5, Grok 4, Claude Opus 4.1, and DeepSeek-V3.1. Specifically, Gemini 3.0 Pro achieves a record score of 97.6% on Level I. Performance is also strong on Level II, led by GPT-5 at 94.3%. On Level III, Gemini 2.5 Pro attains the highest score with 86.4% on multiple-choice questions while Gemini 3.0 Pro achieves 92.0% on constructed-response questions.
MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
Chen, Ruihan, Li, Qiming, Feng, Xiaocheng, Yang, Xiaoliang, Zhong, Weihong, Gu, Yuxuan, Zhou, Zekun, Qin, Bing
With the advancement of computational resources, Large Vision-Language Models (LVLMs) exhibit impressive Perception and Reasoning (P&R) performance on Graphical User Interface (GUI) tasks. However, although they demonstrate strong P&R capabilities in English GUI scenarios, their performance in multilingual settings has received little attention, which limits their global applicability. Moreover, existing studies on GUI tasks lack fine-grained analyses, such as of widget functions and elements' spatial relationships, which are fundamental for more targeted improvements. To tackle these issues, we propose MPR-GUI-Bench, a Multilingual fine-grained Perception and Reasoning GUI Benchmark to evaluate GUI agents' P&R capabilities. Evaluation results demonstrate that LVLMs exhibit significantly worse P&R performance in non-English languages than in English. To address these gaps, we propose GUI-XLI, a GUI Cross-Lingual Intervention method. Building on previous research showing that the hidden states of inputs in different languages differ significantly in the latent space, GUI-XLI applies interventions to the hidden states at P&R capability-related layers to mitigate the gaps between English and other languages. Experimental results indicate that our method improves GUI agents' multilingual P&R capability by 6.5% on average.
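Hidden-state interventions of this kind are often implemented as activation steering: estimate a direction from non-English toward English representations at a chosen layer, then shift hidden states along it at inference time. This is a generic sketch of that family of techniques, not the paper's actual GUI-XLI implementation:

```python
import numpy as np

def steering_vector(english_states, other_states):
    """Cross-lingual direction at one layer, estimated as the difference
    of mean hidden states: English minus non-English.
    Both inputs are arrays of shape (num_examples, hidden_dim)."""
    return np.mean(english_states, axis=0) - np.mean(other_states, axis=0)

def intervene(hidden, direction, alpha=1.0):
    """Shift a hidden state toward the English region of the latent
    space; alpha controls the intervention strength."""
    return hidden + alpha * direction
```

In practice such a shift would be applied via forward hooks at the layers identified as P&R-relevant; the sketch only shows the arithmetic of the intervention itself.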
More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering
Vu, Duc Anh, Nguyen, Thong, Nguyen, Cong-Duy, Nguyen, Viet Anh, Luu, Anh Tuan
With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face a key limitation: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of the possible answers, ultimately degrading the models' reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two stages: a reasoning generation stage, in which the model is prompted to produce supportive reasoning for each answer option, followed by a reasoning-guided agreement stage, in which the generated reasonings are synthesized to select the most plausible answer. In comprehensive evaluations, BiasPrompting demonstrates significant improvements on five widely used multiple-choice question answering benchmarks. Our experiments show that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
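The two stages described above can be sketched as a pair of prompts, one per option and then one adjudication call. The `ask_llm` callable and the prompt wording here are hypothetical placeholders, not the paper's released prompts:

```python
def bias_prompting(question, options, ask_llm):
    """Two-stage MCQ inference in the spirit of BiasPrompting.
    `ask_llm(prompt) -> str` is a hypothetical LLM call supplied by
    the caller."""
    # Stage 1: reasoning generation -- argue in favor of every option.
    reasonings = {
        opt: ask_llm(f"Question: {question}\n"
                     f"Argue why '{opt}' could be the correct answer.")
        for opt in options
    }
    # Stage 2: reasoning-guided agreement -- synthesize the arguments.
    summary = "\n".join(f"- {o}: {r}" for o, r in reasonings.items())
    return ask_llm(
        f"Question: {question}\n"
        f"Arguments for each option:\n{summary}\n"
        "Weigh the arguments and reply with the single most plausible option."
    )
```

Forcing an argument for every option first is what counteracts the "incomplete exploration" failure mode the abstract describes.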
Beyond Multiple Choice: Verifiable OpenQA for Robust Vision-Language RFT
Liu, Yesheng, Li, Hao, Xu, Haiyu, Pei, Baoqi, Wang, Jiahao, Zhao, Mingxuan, Zheng, Jingshu, He, Zheqi, Yao, JG, Qin, Bowen, Yang, Xi, Zhang, Jiajun
Multiple-choice question answering (MCQA) has been a popular format for evaluation and reinforcement fine-tuning (RFT) of modern multimodal language models. Its constrained output format allows simple, deterministic automatic verification. However, we find that the options may leak exploitable signals, which makes accuracy metrics unreliable indicators of real capability and encourages explicit or implicit answer-guessing behavior during RFT. We propose ReVeL (Rewrite and Verify by LLM), a framework that rewrites multiple-choice questions into open-form questions while keeping answers verifiable whenever possible. The framework categorizes questions by answer type and applies different rewriting and verification schemes accordingly. For RFT, we convert 20k MCQA examples and use GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA match MCQA accuracy on multiple-choice benchmarks and improve OpenQA accuracy by about six percentage points, indicating better data efficiency and more robust reward signals than MCQA-based training. When used for evaluation, ReVeL also reveals up to 20 percentage points of score inflation in MCQA benchmarks (relative to OpenQA), improves judging accuracy, and reduces both cost and latency. We will release code and data publicly.
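The core move, converting an MCQ into an open-form question whose gold answer remains mechanically checkable, can be illustrated for the simplest answer type (a short factual string). This is a hypothetical sketch of the idea, not the ReVeL codebase, which also handles other answer types and LLM-based verification:

```python
def rewrite_mcq(question, options, answer_key):
    """Turn an MCQ into an open-form question plus a verifier.
    Withholding the options removes letter-guessing shortcuts, while the
    gold answer string stays verifiable by normalized exact match."""
    gold = options[answer_key]
    open_question = question          # stem kept; options not shown

    def verify(model_answer):
        return model_answer.strip().lower() == gold.strip().lower()

    return open_question, gold, verify

q, gold, check = rewrite_mcq(
    "What is the capital of France?",
    {"A": "Paris", "B": "Rome", "C": "Berlin", "D": "Madrid"},
    "A",
)
```

A verifier like `check` is also what makes the format usable as a reward signal in GRPO-style RFT: it returns a deterministic correct/incorrect judgment without the options ever reaching the model.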
HEAD-QA v2: Expanding a Healthcare Benchmark for Reasoning
Correa-Guillén, Alexis, Gómez-Rodríguez, Carlos, Vilares, David
We introduce HEAD-QA v2, an expanded and updated version of a Spanish/English healthcare multiple-choice reasoning dataset originally released by Vilares and Gómez-Rodríguez (2019). The update responds to the growing need for high-quality datasets that capture the linguistic and conceptual complexity of healthcare reasoning. We extend the dataset to over 12,000 questions from ten years of Spanish professional exams, benchmark several open-source LLMs using prompting, RAG, and probability-based answer selection, and provide additional multilingual versions to support future work. Results indicate that performance is mainly driven by model scale and intrinsic reasoning ability, with complex inference strategies yielding limited gains. Together, these results establish HEAD-QA v2 as a reliable resource for advancing research on biomedical reasoning and model improvement.
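Probability-based answer selection, one of the baselines mentioned above, typically scores each option by the log-probability the model assigns to it as a continuation of the question. A generic sketch follows; `logprob_fn` is a hypothetical scorer standing in for a real model call, and the length normalization is one common variant rather than the paper's exact setup:

```python
def select_by_probability(question, options, logprob_fn):
    """Pick the MCQ option with the highest per-token score.
    `logprob_fn(prompt, text) -> float` is a hypothetical function
    returning the total log-probability of `text` continuing `prompt`.
    Dividing by length avoids a bias toward short options."""
    scores = {}
    for label, text in options.items():
        total = logprob_fn(f"{question} Answer: ", text)
        scores[label] = total / max(len(text.split()), 1)
    return max(scores, key=scores.get)
```

Unlike generation-based prompting, this method needs no output parsing, which makes it a common low-variance baseline for multiple-choice benchmarks.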