Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III

Pranam Shetty, Abhisek Upadhayaya, Parth Mitesh Shah, Srikanth Jagabathula, Shilpi Nayak, Anna Joo Fee

arXiv.org Artificial Intelligence 

As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. For advanced financial reasoning, the Chartered Financial Analyst (CFA) Level III exam is widely considered the gold standard. In this paper, we present a comprehensive benchmark evaluating 23 state-of-the-art LLMs on mock CFA Level III exams, which consist of challenging multiple-choice and essay questions. We evaluate reasoning and non-reasoning models, both proprietary and open-source, using three prompting strategies: zero-shot, chain-of-thought, and self-discover. We find that frontier reasoning models, such as o4-mini, Gemini 2.5 Pro, and Claude Opus 4, demonstrate strong capabilities when using chain-of-thought prompting, passing the mock Level III exams. While there is little to separate the frontier models on multiple-choice questions, only a few models excel at the complex essay questions, which require analysis, synthesis, and strategic thinking. These results demonstrate significant progress in the financial reasoning capabilities of LLMs: in earlier work [13], models could clear the Level I and Level II exams but struggled with the Level III exam, particularly the essay questions.