MultiFinBen: Benchmarking Large Language Models for Multilingual and Multimodal Financial Application

Peng, Xueqing, Qian, Lingfei, Wang, Yan, Xiang, Ruoyu, He, Yueru, Ren, Yang, Jiang, Mingyang, Zhang, Vincent Jim, Guo, Yuqing, Zhao, Jeff, He, Huan, Han, Yi, Feng, Yun, Jiang, Yuechen, Cao, Yupeng, Li, Haohang, Yu, Yangyang, Wang, Xiaoyu, Gao, Penglei, Lin, Shengyuan, Wang, Keyi, Yang, Shanshan, Zhao, Yilun, Liu, Zhiwei, Lu, Peng, Huang, Jerry, Wang, Suyuchen, Papadopoulos, Triantafillos, Giannouris, Polydoros, Soufleri, Efstathia, Chen, Nuo, Deng, Zhiyang, Fu, Heming, Zhao, Yijia, Lin, Mingquan, Qiu, Meikang, Smith, Kaleb E, Cohan, Arman, Liu, Xiao-Yang, Huang, Jimin, Xiong, Guojun, Lopez-Lira, Alejandro, Chen, Xi, Tsujii, Junichi, Nie, Jian-Yun, Ananiadou, Sophia, Xie, Qianqian

Oct-14-2025, arXiv.org Artificial Intelligence

Real-world financial analysis involves information across multiple languages and modalities, from reports and news to scanned filings and meeting recordings. Yet most existing evaluations of LLMs in finance remain text-only, monolingual, and largely saturated by current models. To bridge these gaps, we present MultiFinBen, the first expert-annotated multilingual (five languages) and multimodal (text, vision, audio) benchmark for evaluating LLMs in realistic financial contexts. MultiFinBen introduces two new task families: multilingual financial reasoning, which tests cross-lingual evidence integration from filings and news, and financial OCR, which extracts structured text from scanned documents containing tables and charts. Rather than aggregating all available datasets, we apply a structured, difficulty-aware selection based on advanced model performance, ensuring balanced challenge and removing redundant tasks. Evaluating 21 leading LLMs shows that even frontier multimodal models like GPT-4o achieve only 46.01% overall, stronger on vision and audio but dropping sharply in multilingual settings. These findings expose persistent limitations in multilingual, multimodal, and expert-level financial reasoning. All datasets, evaluation scripts, and leaderboards are publicly released.

large language model, machine learning, natural language

Country:
- Europe (1.00)
- Asia (1.00)
- North America > United States (0.92)

Genre:
- Financial News (1.00)
- Research Report > New Finding (0.45)

Industry:
- Government (1.00)
- Banking & Finance > Trading (1.00)
- Information Technology > Security & Privacy (0.92)
- Law > Business Law (0.92)
- Health & Medicine (0.92)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
