A Additional Results
The acronym dataset is a QA task that requires models to decode financial acronyms. The FinMA-7B-full model achieved the highest ROUGE-1 score of 0.12.

B Datasheet
B.1 Why was the datasheet created?
B.2 Has the dataset been used already? If so, where are the results so others can compare (e.g., links to published papers)?
Yes, the dataset has already been used. It was employed in the FinLLM Shared Task at the FinNLP-AgentScen Workshop during IJCAI 2024, known as the FinLLM Challenge.
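ROUGE-1, the metric reported above, measures unigram overlap between a model's output and the reference answer. As a minimal sketch (not the official ROUGE toolkit, which also applies stemming and bootstrap aggregation), the F1 variant can be computed like this:

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 between a candidate and a reference string."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped unigram overlap via multiset intersection.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical acronym-expansion example (not from the dataset).
print(round(rouge_1("earnings before interest and taxes",
                    "earnings before interest and taxes (EBIT)"), 2))  # → 0.91
```

A score of 0.12 on this scale indicates only sparse word overlap with the gold expansions, underscoring how hard the acronym task remains even for the best model.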
FinBen: A Holistic Financial Benchmark for Large Language Models
LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 42 datasets spanning 24 financial tasks, covering eight critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, decision-making, and bilingual (English and Spanish). FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and two novel datasets for regulations and stock trading. Our evaluation of 21 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovations in financial LLMs. All datasets and code are publicly available for the research community, with results shared and updated regularly on the Open Financial LLM Leaderboard.
Authors: Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang