Dai, Yongfu
FinBen: A Holistic Financial Benchmark for Large Language Models
Xie, Qianqian, Han, Weiguang, Chen, Zhengyu, Xiang, Ruoyu, Zhang, Xiao, He, Yueru, Xiao, Mengxi, Li, Dong, Dai, Yongfu, Feng, Duanyu, Xu, Yijing, Kang, Haoqiang, Kuang, Ziyan, Yuan, Chenhan, Yang, Kailai, Luo, Zheheng, Zhang, Tianlin, Liu, Zhiwei, Xiong, Guojun, Deng, Zhiyang, Jiang, Yuechen, Yao, Zhiyuan, Li, Haohang, Yu, Yangyang, Hu, Gang, Huang, Jiajia, Liu, Xiao-Yang, Lopez-Lira, Alejandro, Wang, Benyou, Lai, Yanzhao, Wang, Hao, Peng, Min, Ananiadou, Sophia, Huang, Jimin
LLMs have transformed NLP and shown promise in various fields, yet their potential in finance remains underexplored due to the lack of comprehensive evaluation benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 36 datasets spanning 24 financial tasks and covering seven critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, and decision-making. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluations, and three novel open-source evaluation datasets for text summarization, question answering, and stock trading. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: while LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs.
Empowering Many, Biasing a Few: Generalist Credit Scoring through Large Language Models
Feng, Duanyu, Dai, Yongfu, Huang, Jimin, Zhang, Yifang, Xie, Qianqian, Han, Weiguang, Lopez-Lira, Alejandro, Wang, Hao
In the financial industry, credit scoring is a fundamental element, shaping access to credit and determining the terms of loans for individuals and businesses alike. Traditional credit scoring methods, however, often grapple with challenges such as a narrow knowledge scope and the isolated evaluation of credit tasks. Our work posits that Large Language Models (LLMs) have great potential for credit scoring tasks, with strong generalization ability across multiple tasks. To systematically explore LLMs for credit scoring, we propose the first open-source comprehensive framework. We curate a novel benchmark covering 9 datasets with 14K samples, tailored for credit assessment and a critical examination of potential biases within LLMs, together with a novel instruction-tuning dataset of over 45K samples. We then propose the first Credit and Risk Assessment Large Language Model (CALM) through instruction tuning, tailored to the nuanced demands of various financial risk assessment tasks. We evaluate CALM and existing state-of-the-art (SOTA) open-source and closed-source LLMs on the built benchmark. Our empirical results illuminate the capability of LLMs to not only match but surpass conventional models, pointing towards a future where credit scoring can be more inclusive, comprehensive, and unbiased. We contribute to the industry's transformation by sharing our pioneering instruction-tuning datasets, credit and risk assessment LLM, and benchmarks with the research community and the financial industry.
LAiW: A Chinese Legal Large Language Models Benchmark (A Technical Report)
Dai, Yongfu, Feng, Duanyu, Huang, Jimin, Jia, Haochen, Xie, Qianqian, Zhang, Yifang, Han, Weiguang, Tian, Wei, Wang, Hao
With the emergence of numerous legal LLMs, no comprehensive benchmark currently exists for evaluating their legal abilities. In this paper, we propose the first Chinese legal LLMs benchmark based on legal capabilities. Through the collaborative efforts of legal and artificial intelligence experts, we divide the legal capabilities of LLMs into three levels: basic legal NLP capability, basic legal application capability, and complex legal application capability. We have completed the first phase of evaluation, which focuses mainly on basic legal NLP capability. The evaluation results show that although some legal LLMs perform better than their backbones, there is still a gap compared to ChatGPT. Our benchmark can be found at URL.