 Banking & Finance


A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data

Neural Information Processing Systems

Tabular data is prevalent in real-world machine learning applications, and new models for supervised learning on tabular data are frequently proposed. Comparative studies assessing the performance of such models typically rely on model-centric evaluation setups with overly standardized data preprocessing. This paper demonstrates that such model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing, including feature engineering. We therefore propose a data-centric evaluation framework. We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset. We conduct experiments with different preprocessing pipelines and hyperparameter optimization (HPO) regimes to quantify the impact of model selection, HPO, feature engineering, and test-time adaptation. Our main findings are: 1. After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection diminishes.
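To make the contrast concrete, here is a minimal sketch, assuming scikit-learn, a toy dataset, and a hypothetical debt-to-income feature, of the kind of comparison such a framework performs: the same model scored with a standardized preprocessing pipeline versus a dataset-specific one that adds a hand-crafted feature. It is not the paper's code.

    # Minimal sketch (not the paper's code): the same model scored with a
    # standardized preprocessing pipeline versus a dataset-specific one that
    # adds a hand-crafted feature (here a hypothetical debt-to-income ratio).
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

    # Toy stand-in for a Kaggle-style tabular dataset.
    df = pd.DataFrame({
        "income": [40, 85, 120, 30, 64, 95, 22, 70],
        "debt":   [10, 20, 90, 5, 40, 10, 15, 35],
        "region": ["a", "b", "a", "c", "b", "a", "c", "b"],
        "default": [0, 0, 1, 0, 1, 0, 1, 0],
    })
    X, y = df.drop(columns="default"), df["default"]

    def add_domain_features(X):
        # Hypothetical expert-level feature engineering: debt-to-income ratio.
        X = X.copy()
        X["dti"] = X["debt"] / X["income"]
        return X

    encode = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["region"])],
        remainder="passthrough",
    )
    standardized = Pipeline([
        ("enc", encode), ("model", GradientBoostingClassifier())
    ])
    data_centric = Pipeline([
        ("fe", FunctionTransformer(add_domain_features)),
        ("enc", encode),
        ("model", GradientBoostingClassifier()),
    ])

    for name, pipe in [("standardized", standardized), ("dataset-specific", data_centric)]:
        auc = cross_val_score(pipe, X, y, cv=2, scoring="roc_auc").mean()
        print(f"{name}: CV ROC-AUC = {auc:.3f}")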


A Additional Results

Neural Information Processing Systems

The performance analysis of other LLMs on the acronym and regulations tasks, as shown in Tables 1 and 2, provides valuable insights into their capabilities. The acronym dataset is a QA task that requires models to decode financial acronyms. Despite not having seen this task before, FinMA, a financial LLM specially trained on financial tasks, performed exceptionally well: the FinMA-7B-full model achieved the highest ROUGE-1 score of 0.12 and the highest BERTScore of 0.73, even surpassing GPT-4. This indicates that financial-specific models can leverage their domain knowledge effectively, even on short QA tasks such as the acronym dataset. The regulations dataset, on the other hand, involves answering intricate questions related to financial regulations such as EMIR. These questions are long, complex, and difficult to understand, posing a significant challenge. In this scenario, the LLaMA2-70b-chat model stood out with a ROUGE-1 score of 0.30 and a BERTScore of 0.68, highlighting its ability to handle complex regulatory questions. This underscores the importance of model size and capability when dealing with more demanding and sophisticated tasks in the financial domain.

B.1 Why was the datasheet created?

FinBen was created to address the gap in comprehensive benchmarks and evaluation studies of large language models within the financial domain. Despite the proven capabilities of LLMs such as GPT-4 in transforming various fields, including finance, a detailed understanding of their potential and limitations specific to finance is still lacking. This is partly due to the complex and specialized nature of financial tasks, which necessitates targeted datasets for thorough analysis. By evaluating 42 datasets covering 24 financial tasks, we aim to provide a robust benchmark that allows researchers and practitioners to assess the effectiveness of LLMs in financial text analysis and prediction tasks more accurately and reliably.
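For reference, the ROUGE-1 and BERTScore figures reported above are typically computed along the following lines. This is a brief sketch using the common rouge-score and bert-score packages with a made-up acronym example; it is not FinBen's evaluation harness.

    # Illustrative sketch of the metrics reported above, using the common
    # rouge-score and bert-score packages; the acronym example is made up and
    # this is not FinBen's exact evaluation harness.
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    prediction = "EMIR stands for the European Market Infrastructure Regulation."
    reference = "EMIR is the European Market Infrastructure Regulation."

    # ROUGE-1: unigram overlap between the model output and the reference.
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge1 = scorer.score(reference, prediction)["rouge1"].fmeasure

    # BERTScore: token similarity in a contextual embedding space.
    _, _, f1 = bert_score([prediction], [reference], lang="en")

    print(f"ROUGE-1 F1 = {rouge1:.2f}, BERTScore F1 = {f1.item():.2f}")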


FinBen: A Holistic Financial Benchmark for Large Language Models

Neural Information Processing Systems

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 42 datasets spanning 24 financial tasks and covering eight critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, decision-making, and bilingual tasks (English and Spanish). FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluations, and two new datasets for regulations and stock trading. Our evaluation of 21 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings: while LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks such as text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovation in financial LLMs.


Group Retention when Using Machine Learning in Sequential Decision Making: the Interplay between User Dynamics and Fairness

Mohammad Mahdi Khalili

Neural Information Processing Systems

Machine Learning (ML) models trained on data from multiple demographic groups can inherit the representation disparity [7] that may exist in the data: the model may be less favorable to groups that contribute less to the training process, which in turn can degrade population retention in these groups over time and exacerbate representation disparity in the long run. In this study, we seek to understand the interplay between ML decisions and the underlying group representation, how they evolve in a sequential framework, and how the use of fairness criteria plays a role in this process. We show that representation disparity can easily worsen over time under a natural user dynamics (arrival and departure) model when decisions are made based on a commonly used objective and fairness criteria, resulting in some groups diminishing entirely from the sample pool in the long run. This highlights the fact that fairness criteria have to be defined while taking into consideration the impact of decisions on user dynamics. Toward this end, we explain how a proper fairness criterion can be selected based on a general user dynamics model.
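The feedback loop described here can be illustrated with a toy simulation. The dynamics below are hypothetical (acceptance rising with a group's share of the pool, rejected users departing at a higher rate) and are not the paper's model, but they show how an initial representation gap can widen round after round.

    # Illustrative toy simulation, not the paper's dynamics model: acceptance
    # by the shared model rises with a group's share of the sample pool,
    # rejected users leave at a higher rate, and the representation gap widens.
    pool = {"A": 1000.0, "B": 800.0}   # hypothetical group sizes in the pool

    for t in range(61):
        total = pool["A"] + pool["B"]
        shares = {g: pool[g] / total for g in pool}
        for g in pool:
            accept = 0.4 + 0.5 * shares[g]             # bigger share, more favorable decisions
            departures = 0.3 * pool[g] * (1 - accept)  # rejected users tend to leave
            arrivals = 50.0 * accept                   # favorable treatment attracts arrivals
            pool[g] += arrivals - departures
        if t % 20 == 0:
            share_b = pool["B"] / (pool["A"] + pool["B"])
            print(f"round {t:2d}: group B share of pool = {share_b:.2f}")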


ac4106bcfff33140de7799d03daeb8a4-Paper-Conference.pdf

Neural Information Processing Systems

Cross-Validation (CV) is the default choice for estimating the out-of-sample performance of machine learning models. Despite its wide usage, its statistical benefits have remained half-understood, especially in challenging nonparametric regimes. In this paper, we fill in this gap and show that, in terms of estimating out-of-sample performance, for a wide spectrum of models CV does not statistically outperform the simple "plug-in" approach, where one reuses the training data for testing evaluation. Specifically, in terms of both the asymptotic bias and the coverage accuracy of the associated interval for out-of-sample evaluation, K-fold CV provably cannot outperform plug-in regardless of the rate at which the parametric or nonparametric models converge. Leave-one-out CV can have a smaller bias than plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases leave-one-out again does not outperform plug-in once this variability is taken into account. We obtain our theoretical comparisons via a novel higher-order Taylor analysis that dissects the limit theorems of testing evaluations and applies to model classes that are not amenable to previously known sufficient conditions. Our numerical results demonstrate that plug-in indeed performs no worse than CV in estimating model performance across a wide range of examples.
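As a rough illustration of the two estimators being compared, the following sketch (synthetic ridge-regression data with scikit-learn; not the paper's experiments) computes the plug-in estimate, the 5-fold CV estimate, and a large-test-set proxy for the true out-of-sample error.

    # Minimal sketch (synthetic data, not the paper's experiments): the plug-in
    # estimate reuses training data for evaluation, K-fold CV averages held-out
    # errors, and a large fresh sample proxies the true out-of-sample error.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n, d = 200, 5
    beta = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ beta + rng.normal(scale=0.5, size=n)

    # Plug-in: fit on all data, then evaluate on the same data.
    model = Ridge(alpha=1.0).fit(X, y)
    plug_in = mean_squared_error(y, model.predict(X))

    # 5-fold CV: average error on held-out folds.
    cv = -cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                          scoring="neg_mean_squared_error").mean()

    # Large fresh sample as a proxy for the true out-of-sample error.
    X_new = rng.normal(size=(50_000, d))
    y_new = X_new @ beta + rng.normal(scale=0.5, size=50_000)
    oos = mean_squared_error(y_new, model.predict(X_new))

    print(f"plug-in: {plug_in:.3f}  5-fold CV: {cv:.3f}  out-of-sample: {oos:.3f}")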


The Fairness of Risk Scores Beyond Classification: Bipartite Ranking and the xAUC Metric

Neural Information Processing Systems

Where machine-learned predictive risk scores inform high-stakes decisions, such as bail and sentencing in criminal justice, fairness has been a serious concern. Recent work has characterized the disparate impact that such risk scores can have when used for a binary classification task. This may not account, however, for the more diverse downstream uses of risk scores and their non-binary nature. To better account for this, we investigate the fairness of predictive risk scores from the point of view of a bipartite ranking task, where one seeks to rank positive examples above negative ones. We introduce the xAUC disparity as a metric for assessing the disparate impact of risk scores and define it as the difference between the probability of ranking a random positive example from one protected group above a random negative example from the other group and the probability of the reverse. We provide a decomposition of the bipartite ranking loss into components that capture this cross-group discrepancy and components that capture pure within-group predictive ability. We use xAUC analysis to audit predictive risk scores for recidivism prediction, income prediction, and cardiac arrest prediction, where it reveals disparities that are not evident from simply comparing within-group predictive performance.
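A direct way to read this definition is as a difference of two cross-group AUCs. The sketch below (NumPy, with made-up risk scores and group labels; not the authors' code) computes the xAUC disparity exactly as that difference.

    # Minimal sketch of the xAUC disparity as defined above, on made-up risk
    # scores for two protected groups; this is not the authors' implementation.
    import numpy as np

    def xauc(pos_scores, neg_scores):
        # Probability that a random positive outranks a random negative
        # (ties counted as one half): a cross-group AUC.
        diff = pos_scores[:, None] - neg_scores[None, :]
        return ((diff > 0) + 0.5 * (diff == 0)).mean()

    def xauc_disparity(scores, labels, groups):
        scores, labels, groups = map(np.asarray, (scores, labels, groups))
        pos_a = scores[(labels == 1) & (groups == "a")]
        neg_a = scores[(labels == 0) & (groups == "a")]
        pos_b = scores[(labels == 1) & (groups == "b")]
        neg_b = scores[(labels == 0) & (groups == "b")]
        # xAUC(a, b) minus xAUC(b, a).
        return xauc(pos_a, neg_b) - xauc(pos_b, neg_a)

    scores = [0.9, 0.7, 0.65, 0.4, 0.8, 0.6, 0.5, 0.1]   # hypothetical risk scores
    labels = [1, 1, 0, 0, 1, 1, 0, 0]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(f"xAUC disparity: {xauc_disparity(scores, labels, groups):+.2f}")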


Improved Analysis for Bandit Learning in Matching Markets

Neural Information Processing Systems

A rich line of work studies the bandit learning problem in two-sided matching markets, where participants on one side of the market (players) are uncertain about their preferences and hope to find a stable matching through iterative matchings with the other side (arms).
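One common baseline version of this setting, sketched below with made-up preference values, has players maintain UCB estimates of their unknown arm utilities and recomputes a matching each round by player-proposing deferred acceptance against the arms' known preferences. This illustrates the problem setup only; it is not the algorithm analyzed in the paper.

    # Illustrative sketch of the setting (not the paper's algorithm): players
    # keep UCB estimates of their unknown arm utilities from noisy rewards, and
    # each round a matching is recomputed by player-proposing deferred
    # acceptance against the arms' known preferences. All numbers are made up.
    import math
    import random

    random.seed(0)
    n_players, n_arms, horizon = 3, 3, 2000
    true_mean = [[0.9, 0.5, 0.2], [0.6, 0.8, 0.3], [0.4, 0.5, 0.7]]  # player x arm
    arm_pref = [[0, 1, 2], [1, 0, 2], [2, 1, 0]]                     # each arm ranks players

    est = [[0.0] * n_arms for _ in range(n_players)]
    cnt = [[0] * n_arms for _ in range(n_players)]

    def deferred_acceptance(player_rank):
        nxt = [0] * n_players
        match = {}                      # arm -> player currently held
        free = list(range(n_players))
        while free:
            p = free.pop()
            a = player_rank[p][nxt[p]]
            nxt[p] += 1
            if a not in match:
                match[a] = p
            elif arm_pref[a].index(p) < arm_pref[a].index(match[a]):
                free.append(match[a])
                match[a] = p
            else:
                free.append(p)
        return {p: a for a, p in match.items()}

    for t in range(1, horizon + 1):
        ucb = [[est[p][a] + math.sqrt(2 * math.log(t) / max(cnt[p][a], 1e-9))
                for a in range(n_arms)] for p in range(n_players)]
        rank = [sorted(range(n_arms), key=lambda a: -ucb[p][a]) for p in range(n_players)]
        for p, a in deferred_acceptance(rank).items():
            reward = true_mean[p][a] + random.gauss(0, 0.1)
            cnt[p][a] += 1
            est[p][a] += (reward - est[p][a]) / cnt[p][a]

    most_pulled = [max(range(n_arms), key=lambda a: cnt[p][a]) for p in range(n_players)]
    print("most frequently matched arm per player:", most_pulled)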



Towards Application-Driven and Comprehensive Data Analysis via Code Generation

Neural Information Processing Systems

Data analysis is a crucial process for deriving insights from real-world databases. As shown in Figure 1, the need for data analysis typically arises from specific application scenarios and requires diverse reasoning skills, including mathematical, logical, and strategic reasoning. Existing work often focuses on simple factual retrieval or arithmetic resolution and is therefore insufficient for addressing complex real-world queries. This work aims to propose new resources and benchmarks for this crucial yet challenging and under-explored task. Due to the prohibitively high cost of collecting expert annotations, we use large language models (LLMs) enhanced by code generation to automatically generate high-quality data analyses, which are later refined by human annotators.
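The general pattern of code-generation-based analysis can be sketched as follows; the prompt wording, the generate_with_llm call, and the tiny DataFrame are hypothetical stand-ins rather than the paper's actual pipeline.

    # Hedged sketch of the general code-generation pattern described above; the
    # prompt wording, the generate_with_llm call, and the tiny DataFrame are
    # hypothetical stand-ins, not the paper's pipeline.
    import pandas as pd

    def build_prompt(df: pd.DataFrame, question: str) -> str:
        schema = ", ".join(f"{c} ({t})" for c, t in df.dtypes.astype(str).items())
        return (
            "You are a data analyst. Write Python (pandas) code that answers the\n"
            f"question below using a DataFrame named `df` with columns: {schema}.\n"
            "Store the final result in a variable named `answer`.\n\n"
            f"Question: {question}\n"
        )

    def run_generated_analysis(df: pd.DataFrame, code: str):
        # Execute the generated snippet in an isolated namespace; a real system
        # would sandbox and validate the code before running it.
        scope = {"df": df, "pd": pd}
        exec(code, scope)
        return scope.get("answer")

    df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "revenue": [120, 90, 150]})
    prompt = build_prompt(df, "Which month had the highest revenue?")
    # code = generate_with_llm(prompt)                         # hypothetical LLM call
    code = "answer = df.loc[df['revenue'].idxmax(), 'month']"  # stand-in for model output
    print(run_generated_analysis(df, code))                    # -> Mar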


CHIP: A Hawkes Process Model for Continuous-time Networks with Scalable and Consistent Estimation

Neural Information Processing Systems

In many application settings involving networks, such as messages between users of an online social network or transactions between traders in financial markets, the observed data consist of timestamped relational events, which form a continuous-time network. We propose the Community Hawkes Independent Pairs (CHIP) generative model for such networks. We show that applying spectral clustering to an aggregated adjacency matrix constructed from the CHIP model provides consistent community detection for a growing number of nodes and a growing time duration. We also develop consistent and computationally efficient estimators for the model parameters. We demonstrate that the proposed CHIP model and estimation procedure scale to large networks with tens of thousands of nodes and provide better fits than existing continuous-time network models on several real networks.
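The community-detection step can be illustrated by aggregating events into a count matrix and clustering it. The sketch below uses synthetic two-block event counts and scikit-learn's SpectralClustering (a Laplacian-based variant rather than the exact adjacency spectral clustering in the paper); it is not the CHIP estimation code.

    # Minimal sketch of the community-detection step described above: aggregate
    # timestamped events into a count adjacency matrix and cluster it. Synthetic
    # two-block counts stand in for a real event stream; this is not the paper's
    # estimation procedure.
    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    n = 60
    truth = np.repeat([0, 1], n // 2)

    # counts[i, j] ~ number of events from node i to node j, denser within
    # communities than between them.
    rate = np.where(truth[:, None] == truth[None, :], 3.0, 0.3)
    counts = rng.poisson(rate)
    np.fill_diagonal(counts, 0)

    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(counts + counts.T)
    agreement = max((labels == truth).mean(), (labels != truth).mean())
    print(f"community recovery agreement: {agreement:.2f}")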