 excess return


Uncertainty-Adjusted Sorting for Asset Pricing with Machine Learning

Liu, Yan, Luo, Ye, Wang, Zigan, Zhang, Xiaowei

arXiv.org Machine Learning

A large and rapidly expanding literature demonstrates that machine learning (ML) methods substantially improve out-of-sample asset return prediction relative to conventional linear benchmarks, and that these statistical gains often translate into economically meaningful portfolio performance. Seminal contributions such as Gu et al. (2020) document large Sharpe ratio improvements from nonlinear learners in U.S. equities, while subsequent work extends these findings to stochastic discount factor estimation (Chen et al. 2024), international equity markets (Leippold et al. 2022), and bond return forecasting (Kelly et al. 2019, Bianchi et al. 2020). Collectively, this literature establishes ML as a powerful tool for extracting conditional expected returns in environments characterized by noisy signals, nonlinear interactions, and pervasive multicollinearity.


Limits To (Machine) Learning

Chen, Zhimin, Kelly, Bryan, Malamud, Semyon

arXiv.org Machine Learning

Machine learning (ML) methods are highly flexible, but their ability to approximate the true data-generating process is fundamentally constrained by finite samples. We characterize a universal lower bound, the Limits-to-Learning Gap (LLG), quantifying the unavoidable discrepancy between a model's empirical fit and the population benchmark. Recovering the true population $R^2$, therefore, requires correcting observed predictive performance by this bound. Using a broad set of variables, including excess returns, yields, credit spreads, and valuation ratios, we find that the implied LLGs are large. This indicates that standard ML approaches can substantially understate true predictability in financial data. We also derive LLG-based refinements to the classic Hansen and Jagannathan (1991) bounds, analyze implications for parameter learning in general-equilibrium settings, and show that the LLG provides a natural mechanism for generating excess volatility.
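For reference, the classic Hansen and Jagannathan (1991) bound mentioned above can be written in its textbook form as follows. The symbols $M$ (a stochastic discount factor) and $R^e$ (an excess return) are notation assumed here, not taken from the abstract, and the LLG-based refinement itself is not stated in the abstract, so it is not reproduced.

$$
\mathbb{E}[M R^e] = 0
\quad\Longrightarrow\quad
\frac{\sigma(M)}{\mathbb{E}[M]} \;\ge\; \frac{\lvert \mathbb{E}[R^e] \rvert}{\sigma(R^e)}
$$

The inequality follows from $\mathbb{E}[M]\,\mathbb{E}[R^e] = -\operatorname{cov}(M, R^e)$ and the fact that a correlation is at most one in absolute value, so the maximal Sharpe ratio attainable from excess returns bounds the volatility of any admissible $M$ from below.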


Diffolio: A Diffusion Model for Multivariate Probabilistic Financial Time-Series Forecasting and Portfolio Construction

Cho, So-Yoon, Kim, Jin-Young, Ban, Kayoung, Koo, Hyeng Keun, Kim, Hyun-Gyoon

arXiv.org Artificial Intelligence

Probabilistic forecasting is crucial in multivariate financial time-series for constructing efficient portfolios that account for complex cross-sectional dependencies. In this paper, we propose Diffolio, a diffusion model designed for multivariate financial time-series forecasting and portfolio construction. Diffolio employs a denoising network with a hierarchical attention architecture, comprising both asset-level and market-level layers. Furthermore, to better reflect cross-sectional correlations, we introduce a correlation-guided regularizer informed by a stable estimate of the target correlation matrix. This structure effectively extracts salient features not only from historical returns but also from asset-specific and systematic covariates, significantly enhancing the performance of forecasts and portfolios. Experimental results on the daily excess returns of 12 industry portfolios show that Diffolio outperforms various probabilistic forecasting baselines in multivariate forecasting accuracy and portfolio performance. Moreover, in portfolio experiments, portfolios constructed from Diffolio's forecasts show consistently robust performance, thereby outperforming those from benchmarks by achieving higher Sharpe ratios for the mean-variance tangency portfolio and higher certainty equivalents for the growth-optimal portfolio. These results demonstrate the superiority of our proposed Diffolio in terms of not only statistical accuracy but also economic significance.
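The two portfolio metrics named above can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so this assumes the common textbook ones: an annualized Sharpe ratio from periodic excess returns, and a log-utility certainty equivalent for the growth-optimal portfolio. The function names, the 252-day annualization factor, and the sample returns are all illustrative assumptions.

```python
import math
import statistics


def sharpe_ratio(excess_returns, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample std. dev."""
    mean = statistics.mean(excess_returns)
    vol = statistics.stdev(excess_returns)  # sample standard deviation (n - 1)
    return math.sqrt(periods_per_year) * mean / vol


def log_utility_certainty_equivalent(simple_returns):
    """Constant per-period return that gives the same average log growth."""
    avg_log_growth = statistics.mean(math.log1p(r) for r in simple_returns)
    return math.expm1(avg_log_growth)


if __name__ == "__main__":
    # Made-up daily excess returns, for illustration only.
    daily = [0.004, -0.002, 0.003, 0.001, -0.001]
    print(f"Sharpe ratio: {sharpe_ratio(daily):.2f}")
    print(f"Certainty equivalent: {log_utility_certainty_equivalent(daily):.5f}")
```

Under these definitions a higher value is better for both metrics, which matches how the abstract compares Diffolio with its benchmarks.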


Structure Over Signal: A Globalized Approach to Multi-relational GNNs for Stock Prediction

Li, Amber, Abil, Aruzhan, Oda, Juno Marques

arXiv.org Artificial Intelligence

In financial markets, Graph Neural Networks have been successfully applied to modeling relational data, effectively capturing nonlinear inter-stock dependencies. Yet, existing models often fail to efficiently propagate messages during macroeconomic shocks. In this paper, we propose OmniGNN, an attention-based multi-relational dynamic GNN that integrates macroeconomic context via heterogeneous node and edge types for robust message passing. Central to OmniGNN is a sector node acting as a global intermediary, enabling rapid shock propagation across the graph without relying on long-range multi-hop diffusion. The model leverages Graph Attention Networks (GAT) to weigh neighbor contributions and employs Transformers to capture temporal dynamics across multiplex relations. Experiments show that OmniGNN outperforms existing stock prediction models on public datasets, particularly demonstrating strong robustness during the COVID-19 period.


ELATE: Evolutionary Language model for Automated Time-series Engineering

Murray, Andrew, Dervovic, Danial, Cashmore, Michael

arXiv.org Artificial Intelligence

Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.


The Uncertainty of Machine Learning Predictions in Asset Pricing

Liao, Yuan, Ma, Xinjie, Neuhierl, Andreas, Schilling, Linda

arXiv.org Machine Learning

Recently, machine learning (ML) models have gained prominence in predicting asset returns, selecting portfolios, and estimating stochastic discount factors, with significant success in these areas. ML techniques, by capturing complex and nonlinear relationships in financial data, are particularly well-suited for enhancing portfolio management decisions. For example, within the mean-variance portfolio framework, ML methods are increasingly used to estimate expected returns and (co)variances, often leading to more effective portfolio allocations. The literature consistently demonstrates the effectiveness of machine learning in these and other applications (e.g., Gu, Kelly, and Xiu (2020); Bianchi, Büchner, and Tamoni (2021); Cong, Tang, Wang, and Zhang (2021); Kelly, Malamud, and Zhou (2021); Patton and Weller (2022); Didisheim, Ke, Kelly, and Malamud (2023); Filipovic and Schneider (2024)). Despite the success of machine learning in asset pricing, existing literature typically treats ML predictions as point estimates and conducts asset pricing analyses as if they were true values, overlooking the associated uncertainty. This is surprising, given that uncertainty about input parameters is widely acknowledged as critical in portfolio selection (e.g., DeMiguel, Garlappi, and Uppal (2009)), and Garlappi, Uppal, and Wang (2007) show that incorporating forecast uncertainty in mean-variance portfolio allocation leads to distinct economic insights. However, quantifying prediction uncertainty in ML forecasts, particularly with neural networks, remains a complex challenge, limiting their broader application in asset pricing.


Analyst Reports and Stock Performance: Evidence from the Chinese Market

Liu, Rui, Liang, Jiayou, Chen, Haolong, Hu, Yujia

arXiv.org Artificial Intelligence

This article applies natural language processing (NLP) to extract and quantify textual information to predict stock performance. Using an extensive dataset of Chinese analyst reports and a customized BERT deep learning model for Chinese text, this study categorizes the sentiment of the reports as positive, neutral, or negative. The findings underscore the predictive capacity of this sentiment indicator for stock volatility, excess returns, and trading volume. Specifically, analyst reports with strong positive sentiment increase excess returns and intraday volatility, whereas reports with strong negative sentiment also increase volatility and trading volume but decrease future excess returns. The magnitude of this effect is greater for positive sentiment reports than for negative sentiment reports. This article contributes to the empirical literature on sentiment analysis and on the stock market's response to news in China.


AAPM: Large Language Model Agent-based Asset Pricing Models

Cheng, Junyan, Chin, Peter

arXiv.org Artificial Intelligence

In this study, we propose a novel asset pricing approach, LLM Agent-based Asset Pricing Models (AAPM), which fuses qualitative discretionary investment analysis from LLM agents and quantitative manual financial economic factors to predict excess asset returns. The experimental results show that our approach outperforms machine learning-based asset pricing baselines in portfolio optimization and asset pricing errors. Specifically, the Sharpe ratio and average $|\alpha|$ for anomaly portfolios improved significantly by 9.6\% and 10.8\% respectively. In addition, we conducted extensive ablation studies on our model and analysis of the data to reveal further insights into the proposed method.
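The average $|\alpha|$ metric reported above can be sketched as follows. The abstract does not say which factor model produces the alphas, so this minimal example assumes a single-factor (market) regression fitted by ordinary least squares. The function names and the synthetic returns are hypothetical.

```python
import statistics


def capm_alpha(portfolio_excess, market_excess):
    """OLS intercept (alpha) and slope (beta) of portfolio on market excess returns."""
    mean_p = statistics.mean(portfolio_excess)
    mean_m = statistics.mean(market_excess)
    cov = sum((p - mean_p) * (m - mean_m)
              for p, m in zip(portfolio_excess, market_excess))
    var = sum((m - mean_m) ** 2 for m in market_excess)
    beta = cov / var
    return mean_p - beta * mean_m, beta


def mean_absolute_alpha(portfolios, market_excess):
    """Average |alpha| across a collection of test (e.g. anomaly) portfolios."""
    alphas = [capm_alpha(p, market_excess)[0] for p in portfolios]
    return statistics.mean(abs(a) for a in alphas)


if __name__ == "__main__":
    # Synthetic data: each portfolio is an exact linear function of the market.
    market = [0.01, -0.02, 0.03, 0.00]
    overpriced = [-0.03 + 0.5 * m for m in market]   # alpha = -0.03
    underpriced = [0.01 + 2.0 * m for m in market]   # alpha = +0.01
    print(mean_absolute_alpha([overpriced, underpriced], market))
```

A smaller average $|\alpha|$ means the model leaves less of the test portfolios' returns unexplained, which is the sense in which the abstract reports the metric as improved.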


From attention to profit: quantitative trading strategy based on transformer

Zhang, Zhaofeng, Chen, Banghao, Zhu, Shengxin, Langrené, Nicolas

arXiv.org Artificial Intelligence

In traditional quantitative trading practice, navigating the complicated and dynamic financial market presents a persistent challenge. Earlier machine learning approaches have struggled to fully capture the variety of market variables, often ignoring long-term information and missing essential signals that may lead to profit. This paper introduces an enhanced transformer architecture and designs a novel factor based on the model. Through transfer learning from sentiment analysis, the proposed model not only retains its inherent advantages in capturing long-range dependencies and modelling complex data relationships but can also handle tasks with numerical inputs and accurately forecast future returns over a period. This work collects more than 5,000,000 rolling data points for 4,601 stocks in the Chinese capital market from 2010 to 2019. The results demonstrate the model's superior performance in predicting stock trends compared with 100 other factor-based quantitative strategies, with lower turnover rates and a more robust half-life period. Notably, the model's innovative use of a transformer to construct factors, in conjunction with market sentiment information, is shown to enhance the accuracy of trading signals significantly, offering promising implications for the future of quantitative trading strategies.


Leveraging Large Language Model for Automatic Evolving of Industrial Data-Centric R&D Cycle

Yang, Xu, Yang, Xiao, Liu, Weiqing, Li, Jinhui, Yu, Peng, Ye, Zeqi, Bian, Jiang

arXiv.org Artificial Intelligence

In the wake of relentless digital transformation, data-driven solutions are emerging as powerful tools to address multifarious industrial tasks such as forecasting, anomaly detection, planning, and even complex decision-making. Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources. This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D. Assessing the foundational elements of data-centric R&D, including heterogeneous task-related data, multi-faceted domain knowledge, and diverse computing-functional tools, we explore how well LLMs can understand domain-specific requirements, generate professional ideas, utilize domain-specific tools to conduct experiments, interpret results, and incorporate knowledge from past endeavors to tackle new challenges. We take quantitative investment research as a typical example of an industrial data-centric R&D scenario, verify our proposed framework on our full-stack open-source quantitative research platform Qlib, and obtain promising results that shed light on our vision of an automatically evolving industrial data-centric R&D cycle.