Benchmark Datasets for Lead-Lag Forecasting on Social Platforms
Kazemian, Kimia; Liu, Zhenzhen; Yang, Yangfanyu; Luo, Katie Z.; Gu, Shuhan; Du, Audrey; Yang, Xinyu; Jansons, Jack; Weinberger, Kilian Q.; Thickstun, John; Yin, Yian; Dean, Sarah
arXiv.org Artificial Intelligence
Social and collaborative platforms emit multivariate time-series traces in which early interactions--such as views, likes, or downloads--are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, we present two high-volume benchmark datasets--arXiv (accesses → citations of 2.3M papers) and GitHub (pushes/stars → forks of 3M repositories)--and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn (profile views → messages). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We document all technical details of data curation and cleaning, verify the presence of lead-lag dynamics through statistical and classification tests, and benchmark parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data.

The success of human activities is often measured by their collective impact, ranging from music streams and movie box office revenues to product sales and social media popularity. These impact metrics typically follow heavy-tailed distributions (Clauset et al., 2009) and slow decay patterns across timescales (Candia et al., 2019), making early identification of future hits fundamentally challenging (Cheng et al., 2014; Martin et al., 2016). At the same time, digital platforms increasingly log online user interactions--searches, views, downloads, likes, and shares--that often precede these long-term dynamics. These temporal lead-lag dynamics are remarkably ubiquitous, spanning domains as diverse as science (Haque & Ginsparg, 2009), economics (Wu & Brynjolfsson, 2015), arts (Goel et al., 2010), culture (Gruhl et al., 2005), and social movements (Johnson et al., 2016). A systematic understanding of such lead-lag dynamics is crucial for anticipating and optimizing impact in digital ecosystems, and essential for designing effective strategies that identify and promote emerging innovations and products.
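To make the LLF setup concrete, here is a minimal toy sketch of the regression framing: summarize an early window of the lead channel, then predict the eventual total of the lag channel. Everything in it is an assumption for illustration and none of it comes from the paper's datasets or baselines: the data are synthetic, the names (`make_item`, `features_and_target`, `fit_line`) are hypothetical, and the 7-day window, 20-day shift, and log-log linear fit are arbitrary choices. Targets are log-transformed because the impact metrics described above are heavy-tailed.

```python
import math
import random


def make_item(rng, n_days=60, shift=20):
    """Synthetic item (assumption): a decaying lead channel and a lag
    channel that echoes it `shift` days later with multiplicative noise."""
    scale = rng.lognormvariate(2.0, 1.0)  # heavy-tailed popularity
    lead = [scale * math.exp(-t / 10.0) for t in range(n_days)]
    lag = [0.0] * n_days
    for t in range(shift, n_days):
        lag[t] = 0.1 * lead[t - shift] * rng.uniform(0.8, 1.2)
    return lead, lag


def features_and_target(lead, lag, window=7):
    """Feature: log of early lead activity. Target: log of total lag outcome."""
    x = math.log1p(sum(lead[:window]))
    y = math.log1p(sum(lag))
    return x, y


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)
items = [make_item(rng) for _ in range(200)]
xs, ys = zip(*(features_and_target(lead, lag) for lead, lag in items))

# Fit on the first 150 items, evaluate on the remaining 50.
slope, intercept = fit_line(xs[:150], ys[:150])
preds = [slope * x + intercept for x in xs[150:]]
mae = sum(abs(p - y) for p, y in zip(preds, ys[150:])) / len(preds)
print(f"slope={slope:.3f}, held-out MAE (log scale)={mae:.3f}")
```

In this toy setting the lag is a noisy, delayed copy of the lead, so a single early-window feature already predicts it well; real traces would call for richer features and the stronger baselines the paper benchmarks.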
Nov-7-2025