Are GANs Created Equal? A Large-Scale Study
Generative adversarial networks (GANs) are a powerful subclass of generative models. Despite very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithms perform better than others. We conduct a neutral, multi-faceted, large-scale empirical study of state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise more from a higher computational budget and tuning than from fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several datasets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures.
A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation
Leite, João A., Arora, Arnav, Gargova, Silvia, Luz, João, Sampaio, Gustavo, Roberts, Ian, Scarton, Carolina, Bontcheva, Kalina
While Large Language Models (LLMs) have made agentic AI, chatbots, and other intelligent applications possible, they have also enabled the affordable creation of highly convincing AI-generated disinformation (Bontcheva et al., 2024), which poses a systemic risk to democratic stability and global security (VIGINUM, 2025; Bengio, 2025). Initially, AI-generated texts suffered from linguistic mistakes and thus were more easily detectable by humans. However, modern LLMs, particularly instruction-tuned models, have significantly improved in producing outputs which are indistinguishable from human-written text (Spitale et al., 2023; Heppell et al., 2024). These advances have resulted in their misuse in generating persuasive disinformation narratives, including political manipulation, health disinformation, conspiracy propagation, and Foreign Information Manipulation and Interference (FIMI) (Vykopal et al., 2024; Chen and Shu, 2024a; Barman et al., 2024; Chen and Shu, 2024b; Heppell et al., 2024; VIGINUM, 2025). While there is a growing body of research on the generation and detection of LLM-produced disinformation (Chen and Shu, 2024a; Lucas et al., 2023; Vykopal et al., 2024; Heppell et al., 2024), a critical aspect remains largely unstudied: namely, whether LLMs are capable of generating fluent and convincing personalised disinformation (i.e., disinformation narratives tailored to specific audiences) in multiple languages and at scale. The few prior studies on AI-generated personalised disinformation are limited to English and address a very narrow set of personas (e.g., students, parents) (Zugecova et al., 2024). Crucially, prior work has not yet examined whether LLMs can adapt disinformation to country-specific linguistic and cultural contexts in multiple languages.
Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference
Finkelshtein, Ben, Cucerzan, Silviu, Jauhar, Sujay Kumar, White, Ryen
Large language models (LLMs) are increasingly used for text-rich graph machine learning tasks such as node classification in high-impact domains like fraud detection and recommendation systems. Yet, despite a surge of interest, the field lacks a principled understanding of the capabilities of LLMs in their interaction with graph data. In this work, we conduct a large-scale, controlled evaluation across several key axes of variability to systematically assess the strengths and weaknesses of LLM-based graph reasoning methods in text-based applications. The axes include the LLM-graph interaction mode, comparing prompting, tool-use, and code generation; dataset domains, spanning citation, web-link, e-commerce, and social networks; structural regimes contrasting homophilic and heterophilic graphs; feature characteristics involving both short- and long-text node attributes; and model configurations with varying LLM sizes and reasoning capabilities. We further analyze dependencies by methodically truncating features, deleting edges, and removing labels to quantify reliance on input types. Our findings provide practical and actionable guidance. (1) LLMs as code generators achieve the strongest overall performance on graph data, with especially large gains on long-text or high-degree graphs where prompting quickly exceeds the token budget. (2) All interaction strategies remain effective on heterophilic graphs, challenging the assumption that LLM-based methods collapse under low homophily. (3) Code generation is able to flexibly adapt its reliance between structure, features, or labels to leverage the most informative input type. Together, these findings provide a comprehensive view of the strengths and limitations of current LLM-graph interaction modes and highlight key design principles for future approaches.
Reviews: Are GANs Created Equal? A Large-Scale Study
This paper introduces a large set of experiments to compare recently proposed GANs. It discusses two previously proposed measures -- Inception Score (IS) and Fréchet Inception Distance (FID) -- and it proposes a new measure in the context of GAN assessment, based on precision, recall, and F1. Precision (P) is measured as the fraction of generated samples with distance below a pre-defined threshold \delta, while recall (R) is measured as the fraction of inversely generated samples (from the test set) with squared Euclidean distance below \delta (F1 is the usual harmonic mean of P and R). The paper argues that IS only measures precision while FID measures both, so IS is essentially dropped as a measurement for GANs. The paper then argues that it is important to report the mean and variance of FID and P-R-F1 measurements instead of the best values, computed over a set of random initialisations and hyper-parameter search points.
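The precision/recall/F1 procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the distances of generated samples (to the data manifold) and the reconstruction errors of inverted test samples have already been computed, and the function name and argument layout are hypothetical.

```python
import numpy as np

def precision_recall_f1(generated_dists, inverted_test_errors, delta):
    """Threshold-based precision/recall/F1 for GAN assessment (sketch).

    generated_dists: distances of generated samples to the data manifold.
    inverted_test_errors: squared Euclidean reconstruction errors of test
        samples mapped back ("inverted") into the latent space.
    delta: the pre-defined distance threshold.
    """
    # Precision: fraction of generated samples within delta of the manifold.
    precision = float(np.mean(np.asarray(generated_dists) < delta))
    # Recall: fraction of test samples invertible with error below delta.
    recall = float(np.mean(np.asarray(inverted_test_errors) < delta))
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with generated distances `[0.1, 0.5, 0.9]`, inversion errors `[0.2, 0.8]`, and `delta = 0.6`, the sketch yields P = 2/3 and R = 1/2.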
Large-Scale Study of Temporal Shift in Health Insurance Claims
Ji, Christina X, Alaa, Ahmed M, Sontag, David
Most machine learning models for predicting clinical outcomes are developed using historical data. Yet, even if these models are deployed in the near future, dataset shift over time may result in less than ideal performance. To capture this phenomenon, we consider a task--that is, an outcome to be predicted at a particular time point--to be non-stationary if a historical model is no longer optimal for predicting that outcome. We build an algorithm to test for temporal shift either at the population level or within a discovered sub-population. Then, we construct a meta-algorithm to perform a retrospective scan for temporal shift on a large collection of tasks. Our algorithms enable us to perform the first comprehensive evaluation of temporal shift in healthcare to our knowledge. We create 1,010 tasks by evaluating 242 healthcare outcomes for temporal shift from 2015 to 2020 on a health insurance claims dataset. 9.7% of the tasks show temporal shifts at the population level, and 93.0% have some sub-population affected by shifts. We dive into case studies to understand the clinical implications. Our analysis highlights the widespread prevalence of temporal shifts in healthcare.
Large-Scale Study of Curiosity-Driven Learning
Reinforcement learning algorithms rely on carefully engineered environment rewards that are extrinsic to the agent. However, annotating each environment with hand-designed, dense rewards is not scalable, motivating the development of reward functions that are intrinsic to the agent. Curiosity is a type of intrinsic reward function which uses prediction error as the reward signal. In this paper, we perform the first large-scale study of purely curiosity-driven learning, i.e., without any extrinsic rewards, across 54 standard benchmark environments, including the Atari game suite. Our results show surprisingly good performance and a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments.
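The core mechanism described above (prediction error as intrinsic reward) can be illustrated with a toy sketch. This is not the paper's learned network: the forward model here is a fixed random linear map, and all names are hypothetical, but the reward computation follows the same idea -- the agent is rewarded for visiting transitions its dynamics model predicts poorly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy forward dynamics model: predicts the next state from (state, action)
# via a fixed random linear map. In curiosity-driven learning this model
# would be learned alongside the policy.
W = rng.normal(size=(4, 4))

def forward_model(state, action):
    # Hypothetical dynamics prediction.
    return W @ state + 0.1 * action

def curiosity_reward(state, action, next_state):
    # Intrinsic reward = squared prediction error of the forward model.
    predicted = forward_model(state, action)
    return float(np.sum((predicted - next_state) ** 2))
```

A transition the model predicts exactly yields zero intrinsic reward, while surprising transitions yield large rewards, pushing the agent toward novel states.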
Who Does What on the Web: A Large-Scale Study of Browsing Behavior
Goel, Sharad (Yahoo! Research) | Hofman, Jake M. (Yahoo! Research) | Sirer, M. Irmak (Northwestern University)
As the Web has become integrated into daily life, understanding how individuals spend their time online impacts domains ranging from public policy to marketing. It is difficult, however, to measure even simple aspects of browsing behavior via conventional methods---including surveys and site-level analytics---due to limitations of scale and scope. In part addressing these limitations, large-scale Web panel data are a relatively novel means for investigating patterns of Internet usage. In one of the largest studies of browsing behavior to date, we pair Web histories for 250,000 anonymized individuals with user-level demographics---including age, sex, race, education, and income---to investigate three topics. First, we examine how behavior changes as individuals spend more time online, showing that the heaviest users devote nearly twice as much of their time to social media relative to typical individuals. Second, we revisit the digital divide, finding that the frequency with which individuals turn to the Web for research, news, and healthcare is strongly related to educational background, but not as closely tied to gender and ethnicity. Finally, we demonstrate that browsing histories are a strong signal for inferring user attributes, including ethnicity and household income, a result that may be leveraged to improve ad targeting.