 structure function


Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heaps' and Zipf's laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors observed in LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, and factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.


Loss-Complexity Landscape and Model Structure Functions

arXiv.org Artificial Intelligence

We develop a framework for dualizing the Kolmogorov structure function $h_x(\alpha)$, which then allows using computable complexity proxies. We establish a mathematical analogy between information-theoretic constructs and statistical mechanics, introducing a suitable partition function and free energy functional. We explicitly prove the Legendre-Fenchel duality between the structure function and free energy, showing detailed balance of the Metropolis kernel, and interpret acceptance probabilities as information-theoretic scattering amplitudes. A susceptibility-like variance of model complexity is shown to peak precisely at loss-complexity trade-offs interpreted as phase transitions. Practical experiments with linear and tree-based regression models verify these theoretical predictions, explicitly demonstrating the interplay between the model complexity, generalization, and overfitting threshold.
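
For orientation, the standard textbook definition of the Kolmogorov structure function is sketched below. This is the usual finite-set formulation, not necessarily the exact form or notation used in the paper: $S$ ranges over finite sets containing the string $x$, and $K(S)$ denotes the Kolmogorov complexity of $S$.

```latex
h_x(\alpha) \;=\; \min_{S} \bigl\{ \log_2 |S| \;:\; x \in S,\ K(S) \le \alpha \bigr\}
```

Read this way, $h_x(\alpha)$ is the smallest log-size of a model set that contains $x$ and can itself be described in at most $\alpha$ bits, which is why it is non-increasing in $\alpha$ and traces a loss-versus-complexity curve. Because $K$ is uncomputable, any practical use has to substitute a computable complexity proxy.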


Machine Learning Neutrino-Nucleus Cross Sections

arXiv.org Artificial Intelligence

Neutrino-nucleus scattering cross sections are critical theoretical inputs for long-baseline neutrino oscillation experiments. However, robust modeling of these cross sections remains challenging. For a simple but physically motivated toy model of the DUNE experiment, we demonstrate that an accurate neural-network model of the cross section -- leveraging Standard Model symmetries -- can be learned from near-detector data. We then perform a neutrino oscillation analysis with simulated far-detector events, finding that the modeled cross section achieves results consistent with what could be obtained if the true cross section were known exactly. This proof-of-principle study highlights the potential of future neutrino near-detector datasets and data-driven cross-section models.


The cell signaling structure function

arXiv.org Artificial Intelligence

Live cell microscopy captures 5-D $(x,y,z,channel,time)$ movies that display patterns of cellular motion and signaling dynamics. We present here an approach to finding spatiotemporal patterns of cell signaling dynamics in 5-D live cell microscopy movies that is unique in requiring no a priori knowledge of expected pattern dynamics and no training data. The proposed cell signaling structure function (SSF) is a Kolmogorov structure function that optimally measures cell signaling state as nuclear intensity w.r.t. surrounding cytoplasm, a significant improvement compared to the current state-of-the-art cytonuclear ratio. SSF kymographs store at each spatiotemporal cell centroid the SSF value, or a functional output such as velocity. Patterns of similarity are identified via the metric normalized compression distance (NCD). The NCD is a reproducing kernel for a Hilbert space that represents the input SSF kymographs as points in a low dimensional embedding that optimally captures the pattern similarity identified by the NCD throughout the space. The only parameter is the expected cell radii ($\mu m$). A new formulation of the cluster structure function optimally estimates how meaningful an embedding from the RKHS representation is. Results are presented quantifying the impact of ERK and AKT signaling between different oncogenic mutations, and the relation between ERK signaling and cellular velocity patterns, for movies of 2-D monolayers of human breast epithelial (MCF10A) cells, 3-D MCF10A spheroids under optogenetic manipulation of ERK, and human induced pluripotent stem cells.


Neural network based generation of a 1-dimensional stochastic field with turbulent velocity statistics

arXiv.org Machine Learning

We define and study a fully-convolutional neural network stochastic model, NN-Turb, which generates a 1-dimensional field with some turbulent velocity statistics. In particular, the generated process satisfies the Kolmogorov 2/3 law for the second-order structure function. It also presents negative skewness across scales (i.e., the Kolmogorov 4/5 law) and exhibits intermittency as characterized by skewness and flatness. Furthermore, our model is never in contact with turbulent data and only needs the desired statistical behavior of the structure functions across scales for training.
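
As a rough illustration of the quantity involved, the order-p structure function of a sampled 1-D field u can be estimated as S_p(r) = mean(|u(x + r) - u(x)|^p), and the Kolmogorov 2/3 law states that S_2(r) scales as r^(2/3) in the inertial range. The sketch below is not the NN-Turb code; the function names are hypothetical, and the test signal is a smooth linear ramp, for which the expected log-log slope is 2 rather than 2/3.

```python
import math


def structure_function(u, r, p=2):
    """Estimate S_p(r) = mean(|u[i + r] - u[i]|**p) for a 1-D sequence u."""
    if not 0 < r < len(u):
        raise ValueError("separation r must satisfy 0 < r < len(u)")
    increments = [abs(u[i + r] - u[i]) ** p for i in range(len(u) - r)]
    return sum(increments) / len(increments)


def scaling_exponent(u, r1, r2, p=2):
    """Log-log slope of S_p between two separations r1 and r2."""
    s1 = structure_function(u, r1, p)
    s2 = structure_function(u, r2, p)
    return math.log(s2 / s1) / math.log(r2 / r1)


# Hypothetical smooth test field: u(x) = 0.5 * x, so S_2(r) = (0.5 * r)**2.
ramp = [0.5 * i for i in range(100)]

print("S_2(4):", structure_function(ramp, 4))
print("slope: ", scaling_exponent(ramp, 2, 8))
```

Applied to a field that obeys the 2/3 law, the same slope estimate taken over inertial-range separations would be expected to come out near 2/3.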


Turbulence Scaling from Deep Learning Diffusion Generative Models

arXiv.org Artificial Intelligence

Complex spatial and temporal structures are inherent characteristics of turbulent fluid flows and comprehending them poses a major challenge. This comprehension necessitates an understanding of the space of turbulent fluid flow configurations. We employ a diffusion-based generative model to learn the distribution of turbulent vorticity profiles and generate snapshots of turbulent solutions to the incompressible Navier-Stokes equations. We consider the inverse cascade in two spatial dimensions and generate diverse turbulent solutions that differ from those in the training dataset. We analyze the statistical scaling properties of the new turbulent profiles, calculate their structure functions, energy power spectrum, velocity probability distribution function and moments of local energy dissipation. All the learnt scaling exponents are consistent with the expected Kolmogorov scaling and have lower errors than the training ones. This agreement with established turbulence characteristics provides strong evidence of the model's capability to capture essential features of real-world turbulence.


Synthetic Lagrangian Turbulence by Generative Diffusion Models

arXiv.org Artificial Intelligence

Lagrangian turbulence lies at the core of numerous applied and fundamental problems related to the physics of dispersion and mixing in engineering, bio-fluids, atmosphere, oceans, and astrophysics. Despite exceptional theoretical, numerical, and experimental efforts conducted over the past thirty years, no existing models are capable of faithfully reproducing statistical and topological properties exhibited by particle trajectories in turbulence. We propose a machine learning approach, based on a state-of-the-art Diffusion Model, to generate single-particle trajectories in three-dimensional turbulence at high Reynolds numbers, thereby bypassing the need for direct numerical simulations or experiments to obtain reliable Lagrangian data. Our model demonstrates the ability to quantitatively reproduce all relevant statistical benchmarks over the entire range of time scales, including the presence of fat-tailed distributions for the velocity increments, an anomalous power law, and enhancement of intermittency around the dissipative scale. The model exhibits good generalizability for extreme events, achieving unprecedented intensity and rarity. This paves the way for producing synthetic high-quality datasets for pre-training various downstream applications of Lagrangian turbulence.


An Algorithmic Approach to Emergence

arXiv.org Artificial Intelligence

Emergence is a concept often referred to in the study of complex systems. Coined in 1875 by the philosopher George H. Lewes in his book Problems of Life and Mind [1], the term has ever since mainly been used in qualitative discussions [2, 3]. In most contexts, emergence refers to the phenomenon by which novel properties arise in a complex system which is composed of a large quantity of simpler subsystems that do not exhibit those novel properties by themselves, but rather through their collective interactions. The following citation from Wikipedia [4] reflects this popular idea: "For instance, the phenomenon of life as studied in biology is an emergent property of chemistry, and psychological phenomena emerge from the neurobiological phenomena of living things". For claims such as the above to have a precise meaning, an objective definition of emergence must be provided. Current definitions are framed around a qualitative evaluation of the "novelty" of properties exhibited by a system with respect


Towards a Numerical Proof of Turbulence Closure

arXiv.org Artificial Intelligence

The development of turbulence closure models, parametrizing the influence of small non-resolved scales on the dynamics of large resolved ones, is an outstanding theoretical challenge with vast applicative relevance. We present a closure, based on deep recurrent neural networks, that quantitatively reproduces, within statistical errors, Eulerian and Lagrangian structure functions and the intermittent statistics of the energy cascade, including those of subgrid fluxes. To achieve high-order statistical accuracy, and thus a stringent statistical test, we employ shell models of turbulence. Our results encourage the development of similar approaches for 3D Navier-Stokes turbulence. Turbulence is the chaotic and ubiquitous dynamics of fluids, almost unavoidable for high velocity flows. Key to a vast number of environmental and industrial flows [15], 3D turbulence is characterized by a nonlinear forward energy cascade from large scales, where energy is injected, to smaller scales, where it is dissipated via viscous friction [1].


Deep learning velocity signals allows to quantify turbulence intensity

arXiv.org Artificial Intelligence

CNR-IAC, Rome, Italy

Turbulence, the ubiquitous and chaotic state of fluid motions, is characterized by strong and statistically nontrivial fluctuations of the velocity field, over a wide range of length- and timescales, and it can be quantitatively described only in terms of statistical averages. Strong non-stationarities hinder the possibility to achieve statistical convergence, making it impossible to define the turbulence intensity and, in particular, its basic dimensionless estimator, the Reynolds number. Here we show that by employing Deep Neural Networks (DNN) we can accurately estimate the Reynolds number within 15% accuracy, from a statistical sample as small as two large-scale eddy-turnover times. In contrast, physics-based statistical estimators are limited by the rate of convergence of the central limit theorem, and provide, for the same statistical sample, an error at least 100 times larger. Our findings open up new perspectives in the possibility to quantitatively define and, therefore, study highly non-stationary turbulent flows as ordinarily found in nature as well as in industrial processes. Turbulence is characterized by complex statistics of velocity fluctuations correlated over a wide range of temporal and spatial scales.