readout
Information Processing Capacity of Stationary Physical Systems: Theory, Data-efficient Estimation Methods, and Photonic Demonstration
Ramachandran, Rahul Uma, Massar, Serge
Physical computing systems provide a promising route toward hardware-native machine learning, but their computational capabilities remain difficult to characterize in a principled, task-independent, and data-efficient way. We extend the Information Processing Capacity (IPC) framework to stationary physical computing systems and establish several fundamental results: individual capacities are bounded between zero and one, their sum over a complete basis is bounded by the number of readouts, and noise strictly reduces this bound. We address the finite-sample estimation of IPC and derive the asymptotic form of the systematic positive bias affecting naive estimators. Building on these results, we introduce data-efficient estimation methods based on Richardson extrapolation and Sobol quasi-random sampling. We validate the framework experimentally using a photonic computing system based on picosecond laser pulses propagating through a nonlinear optical fibre. By varying the laser power and fibre length, we observe systematic shifts of the IPC distribution toward higher-order nonlinear capacities induced by the Kerr effect. Finally, we demonstrate that the total IPC strongly correlates with performance on benchmark machine-learning tasks and provides a reliable estimate of the effective dimensionality of the system. These results establish IPC as a practical bridge between the intrinsic dynamics of physical computing systems and their machine-learning performance.
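The positive finite-sample bias and its removal by Richardson extrapolation can be illustrated with a minimal sketch. It assumes a single readout, a leading bias term proportional to 1/N, and a target that is independent of the readout (so the true capacity is 0). The function names and the toy setup are hypothetical and are not taken from the paper.

```python
import random


def naive_capacity(target, readout):
    """Finite-sample capacity of one readout: squared normalized overlap with the target."""
    s_xz = sum(x * z for x, z in zip(readout, target))
    s_xx = sum(x * x for x in readout)
    s_zz = sum(z * z for z in target)
    return (s_xz * s_xz) / (s_xx * s_zz)


def richardson(c_n, c_2n):
    """Cancel a bias term proportional to 1/N using estimates at N and 2N samples."""
    return 2.0 * c_2n - c_n


def independent_estimate(rng, n):
    """Capacity estimate when target and readout are independent (true capacity is 0)."""
    target = [rng.gauss(0.0, 1.0) for _ in range(n)]
    readout = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return naive_capacity(target, readout)


def simulate(n=50, trials=2000, seed=0):
    """Average the naive and the extrapolated estimates over many independent trials."""
    rng = random.Random(seed)
    naive_total = 0.0
    extrapolated_total = 0.0
    for _ in range(trials):
        c_n = independent_estimate(rng, n)
        c_2n = independent_estimate(rng, 2 * n)
        naive_total += c_n
        extrapolated_total += richardson(c_n, c_2n)
    return naive_total / trials, extrapolated_total / trials


if __name__ == "__main__":
    naive_mean, extrapolated_mean = simulate()
    print(f"naive mean: {naive_mean:.4f}, extrapolated mean: {extrapolated_mean:.4f}")
```

In this toy setting the naive estimate averages to roughly 1/N even though the true capacity is zero, while the extrapolated value averages close to zero at the cost of a larger variance per trial.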
MaxSketch: Robust Distinct Counting in Streams via Random Projections
Tsikouras, Nikos, Caramanis, Constantine, Tzamos, Christos
Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only approximately similar -- for example, different images of the same individual may vary significantly at the pixel level. Classical sketches such as HyperLogLog rely on consistent hash values for identical elements and break down in this regime. Recent work on robust distinct counting in general metric spaces achieves $\widetilde{\Theta}(\sqrt{n})$ memory, which is tight in the worst case. We show that substantially improved memory guarantees are possible under geometric structure common in learned representations. We introduce MaxSketch, a simple max-linear sketch built from random Gaussian projections, and prove that it succeeds in estimating the number of distinct latent objects. Concretely, we show that under this assumption $m = \widetilde{O} (\log n / \varepsilon^2)$ random projections (and hence $\widetilde{O} (\log n/\varepsilon^2)$ memory) suffice to recover the true distinct count within a $(1+\varepsilon)$ factor. Experiments on image streams confirm that MaxSketch accurately estimates distinct counts and generalizes beyond the training regime. Our results bridge classical streaming algorithms and modern representation learning, showing how geometric structure can fundamentally reduce the complexity of distinct counting.
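A max-linear sketch over random Gaussian projections can be sketched as follows. This is only an illustration of the data structure: the class name `MaxLinearSketch` and its interface are hypothetical, and the estimator that maps the stored maxima to a distinct count is not specified in the abstract, so it is deliberately left out.

```python
import random


class MaxLinearSketch:
    """Keeps, for each of m random Gaussian directions, the largest projection seen so far."""

    def __init__(self, dim, m, seed=0):
        rng = random.Random(seed)
        # m random directions with i.i.d. standard normal entries (assumed construction).
        self.directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]
        self.maxima = [float("-inf")] * m

    def update(self, x):
        """Process one stream element; memory stays at m numbers regardless of stream length."""
        for j, g in enumerate(self.directions):
            projection = sum(gi * xi for gi, xi in zip(g, x))
            if projection > self.maxima[j]:
                self.maxima[j] = projection

    def merge(self, other):
        """Combine two sketches built with the same directions by an elementwise maximum."""
        self.maxima = [max(a, b) for a, b in zip(self.maxima, other.maxima)]


if __name__ == "__main__":
    objects = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [-2.0, 1.5, 1.0]]
    once = MaxLinearSketch(dim=3, m=4, seed=1)
    repeated = MaxLinearSketch(dim=3, m=4, seed=1)
    for x in objects:
        once.update(x)
    for x in objects + objects[::-1]:
        repeated.update(x)
    print(once.maxima == repeated.maxima)
```

Because each coordinate is a maximum of linear functions, exact repeats leave the sketch unchanged, and a small perturbation of an element can move a coordinate by at most the norm of the direction times the norm of the perturbation.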
Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Nguyen, Minh-Toan, Barbier, Jean
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
Dynamic Vine Copulas: Detecting and Quantifying Time-Varying Higher-Order Interactions
Safaai, Houman, Vargas, Alessandro Marin
Time-varying dependence is often modeled with dynamic correlations or Gaussian graphical models, but multivariate systems can change through tail behavior, asymmetry, or conditional structure even when correlations are nearly stable. We introduce Dynamic Vine Copulas (DVC), a temporal vine-copula framework for estimating and diagnosing sequence-wide non-Gaussian dependence. DVC fixes a chosen vine factorization for comparability; the framework applies to C-, D-, and R-vines, and our experiments use fixed-root-order C-vines. Pair-copula states evolve through smooth parameter trajectories or temporally regularized family-switching paths. The main diagnostic is a held-out comparison between a full vine and its matched 1-truncated version, which separates flexible first-tree pairwise dependence from evidence contributed by higher-tree conditional terms. At the population level, under a correct fixed vine and the simplifying assumption, this contrast equals the higher-tree component of a vine total-correlation decomposition; in finite samples, it is a predictive diagnostic. In controlled benchmarks, DVC detects Student-t degrees-of-freedom changes, Clayton-to-Gumbel switches, and recurrent conditional-interaction episodes missed or conflated by Gaussian dynamic baselines. The higher-tree score remains near zero in pairwise-only regimes and rises during conditional-interaction regimes. On Allen Visual Behavior Neuropixels data, DVC identifies a reproducible time-indexed higher-tree signal that is positive across held-out splits and vanishes under a decorrelated null, indicating simultaneous cross-area dependence. DVC therefore provides a flexible temporal copula model and an interpretable test of whether temporal dependence changes are pairwise or conditional.
0004d0b59e19461ff126e3a08a814c33-AuthorFeedback.pdf
We sincerely thank the reviewers for their careful reading, constructive questions and suggestions. We would very much like further exchanges to improve our work, but the following is our best effort within the current limits. First, we address questions that appeared at least twice. We write P1, P2 for paragraph references, and Rx for reviewers. We discuss two main motivations here: lack of graph loss, and empirical failure of distinguishing power.
interpretation of regularization
Blue arrows indicate the node feature vectors hv in the latent space, and the orange area or point indicates the possible range of the graph feature vector hG obtained by applying READOUT to hv. We elaborate on our motivation for the orthogonal regularization (15) proposed in Section 4.2.3. The main motivation for orthogonal regularization comes from (8) and (12): the node feature matrix H should become a full-rank matrix with a good condition number. Figure 5 visually demonstrates the geometric effect of attention-based READOUT and orthogonal regularization with two example node features h1 and h2. With conventional READOUT, only one graph feature vector hG can be obtained from the combination of the two node features, whereas with attention-based READOUT any vector within the orange rhombus can represent the whole graph. With orthogonal regularization, the region that the graph feature vector hG can cover becomes even larger, and H is less likely to have a nontrivial null space. Accordingly, the subspace spanned by H can be rich enough.
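The geometric picture can be checked numerically in two dimensions. The sketch below rests on assumptions that the text does not confirm: the attention weights are taken to lie independently in [0, 1], so the reachable region is the parallelogram spanned by h1 and h2, and the orthogonal regularizer is taken to be the squared Frobenius norm of H H^T - I, which may differ from the paper's equation (15). All function names are hypothetical.

```python
import math


def sum_readout(h1, h2):
    """Conventional READOUT: a single graph vector, the plain sum of node features."""
    return (h1[0] + h2[0], h1[1] + h2[1])


def attention_readout(h1, h2, a1, a2):
    """Attention-based READOUT with weights a1, a2 in [0, 1] (assumed weight range)."""
    return (a1 * h1[0] + a2 * h2[0], a1 * h1[1] + a2 * h2[1])


def reachable_area(h1, h2):
    """Area of the parallelogram of graph vectors reachable with weights in [0, 1]."""
    return abs(h1[0] * h2[1] - h1[1] * h2[0])


def orthogonality_penalty(h1, h2):
    """Squared Frobenius norm of H H^T - I for H with rows h1 and h2 (assumed form)."""
    g11 = h1[0] * h1[0] + h1[1] * h1[1]
    g22 = h2[0] * h2[0] + h2[1] * h2[1]
    g12 = h1[0] * h2[0] + h1[1] * h2[1]
    return (g11 - 1.0) ** 2 + 2.0 * g12 ** 2 + (g22 - 1.0) ** 2


def unit_vector(angle_degrees):
    angle = math.radians(angle_degrees)
    return (math.cos(angle), math.sin(angle))


if __name__ == "__main__":
    h1 = unit_vector(0.0)
    for angle in (30.0, 90.0):
        h2 = unit_vector(angle)
        print(angle, reachable_area(h1, h2), orthogonality_penalty(h1, h2))
```

For unit-length node features the reachable area equals the sine of the angle between them, so driving the penalty toward zero (orthogonal features) maximizes the region that the attention-based graph vector can cover.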
Retrospective for the Dynamic Sensorium Competition for predicting large-scale mouse primary visual cortex activity from videos
Understanding how biological visual systems process information is challenging because of the nonlinear relationship between visual input and neuronal responses. Artificial neural networks allow computational neuroscientists to create predictive models that connect biological and machine vision. Machine learning has benefited tremendously from benchmarks that compare different models on the same task under standardized conditions. However, there was no standardized benchmark to identify state-of-the-art dynamic models of the mouse visual system. To address this gap, we established the SENSORIUM 2023 Benchmark Competition with dynamic input, featuring a new large-scale dataset from the primary visual cortex of ten mice.
d921c3c762b1522c475ac8fc0811bb0f-AuthorFeedback.pdf
We wish to thank all of the reviewers for their time and thorough reading of our paper! We appreciate the reviewers' suggestions regarding clarity. We have added the suggested summary sentence "the key We started with binary sentiment classification, but are actively working on more tasks. RNN hidden states onto the top two PCs for two different input sequences that differ only by two tokens (replacing ' The trajectories start out the same as the initial tokens are identical. We have added a footnote noting this in the main text.