Statistical Learning
959ef477884b6ac2241b19ee4fb776ae-AuthorFeedback.pdf
The proposed group bilinear requires the intra-group channels to be highly5 correlated (refer tothedefinitioninQ3.1),andtheproposed semantic grouping canbetter satisfy suchrequirements6 than MA-CNN [9]. Specifically,[9] adopts the idea of k-means, which optimizes each channel to its cluster center.7 Note that the notations aboveare the same with Eqn.16 (3),andthepairwisecorrelationis dij = Thanks foryour comments.Aisanapproximate indexmapping20 matrix, whose rows are constrained to be (approximate) one-hot vectors via asoftmax with small "temperature".21 Q2.2Inconsistentnotations. Thanks for your comments, and we will correct the notation "stage 3,4" into "Stage28 IV,V"respectively. Designing suitable grouping methods plays a42 keyrole.
Deriving Neural Scaling Laws from the statistics of natural language
Cagnetta, Francesco, Raventรณs, Allan, Ganguli, Surya, Wyart, Matthieu
Despite the fact that experimental neural scaling laws have substantially guided empirical progress in large-scale machine learning, no existing theory can quantitatively predict the exponents of these important laws for any modern LLM trained on any natural language dataset. We provide the first such theory in the case of data-limited scaling laws. We isolate two key statistical properties of language that alone can predict neural scaling exponents: (i) the decay of pairwise token correlations with time separation between token pairs, and (ii) the decay of the next-token conditional entropy with the length of the conditioning context. We further derive a simple formula in terms of these statistics that predicts data-limited neural scaling exponents from first principles without any free parameters or synthetic data models. Our theory exhibits a remarkable match with experimentally measured neural scaling laws obtained from training GPT-2 and LLaMA style models from scratch on two qualitatively different benchmarks, TinyStories and WikiText.
PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers
Bastian, Julien, Leblanc, Benjamin, Germain, Pascal, Habrard, Amaury, Largeron, Christine, Metzler, Guillaume, Morvant, Emilie, Viallard, Paul
Classical PAC generalization bounds on the prediction risk of a classifier are insufficient to provide theoretical guarantees on fairness when the goal is to learn models balancing predictive risk and fairness constraints. We propose a PAC-Bayesian framework for deriving generalization bounds for fairness, covering both stochastic and deterministic classifiers. For stochastic classifiers, we derive a fairness bound using standard PAC-Bayes techniques. Whereas for deterministic classifiers, as usual PAC-Bayes arguments do not apply directly, we leverage a recent advance in PAC-Bayes to extend the fairness bound beyond the stochastic setting. Our framework has two advantages: (i) It applies to a broad class of fairness measures that can be expressed as a risk discrepancy, and (ii) it leads to a self-bounding algorithm in which the learning procedure directly optimizes a trade-off between generalization bounds on the prediction risk and on the fairness. We empirically evaluate our framework with three classical fairness measures, demonstrating not only its usefulness but also the tightness of our bounds.
Mixed-Integer Programming for Change-point Detection
Narula, Apoorva, Dey, Santanu S., Xie, Yao
We present a new mixed-integer programming (MIP) approach for offline multiple change-point detection by casting the problem as a globally optimal piecewise linear (PWL) fitting problem. Our main contribution is a family of strengthened MIP formulations whose linear programming (LP) relaxations admit integral projections onto the segment assignment variables, which encode the segment membership of each data point. This property yields provably tighter relaxations than existing formulations for offline multiple change-point detection. We further extend the framework to two settings of active research interest: (i) multidimensional PWL models with shared change-points, and (ii) sparse change-point detection, where only a subset of dimensions undergo structural change. Extensive computational experiments on benchmark real-world datasets demonstrate that the proposed formulations achieve reductions in solution times under both $\ell_1$ and $\ell_2$ loss functions in comparison to the state-of-the-art.
Locally Interpretable Individualized Treatment Rules for Black-Box Decision Models
Charvadeh, Yasin Khadem, Panageas, Katherine S., Chen, Yuan
Existing methods typically rely on either interpretable but inflexible models or highly flexible black-box approaches that sacrifice interpretability; moreover, most impose a single global decision rule across patients. We introduce the Locally Interpretable Individualized Treatment Rule (LI-ITR) method, which combines flexible machine learning models to accurately learn complex treatment outcomes with locally interpretable approximations to construct subject-specific treatment rules. LI-ITR employs variational autoencoders to generate realistic local synthetic samples and learns individualized decision rules through a mixture of interpretable experts. Simulation studies show that LI-ITR accurately recovers true subject-specific local coefficients and optimal treatment strategies. An application to precision side-effect management in breast cancer illustrates the necessity of flexible predictive modeling and highlights the practical utility of LI-ITR in estimating optimal treatment rules while providing transparent, clinically interpretable explanations.