AITopics | generalization loss

generalization loss

08fe4b20d554296e503f5a43795c78d6-Supplemental-Conference.pdf

Neural Information Processing Systems, Apr-24-2026, 10:53:08 GMT

A.1 Proof of Theorem 2. Consider an instance of semi-supervised active regression with the labeled dataset X_lab being empty, and with the unlabeled dataset composed of k copies of the standard basis vector e_i for each i = 1, ..., d.

algorithm 2, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.54)

de3d2bb604cfc43c81edd2a31b257f03-Paper-Conference.pdf

Neural Information Processing Systems, Feb-12-2026, 08:48:34 GMT

Building upon the dataset distillation framework, we make a key observation that a shared common representation allows for more efficient and effective distillation.

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

9b8b50fb590c590ffbf1295ce92258dc-Supplemental.pdf

Neural Information Processing Systems, Feb-9-2026, 12:55:30 GMT

eigenvalue, equation, simulation, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
North America > Canada (0.04)
Europe > France (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)

9b8b50fb590c590ffbf1295ce92258dc-Supplemental.pdf

Neural Information Processing Systems, Aug-15-2025, 09:13:29 GMT

eigenvalue, equation, generalization loss, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > France (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.46)

An Asymptotic Equation Linking WAIC and WBIC in Singular Models

Hayashi, Naoki, Kutsuna, Takuro, Takamuku, Sawa

arXiv.org Machine Learning, May-22-2025

In statistical learning, models are classified as regular or singular depending on whether the mapping from parameters to probability distributions is injective. Most models with hierarchical structures or latent variables are singular, for which conventional criteria such as the Akaike Information Criterion and the Bayesian Information Criterion are inapplicable due to the breakdown of normal approximations for the likelihood and posterior. To address this, the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC) have been proposed. Since WAIC and WBIC are computed using posterior distributions at different temperature settings, separate posterior sampling is generally required. In this paper, we theoretically derive an asymptotic equation that links WAIC and WBIC, despite their dependence on different posteriors. This equation yields an asymptotically unbiased expression of WAIC in terms of the posterior distribution used for WBIC. The result clarifies the structural relationship between these criteria within the framework of singular learning theory, and deepens understanding of their asymptotic behavior. This theoretical contribution provides a foundation for future developments in the computational efficiency of model selection in singular models.
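As a rough illustration of why the two criteria normally need separate sampling runs, the sketch below computes each one from its own set of posterior draws. It assumes the standard definitions from singular learning theory: WAIC is the empirical log loss of the Bayes predictive distribution plus the summed posterior variance of the log-likelihood divided by n (posterior at inverse temperature 1), and WBIC is the posterior mean of the total negative log-likelihood at inverse temperature 1/log n. The function names and the input layout (`loglik[s][i]` for draw s and data point i) are hypothetical, and this is not the linking equation derived in the paper.

```python
import math


def waic(loglik):
    """WAIC per data point from draws of the ordinary posterior (beta = 1).

    loglik[s][i] is log p(x_i | w_s) for posterior draw w_s (assumed layout).
    """
    num_draws = len(loglik)
    n = len(loglik[0])
    train_loss = 0.0      # T_n: empirical loss of the Bayes predictive
    functional_var = 0.0  # V_n: summed posterior variance of the log-likelihood
    for i in range(n):
        col = [loglik[s][i] for s in range(num_draws)]
        # log of the posterior-mean likelihood, computed stably (log-sum-exp)
        peak = max(col)
        log_mean_lik = peak + math.log(
            sum(math.exp(c - peak) for c in col) / num_draws
        )
        train_loss -= log_mean_lik / n
        mean = sum(col) / num_draws
        functional_var += sum((c - mean) ** 2 for c in col) / num_draws
    return train_loss + functional_var / n


def wbic(loglik_tempered):
    """WBIC from draws of the tempered posterior (beta = 1 / log n).

    Returns the average over draws of the total negative log-likelihood.
    """
    num_draws = len(loglik_tempered)
    return sum(-sum(row) for row in loglik_tempered) / num_draws
```

The two functions take different inputs on purpose: `waic` expects draws from the ordinary posterior, while `wbic` expects draws from the tempered one, which is the duplication of sampling effort the abstract refers to.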

artificial intelligence, bayesian inference, machine learning, (17 more...)

arXiv.org Machine Learning

2505.13902

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models

Belenki, Lior, Agarwal, Alekh, Shi, Tianze, Toutanova, Kristina

arXiv.org Artificial Intelligence, Feb-21-2025

We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures. Experiments with Transformer decoder-only language models in the range of 70M to 1B parameters on the SlimPajama dataset show that our method achieves significantly better performance than approaches that train regression models using only the mixture rates as input features. Combining this improved optimization method with an objective that takes into account cross-entropy on end task data leads to superior performance on few-shot downstream evaluations. We also provide theoretical insights on why aggregation of data expert predictions can provide good approximations to model losses for data mixtures.

language model, regression model, validation domain, (16 more...)

arXiv.org Artificial Intelligence

2502.1595

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(8 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.56)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Monotonic Learning in the PAC Framework: A New Perspective

Li, Ming, Zhang, Chenyi, Li, Qin

arXiv.org Artificial Intelligence, Jan-9-2025

Monotone learning refers to learning processes in which expected performance consistently improves as more training data is introduced. Non-monotone behavior of machine learning has been the topic of a series of recent works, with various proposals that ensure monotonicity by applying transformations or wrappers on learning algorithms. In this work, from a different perspective, we tackle the topic of monotone learning within the framework of Probably Approximately Correct (PAC) learning theory. Following the mechanism that estimates sample complexity of a PAC-learnable problem, we derive a performance lower bound for that problem, and prove the monotonicity of that bound as the sample sizes increase. By calculating the lower bound distribution, we are able to prove that given a PAC-learnable problem with a hypothesis space that is either of finite size or of finite VC dimension, any learning algorithm based on Empirical Risk Minimization (ERM) is monotone if training samples are independent and identically distributed (i.i.d.). We further carry out an experiment on two concrete machine learning problems, one with a finite hypothesis set and the other with finite VC dimension, and compare the experimental data for the empirical risk distributions with the estimated theoretical bound. The results of the comparison confirm the monotonicity of learning for the two PAC-learnable problems.
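To make the link between sample complexity and a monotone bound concrete, here is a minimal sketch using the textbook realizable-case PAC bound for a finite hypothesis class: with probability at least 1 - delta, an ERM hypothesis has true error at most (ln|H| + ln(1/delta)) / m. This is an assumed stand-in for illustration, not the bound derived in the paper, and the function names are hypothetical.

```python
import math


def realizable_sample_complexity(h_size, epsilon, delta):
    """Smallest m for which the textbook bound guarantees error <= epsilon."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


def realizable_error_bound(h_size, m, delta):
    """Error guarantee after m i.i.d. samples (capped at 1, the trivial bound)."""
    return min(1.0, (math.log(h_size) + math.log(1.0 / delta)) / m)


# The guarantee shrinks as the sample size grows, for fixed |H| and delta.
bounds = [realizable_error_bound(1000, m, 0.05) for m in (50, 100, 200, 400)]
```

Inverting the sample-complexity formula for the error gives a quantity proportional to 1/m, so this guarantee can only improve as m grows. That mirrors, in the simplest setting, the kind of monotone behavior the abstract describes.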

artificial intelligence, generalization loss, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2501.05493

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.36)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Grokking at the Edge of Linear Separability

Beck, Alon, Levi, Noam, Bar-Sinai, Yohai

arXiv.org Machine Learning, Oct-6-2024

We study the generalization properties of binary logistic classification in a simplified setting, for which a "memorizing" and "generalizing" solution can always be strictly defined, and elucidate empirically and analytically the mechanism underlying Grokking in its dynamics. We analyze the asymptotic long-time dynamics of logistic classification on a random feature model with a constant label and show that it exhibits Grokking, in the sense of delayed generalization and non-monotonic test loss. We find that Grokking is amplified when classification is applied to training sets which are on the verge of linear separability. Even though a perfect generalizing solution always exists, we prove the implicit bias of the logistic loss will cause the model to overfit if the training data is linearly separable from the origin. For training sets that are not separable from the origin, the model will always generalize perfectly asymptotically, but overfitting may occur at early stages of training. Importantly, in the vicinity of the transition, that is, for training sets that are almost separable from the origin, the model may overfit for arbitrarily long times before generalizing. We gain more insights by examining a tractable one-dimensional toy model that quantitatively captures the key features of the full model. Finally, we highlight intriguing common properties of our findings with recent literature, suggesting that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.

accuracy, diverge, grokking, (16 more...)

arXiv.org Machine Learning

2410.04489

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)

Grokking in Linear Estimators -- A Solvable Model that Groks without Understanding

Levi, Noam, Beck, Alon, Bar-Sinai, Yohai

arXiv.org Machine Learning, Oct-25-2023

Understanding the underlying correlations in complex datasets is the main challenge of statistical learning. Assuming that training and generalization data are drawn from a similar distribution, the discrepancy between training and generalization metrics quantifies how well a model extracts meaningful features from the training data, and what portion of its reasoning is based on idiosyncrasies in the training data. Traditionally, one would expect that once a neural network (NN) training converges to a low loss value, the generalization error should either plateau, for good models, or deteriorate for models that overfit. Surprisingly, [18] found that a shallow transformer trained on algorithmic datasets features drastically different dynamics. The network first overfits the training data, achieving low and stable training loss with high generalization error for an extended period, then suddenly and rapidly transitions to a perfect generalization phase. This counter-intuitive phenomenon, dubbed grokking, has recently garnered much attention and many underlying mechanisms have been proposed as possible explanations. These include the difficulty of representation learning [10], the scale of parameters at initialization [11], spikes in loss ("slingshots") [21], random walks among optimal solutions [15], and the simplicity of the generalising solution [16, Appendix E]. In this paper we take a different approach, leveraging the simplest possible models which still display grokking - linear estimators.

artificial intelligence, machine learning, matrix, (19 more...)

arXiv.org Machine Learning

2310.16441

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

PAC-Bayesian bounds for learning LTI-ss systems with input from empirical loss

Eringis, Deividas, Leth, John, Tan, Zheng-Hua, Wisniewski, Rafael, Petreczky, Mihaly

arXiv.org Artificial Intelligence, Mar-29-2023

In this paper we derive a Probably Approximately Correct (PAC)-Bayesian error bound for linear time-invariant (LTI) stochastic dynamical systems with inputs. Such bounds are widespread in machine learning, and they are useful for characterizing the predictive power of models learned from finitely many data points. In particular, the bound derived in this paper relates future average prediction errors to the prediction error generated by the model on the data used for learning. In turn, this allows us to provide finite-sample error bounds for a wide class of learning/system identification algorithms. Furthermore, as LTI systems are a sub-class of recurrent neural networks (RNNs), these error bounds could be a first step towards PAC-Bayesian bounds for RNNs.

artificial intelligence, bayesian inference, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2303.16816

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Denmark > North Jutland > Aalborg (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(2 more...)

Genre: Research Report (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
