- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > Canada (0.04)
Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias
Hu, Yuanzhe, Goel, Kinshuk, Killiakov, Vlad, Yang, Yaoqing
Diagnosing deep neural networks (DNNs) by analyzing the eigenspectrum of their weights has been an active area of research in recent years. One of the main approaches involves measuring the heavy-tailedness of the empirical spectral densities (ESDs) of weight matrices. This analysis has been shown to provide insights that help diagnose whether a model is well-trained or undertrained, and has been used to guide training methods involving layer-wise hyperparameter assignment. In this paper, we address an often-overlooked challenge in estimating the heavy-tailedness of these ESDs: the impact of the aspect ratio of weight matrices. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating the heavy-tailedness of ESDs, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavy-tailedness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that this method effectively mitigates the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with state-of-the-art methods.
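The subsampling idea lends itself to a short sketch. The Python snippet below is a minimal illustration, not the authors' implementation: the function names, the pooling of all subsampled eigenvalues into one averaged ESD, and the Hill-style tail fit are assumptions.

```python
# Minimal sketch of fixed-aspect-ratio subsampling; `farms_tail_exponent` and
# the Hill-style fit are illustrative assumptions, not the paper's code.
import numpy as np

def hill_alpha(eigs, k=50):
    """Hill estimator of the power-law tail exponent from the k largest eigenvalues."""
    tail = np.sort(eigs)[-k:]                     # ascending; tail[0] is the threshold
    return 1.0 + (k - 1) / np.sum(np.log(tail[1:] / tail[0]))

def farms_tail_exponent(W, aspect=1.0, n_samples=16, seed=0):
    """Pool the ESDs of fixed-aspect-ratio column subsamples, then fit the tail."""
    rng = np.random.default_rng(seed)
    if W.shape[0] > W.shape[1]:                   # work with the wide orientation
        W = W.T
    n, m = W.shape
    sub_m = min(int(n / aspect), m)               # width giving the target aspect ratio
    eigs = []
    for _ in range(n_samples):
        cols = rng.choice(m, size=sub_m, replace=False)
        S = W[:, cols]
        eigs.append(np.linalg.svd(S, compute_uv=False) ** 2 / sub_m)  # ESD of S S^T / sub_m
    pooled = np.concatenate(eigs)                 # "average" ESD over subsamples
    return hill_alpha(pooled, k=min(50, pooled.size // 2))

# usage: alpha = farms_tail_exponent(np.random.randn(512, 2048), aspect=1.0)
```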
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
Grokking and Generalization Collapse: Insights from \texttt{HTSR} theory
Prakash, Hari K., Martin, Charles H.
We study the well-known grokking phenomenon in neural networks (NNs) using a 3-layer MLP trained on a 1k-sample subset of MNIST, with and without weight decay, and discover a novel third phase -- \emph{anti-grokking} -- that occurs very late in training and resembles, but is distinct from, the familiar \emph{pre-grokking} phase: test accuracy collapses while training accuracy stays perfect. This late-stage collapse is not detected by other proposed grokking progress measures. Leveraging Heavy-Tailed Self-Regularization (HTSR) through the open-source WeightWatcher tool, we show that the HTSR layer quality metric $\alpha$ alone delineates all three phases, whereas the best competing metrics detect only the first two. The \emph{anti-grokking} phase is revealed by training for $10^7$ and is invariably heralded by $\alpha < 2$ and the appearance of \emph{Correlation Traps} -- outlier singular values in the randomized layer weight matrices that make the layer weight matrix atypical and signal overfitting of the training set. Such traps are verified by visual inspection of the layer-wise empirical spectral densities and by Kolmogorov--Smirnov tests on the randomized spectra. Comparative metrics, including activation sparsity, absolute weight entropy, circuit complexity, and $\ell^2$ weight norms, track pre-grokking and grokking but fail to distinguish grokking from anti-grokking. This discovery provides a way to measure overfitting and generalization collapse without direct access to the test data. These results strengthen the claim that the HTSR $\alpha$ provides a universal layer-convergence target at $\alpha \approx 2$ and underscore the value of the HTSR $\alpha$ metric as a measure of generalization.
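Two of the diagnostics mentioned above, the HTSR layer quality metric $\alpha$ and the correlation-trap check on randomized weights, can be sketched as follows. This is an illustrative reimplementation with assumed thresholds and variable names, not WeightWatcher's actual code.

```python
# Illustrative reimplementation of the two diagnostics; thresholds and names are
# assumptions, not WeightWatcher's code.
import numpy as np

def esd(W):
    """Empirical spectral density: eigenvalues of the correlation matrix of W."""
    return np.linalg.svd(W, compute_uv=False) ** 2 / max(W.shape)

def htsr_alpha(W, xmin_frac=0.5):
    """Continuous-MLE power-law exponent fitted to the upper tail of the ESD."""
    eigs = np.sort(esd(W))
    xmin = eigs[int(len(eigs) * xmin_frac)]
    tail = eigs[eigs >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

def correlation_trap(W, margin=1.5, seed=0):
    """After element-wise shuffling of W, check whether an outlier eigenvalue
    still sticks out above the bulk of the randomized spectrum."""
    rng = np.random.default_rng(seed)
    W_rand = rng.permutation(W.ravel()).reshape(W.shape)
    lam = np.sort(esd(W_rand))
    return lam[-1] > margin * lam[-2]             # margin is an assumption

# usage: htsr_alpha(W) < 2 together with correlation_trap(W) == True would flag
# the anti-grokking regime described above.
```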
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
Using Pre-trained LLMs for Multivariate Time Series Forecasting
Wolff, Malcolm L., Yang, Shenghao, Torkkola, Kari, Mahoney, Michael W.
Time series forecasting refers to a class of techniques for the prediction of events through a sequence of time, typically to inform strategic or tactical decision making. Going beyond strategic forecasting problems (e.g., those commonly used historically in statistics and econometrics [1]), operational forecasting problems are increasingly important. For example, at large internet retail companies, this includes demand forecasting for products at an online retailer, workforce cohorts of a company in its locations, compute capacity needs per region and server type, etc.; in scientific machine learning, this includes prediction of extreme events in, e.g., climate and weather models; and so on. In particular, MQCNN [2] and MQTransformer [3] are state-of-the-art (SOTA) neural network (NN) based multivariate time series forecasting models that are used to predict future demand at the product level for hundreds of millions of products.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Monaco (0.04)
- Asia > Middle East > Jordan (0.04)
- Government > Military (0.54)
- Retail > Online (0.34)
- Information Technology > Modeling & Simulation (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
How Does Data Diversity Shape the Weight Landscape of Neural Networks?
Ba, Yang, Mancenido, Michelle V., Pan, Rong
To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay ($L_2$ regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models that are fine-tuned using these techniques, but with different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into the real input data, resulting in better performance on out-of-distribution test instances.
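One way to make the eigenvalue comparison described above concrete is to measure how much of a layer's ESD escapes the Marchenko-Pastur bulk predicted for an i.i.d. random matrix of the same shape and variance. The sketch below is our own illustration of that RMT baseline, not the paper's analysis pipeline.

```python
# Our own illustration of an RMT baseline: how much of a layer's ESD escapes the
# Marchenko-Pastur bulk of an i.i.d. matrix with matched shape and variance.
import numpy as np

def mp_upper_edge(W):
    """Upper Marchenko-Pastur bulk edge for an i.i.d. matrix matching W."""
    n, m = W.shape
    q = min(n, m) / max(n, m)
    return np.var(W) * (1.0 + np.sqrt(q)) ** 2

def fraction_outside_bulk(W):
    """Fraction of ESD mass above the MP edge -- a crude score of learned structure."""
    eigs = np.linalg.svd(W, compute_uv=False) ** 2 / max(W.shape)
    return float(np.mean(eigs > mp_upper_edge(W)))

# usage: compare fraction_outside_bulk(layer_weights) across models fine-tuned
# with dropout, data augmentation, or generative synthetic data.
```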
- North America > United States > Arizona (0.04)
- Asia > Singapore (0.04)
- Asia > Indonesia > Bali (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
Model Balancing Helps Low-data Training and Fine-tuning
Liu, Zihang, Hu, Yuanzhe, Pang, Tianyu, Zhou, Yefan, Ren, Pu, Yang, Yaoqing
Recent advances in foundation models have emphasized the need to align pre-trained models with specialized domains using small, curated datasets. Studies on these foundation models underscore the importance of low-data training and fine-tuning. This topic, well-known in natural language processing (NLP), has also gained increasing attention in the emerging field of scientific machine learning (SciML). To address the limitations of low-data training and fine-tuning, we draw inspiration from Heavy-Tailed Self-Regularization (HT-SR) theory, analyzing the shape of empirical spectral densities (ESDs) and revealing an imbalance in training quality across different model layers. To mitigate this issue, we adapt a recently proposed layer-wise learning rate scheduler, TempBalance, which effectively balances training quality across layers and enhances low-data training and fine-tuning for both NLP and SciML tasks. Notably, TempBalance demonstrates increasing performance gains as the amount of available tuning data decreases. Comparative analyses further highlight the effectiveness of TempBalance and its adaptability as an "add-on" method for improving model performance.
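A minimal PyTorch sketch of the layer-wise learning-rate balancing idea is given below. The per-layer $\alpha$ estimate and, in particular, the linear mapping from $\alpha$ to a learning-rate multiplier are assumptions made for illustration; they are not the exact TempBalance schedule.

```python
# Illustrative PyTorch sketch: per-layer learning rates keyed to a per-layer ESD
# tail exponent. The linear alpha-to-LR mapping is an assumption, not the exact
# TempBalance schedule.
import numpy as np
import torch

def layer_alpha(weight: torch.Tensor) -> float:
    """Hill-style power-law tail exponent of a layer's ESD."""
    W = weight.detach().cpu().numpy().reshape(weight.shape[0], -1)
    eigs = np.sort(np.linalg.svd(W, compute_uv=False) ** 2 / max(W.shape))
    k = max(2, len(eigs) // 4)
    tail = eigs[-k:]
    return 1.0 + (k - 1) / np.sum(np.log(tail[1:] / tail[0]))

def balanced_param_groups(model: torch.nn.Module, base_lr=1e-3, spread=0.5):
    """Scale each Linear/Conv layer's LR by its alpha relative to the network-wide
    mean, so that (by this assumed rule) less heavy-tailed layers get larger steps."""
    groups, alphas = [], []
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            alphas.append(layer_alpha(module.weight))
            groups.append(list(module.parameters()))
    a = np.array(alphas)
    scale = np.clip(1.0 + spread * (a - a.mean()) / (a.std() + 1e-8), 0.1, None)
    return [{"params": p, "lr": base_lr * float(s)} for p, s in zip(groups, scale)]

# usage: optimizer = torch.optim.SGD(balanced_param_groups(model), momentum=0.9)
```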
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Crafting Heavy-Tails in Weight Matrix Spectrum without Gradient Noise
Kothapalli, Vignesh, Pang, Tianyu, Deng, Shenyang, Liu, Zongmin, Yang, Yaoqing
Modern training strategies of deep neural networks (NNs) tend to induce heavy-tailed (HT) spectra in layer weights. Extensive efforts to study this phenomenon have found that NNs with HT weight spectra tend to generalize well. A prevailing notion for the occurrence of such HT spectra attributes gradient noise during training as a key contributing factor. Our work shows that gradient noise is unnecessary for generating HT weight spectra: two-layer NNs trained with full-batch Gradient Descent/Adam can exhibit HT spectra in their weights after finite training steps. To this end, we first identify the scale of the learning rate at which one step of full-batch Adam can lead to feature learning in the shallow NN, particularly when learning a single-index teacher model. Next, we show that multiple optimizer steps with such (sufficiently) large learning rates can transition the bulk of the weight spectrum into an HT distribution. To understand this behavior, we present a novel perspective based on the singular vectors of the weight matrices and optimizer updates. We show that the HT weight spectrum originates from the `spike', which is generated from feature learning and interacts with the main bulk to generate an HT spectrum. Finally, we analyze the correlations between the HT weight spectra and generalization after multiple optimizer updates with varying learning rates.
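A small full-batch experiment in the spirit of this setup can be written in a few lines of PyTorch. The teacher nonlinearity, width, learning rate, and step count below are our choices for illustration, not the paper's exact configuration.

```python
# Full-batch, noise-free training of a two-layer ReLU network on a single-index
# teacher (PyTorch); hyperparameters are illustrative assumptions.
import torch

torch.manual_seed(0)
d, width, n = 128, 256, 4096
w_star = torch.randn(d) / d ** 0.5
X = torch.randn(n, d)
y = torch.relu(X @ w_star)                                  # single-index teacher

W1 = (torch.randn(width, d) / d ** 0.5).requires_grad_()    # trainable first layer
a = torch.randn(width, 1) / width ** 0.5                    # fixed second layer

opt = torch.optim.Adam([W1], lr=1e-1)                       # deliberately large step size
for step in range(200):                                     # full batch: no gradient noise
    opt.zero_grad()
    pred = (torch.relu(X @ W1.T) @ a).squeeze(-1)
    loss = (pred - y).pow(2).mean()
    loss.backward()
    opt.step()

svals = torch.linalg.svdvals(W1.detach())                   # inspect spike + heavy-tailed bulk
print(svals[:5], svals[5:].mean())
```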
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East > Jordan (0.04)
Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization
Wang, Yutong, Sonthalia, Rishi, Hu, Wei
We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|{\boldsymbol{\beta}}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha >1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
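The norm-growth claim can be probed numerically with a small experiment: draw Gaussian data whose covariance spectrum decays as $\lambda_i \sim i^{-\alpha}$, tune a ridge penalty until the training error sits near a fixed $\tau$ below the noise floor, and track the squared norm of the resulting near-interpolator as $n$ grows. The sketch below is our illustration under those assumed settings.

```python
# Our numerical probe of the norm-growth claim under assumed settings: Gaussian
# data with covariance spectrum lambda_i ~ i^{-alpha}, ridge level tuned so that
# the training error sits near a fixed tau below the noise floor sigma^2.
import numpy as np

rng = np.random.default_rng(0)
alpha, d, tau, sigma = 1.5, 1000, 0.05, 0.5

def near_interpolator_sq_norm(n):
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)       # eigendecay of Sigma
    X = rng.standard_normal((n, d)) * np.sqrt(lam)           # rows ~ N(0, Sigma)
    beta_star = rng.standard_normal(d) / np.sqrt(d)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    for reg in np.geomspace(1e-8, 1e2, 40):                  # sweep the ridge penalty upward
        beta = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
        if np.mean((X @ beta - y) ** 2) >= tau:              # first beta with train error ~ tau
            return float(np.sum(beta ** 2))
    return float(np.sum(beta ** 2))

for n in (100, 200, 400, 800):
    print(n, near_interpolator_sq_norm(n))                   # squared norm should grow rapidly with n
```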
- North America > United States > Michigan (0.04)
- Europe > Spain (0.04)
Taming Mode Collapse in Score Distillation for Text-to-3D Generation
Wang, Peihao, Xu, Dejia, Fan, Zhiwen, Wang, Dilin, Mohan, Sreyas, Iandola, Forrest, Ranjan, Rakesh, Li, Yilei, Liu, Qiang, Wang, Zhangyang, Chandra, Vikas
Despite the remarkable performance of score distillation in text-to-3D generation, such techniques notoriously suffer from view inconsistency issues, also known as the "Janus" artifact, where the generated objects fake each view with multiple front faces. Although empirically effective methods have approached this problem via score debiasing or prompt engineering, a more rigorous perspective to explain and tackle this problem remains elusive. In this paper, we reveal that existing score distillation-based text-to-3D generation frameworks degenerate to maximum likelihood seeking on each view independently and thus suffer from the mode collapse problem, manifesting as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective, which is applied to the distribution of rendered images. Maximizing the entropy encourages diversity among different views in generated 3D assets, thereby mitigating the Janus problem. Based on this new objective, we derive a new update rule for 3D score distillation, dubbed Entropic Score Distillation (ESD). We theoretically reveal that ESD can be simplified and implemented by just adopting the classifier-free guidance trick upon variational score distillation. Although ESD is embarrassingly straightforward, our extensive experiments demonstrate that it can be an effective treatment for Janus artifacts in score distillation.
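The classifier-free guidance trick referenced in the last sentence amounts to blending a conditional and an unconditional noise/score prediction before forming the distillation gradient. The snippet below is a schematic of that blending only; the weighting convention, the function names, and the `unet` call in the usage comment are assumptions, not the paper's exact ESD update.

```python
# Schematic of classifier-free-guidance-style score blending inside a
# distillation update; names and weights are illustrative assumptions.
import torch

def cfg_blend(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: extrapolate from the unconditional toward the
    conditional noise prediction with guidance weight w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def distillation_signal(eps_blend: torch.Tensor, eps_ref: torch.Tensor, t_weight: float) -> torch.Tensor:
    """Schematic score-distillation residual on a rendered image: the gradient
    signal is proportional to the guided prediction minus a reference."""
    return t_weight * (eps_blend - eps_ref)

# usage sketch (hypothetical `unet`, embeddings, and noisy render `x_t`):
# eps = cfg_blend(unet(x_t, t, text_emb), unet(x_t, t, null_emb), w=7.5)
# grad = distillation_signal(eps, noise, t_weight=1.0)
```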
- North America > United States > Texas > Travis County > Austin (0.04)
- Europe > United Kingdom > England (0.04)
- Asia > Japan > Honshū > Chūbu > Nagano Prefecture > Nagano (0.04)