AITopics | weight layer

Collaborating Authors

weight layer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A. Additional Background on Bayesian neural networks and variational inference. Consider a training set comprising N input-output pairs, D = { x

Neural Information Processing Systems | Aug-16-2025, 09:18:55 GMT

… Neal, 2012, Blundell et al., 2015], and (iii) using structured variational approximations that can potentially capture weight correlations in the posterior [Louizos and Welling, 2016, Zhang et al., … We also vary the number of inducing points we afford each kernel. The main difference in the local model is the dependence of weights on inputs.

approximation, kernel, neural network, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

Pruning Deep Neural Networks via a Combination of the Marchenko-Pastur Distribution and Regularization

Berlyand, Leonid, Bourdais, Theo, Owhadi, Houman, Shmalo, Yitzchak

arXiv.org Artificial Intelligence | Mar-5-2025

Deep neural networks (DNNs) have brought significant advancements in various applications in recent years, such as image recognition, speech recognition, and natural language processing. In particular, Vision Transformers (ViTs) have emerged as a powerful class of models in the field of deep learning for image classification. In this work, we propose a novel Random Matrix Theory (RMT)-based method for pruning pre-trained DNNs, based on the sparsification of weights and singular vectors, and apply it to ViTs. RMT provides a robust framework to analyze the statistical properties of large matrices, which has been shown to be crucial for understanding and optimizing the performance of DNNs. We demonstrate that our RMT-based pruning can be used to reduce the number of parameters of ViT models (trained on ImageNet) by 30-50% with less than 1% loss in accuracy. To our knowledge, this represents the state of the art in pruning for these ViT models. Furthermore, we provide a rigorous mathematical underpinning of the above numerical studies: we prove a theorem for fully connected DNNs, and other more general DNN structures, describing how the randomness in the weight matrices of a DNN decreases as the weights approach a local or global minimum during training. We verify this theorem through numerical experiments on fully connected DNNs, providing empirical support for our theoretical findings. Moreover, we prove a theorem that describes how DNN loss decreases as we remove randomness in the weight layers, and show that the decrease in loss depends monotonically on the amount of randomness that we remove. Our results also provide significant RMT-based insights into the role of regularization during training and pruning.
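
To make the RMT idea in this abstract more concrete, here is a minimal sketch that compares the singular values of a weight matrix with the Marchenko-Pastur bulk edge and keeps only those above it. It is a simplified stand-in and not the authors' algorithm: the function names, the assumption that the noise level `sigma` is known in advance, and the synthetic rank-one test matrix are all hypothetical.

```python
import numpy as np


def mp_singular_value_edge(sigma, n_rows, n_cols):
    """Approximate largest singular value of an n_rows x n_cols matrix whose
    entries are i.i.d. with mean 0 and standard deviation sigma."""
    return sigma * (np.sqrt(n_rows) + np.sqrt(n_cols))


def prune_by_mp_edge(weight, sigma):
    """Drop singular values at or below the Marchenko-Pastur edge.

    Returns the low-rank reconstruction and the number of singular values kept.
    Assumption: the caller supplies sigma; estimating it is out of scope here.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    keep = s > mp_singular_value_edge(sigma, *weight.shape)
    pruned = (u[:, keep] * s[keep]) @ vt[keep, :]
    return pruned, int(keep.sum())


def demo(seed=0, n_rows=200, n_cols=100, sigma=0.1, strength=20.0):
    """Synthetic check: a rank-one 'signal' buried in Gaussian noise."""
    rng = np.random.default_rng(seed)
    left = rng.standard_normal(n_rows)
    right = rng.standard_normal(n_cols)
    left /= np.linalg.norm(left)
    right /= np.linalg.norm(right)
    signal = strength * np.outer(left, right)
    noisy = signal + sigma * rng.standard_normal((n_rows, n_cols))
    pruned, kept = prune_by_mp_edge(noisy, sigma)
    # Frobenius distance to the clean signal, before and after pruning.
    return kept, np.linalg.norm(noisy - signal), np.linalg.norm(pruned - signal)
```

Calling `demo()` returns the number of retained singular values together with the error of the noisy and the pruned matrix relative to the clean signal. Storing the retained factors instead of the dense matrix is where a parameter saving would come from in this toy setting.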

matrix, pruning, singular value, (16 more...)

arXiv.org Artificial Intelligence

arXiv:2503.01922

Country:

North America > United States > Pennsylvania > Centre County > University Park (0.04)
North America > United States > California > Los Angeles County > Pasadena (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Qua$^2$SeDiMo: Quantifiable Quantization Sensitivity of Diffusion Models

Mills, Keith G., Salameh, Mohammad, Chen, Ruichen, Hassanpour, Negar, Lu, Wei, Niu, Di

arXiv.org Artificial Intelligence | Dec-19-2024

Diffusion Models (DM) have democratized AI image generation through an iterative denoising process. Quantization is a major technique to alleviate the inference cost and reduce the size of DM denoiser networks. However, as denoisers evolve from variants of convolutional U-Nets toward newer Transformer architectures, it is of growing importance to understand the quantization sensitivity of different weight layers, operations and architecture types to performance. In this work, we address this challenge with Qua$^2$SeDiMo, a mixed-precision Post-Training Quantization framework that generates explainable insights on the cost-effectiveness of various model weight quantization methods for different denoiser operation types and block structures. We leverage these insights to make high-quality mixed-precision quantization decisions for a myriad of diffusion models ranging from foundational U-Nets to state-of-the-art Transformers. As a result, Qua$^2$SeDiMo can construct 3.4-bit, 3.9-bit, 3.65-bit and 3.7-bit weight quantization on PixArt-${\alpha}$, PixArt-${\Sigma}$, Hunyuan-DiT and SDXL, respectively. We further pair our weight-quantization configurations with 6-bit activation quantization and outperform existing approaches in terms of quantitative metrics and generative image quality.
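
Fractional bit-widths such as 3.4 bits arise when layers are quantized at different precisions and the result is averaged over all weights. The sketch below illustrates that arithmetic with a basic min-max uniform quantizer. It is an assumption-laden toy, not the Qua$^2$SeDiMo framework: `quantize_uniform`, `average_bits`, and the layer sizes are hypothetical names and numbers chosen for illustration.

```python
def quantize_uniform(weights, bits):
    """Min-max uniform quantization of a list of floats, then dequantization.

    Each weight is snapped to one of 2**bits evenly spaced levels between
    the smallest and largest weight.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)  # a constant tensor needs no quantization grid
    scale = (hi - lo) / (2 ** bits - 1)
    codes = [round((w - lo) / scale) for w in weights]
    return [lo + c * scale for c in codes]


def average_bits(layer_sizes, layer_bits):
    """Parameter-weighted mean bit-width of a mixed-precision configuration."""
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total


def max_abs_error(original, quantized):
    """Largest absolute difference between two equally long lists."""
    return max(abs(a - b) for a, b in zip(original, quantized))
```

For example, a model with 600 weights held at 3 bits and 400 weights held at 4 bits averages 3.4 bits per weight under `average_bits([600, 400], [3, 4])`.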

artificial intelligence, configuration, machine learning, (19 more...)

arXiv.org Artificial Intelligence

arXiv:2412.14628

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > Puerto Rico > San Juan > San Juan (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Pyramid Vector Quantization for LLMs

van der Ouderaa, Tycho F. A., Croci, Maximilian L., Hilmkil, Agrin, Hensman, James

arXiv.org Artificial Intelligence | Dec-4-2024

Recent works on compression of large language models (LLMs) using quantization have considered reparameterizing the architecture such that weights are distributed on the sphere. This demonstrably improves the ability to quantize by increasing the mathematical notion of coherence, resulting in fewer weight outliers without affecting the network output. In this work, we aim to further exploit this spherical geometry of the weights when performing quantization by considering Pyramid Vector Quantization (PVQ) for large language models. Arranging points evenly on the sphere is notoriously difficult, especially in high dimensions, and even when approximate solutions exist, representing points explicitly in a codebook is typically not feasible due to its additional memory cost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for efficient encoding and decoding without requiring an explicit codebook in memory. To obtain a practical algorithm, we propose to combine PVQ with scale quantization, for which we derive theoretically optimal quantizations under empirically verified assumptions. Further, we extend pyramid vector quantization to use Hessian information to minimize quantization error under expected feature activations, instead of only relying on weight magnitudes. Experimentally, we achieve state-of-the-art quantization performance with a Pareto-optimal trade-off between performance and bits per weight and bits per activation, relative to the compared methods. For weight-only quantization, we find that we can quantize a Llama-3 70B model to 3.25 bits per weight and retain 98% accuracy on downstream tasks.
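
The codebook-free encoding mentioned in this abstract can be illustrated with the classic PVQ construction: a vector is mapped to an integer vector whose absolute values sum to a fixed pulse count `k`, and decoding renormalizes that integer point. The sketch below is a generic textbook-style version under stated assumptions, not the paper's implementation: the function names are hypothetical, the rounding rule (floor, then give leftover pulses to the largest remainders) is one simple choice, and the scale and Hessian extensions are omitted.

```python
import math


def pvq_encode(x, k):
    """Map x to integers y with sum(|y_i|) == k, preserving signs.

    The vector is first scaled so its L1 norm equals k, each magnitude is
    rounded down, and the leftover pulses go to the largest remainders.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    l1 = sum(abs(v) for v in x)
    if l1 == 0:
        raise ValueError("cannot encode the zero vector")
    scaled = [abs(v) * k / l1 for v in x]
    pulses = [math.floor(s) for s in scaled]
    remaining = k - sum(pulses)
    by_remainder = sorted(range(len(x)), key=lambda i: scaled[i] - pulses[i],
                          reverse=True)
    for i in by_remainder[:remaining]:
        pulses[i] += 1
    return [int(math.copysign(p, v)) if p else 0 for p, v in zip(pulses, x)]


def pvq_decode(pulses):
    """Project the integer lattice point back onto the unit L2 sphere."""
    norm = math.sqrt(sum(p * p for p in pulses))
    return [p / norm for p in pulses]
```

Because every codeword is determined by its integer coordinates, no lookup table has to be stored. A larger `k` gives a finer lattice at the cost of more bits per vector.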

empirical, groupsize, weight layer, (15 more...)

arXiv.org Artificial Intelligence

arXiv:2410.16926

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

On the Approximation of Bi-Lipschitz Maps by Invertible Neural Networks

Jin, Bangti, Zhou, Zehui, Zou, Jun

arXiv.org Artificial Intelligence | Aug-18-2023

Invertible neural networks (INNs) are a class of neural network (NN) architectures with invertibility by design, via special invertible layers called the flow layers. INNs often enjoy tractable numerical algorithms to compute the inverse map and Jacobian determinant, e.g., with explicit inversion formulas. These distinct features have made them very attractive for a variety of machine learning tasks, e.g., generative modeling [16, 31, 29], probabilistic modeling [38, 17, 23, 6], solving inverse problems [2, 1, 3], modeling nonlinear dynamics [9] and point cloud generation [44]. There are several different classes of INNs, including invertible residual networks (iResNet) [7, 43], neural ordinary differential equations (NODEs) [11, 13, 18] and coupling-based neural networks [16, 17, 25, 31, 2]. For iResNet, Behrmann et al. [7] leveraged the viewpoint of ResNets as an Euler discretization of ODEs and proved that the standard ResNet architecture can be made invertible by adding a simple normalization step to control the Lipschitz constant of the NN during training. The inverse is not available in closed form but can be obtained through a fixed-point iteration. Chen et al. [13] proposed using black-box ODE solvers as a model component, and developed a class of new models, i.e., NODEs, for time-series modeling, supervised learning, and density estimation. NODEs indirectly model an invertible function by transforming an input vector through an ordinary differential equation (ODE). Dupont and Doucet [18] introduced a class of more expressive and empirically stable models, augmented neural ODEs (ANODEs), which have a lower computational cost.
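
The fixed-point inversion described for iResNet can be sketched in a few lines. For a residual map y = x + g(x) where g has Lipschitz constant below 1, the iteration x <- y - g(x) is a contraction and converges to the unique preimage. The scalar branch `0.5 * tanh(x)` below is a hypothetical stand-in chosen because its Lipschitz constant is 0.5. It is not taken from the paper.

```python
import math


def residual_branch(x):
    """Toy residual branch g with Lipschitz constant 0.5 (assumed for the demo)."""
    return 0.5 * math.tanh(x)


def forward(x):
    """Invertible residual map y = x + g(x)."""
    return x + residual_branch(x)


def inverse(y, num_iters=100):
    """Recover x from y with the fixed-point iteration x <- y - g(x).

    The error shrinks by at least the Lipschitz constant of g per step,
    so 100 iterations are far more than needed for this branch.
    """
    x = y
    for _ in range(num_iters):
        x = y - residual_branch(x)
    return x
```

With a Lipschitz constant of 0.5, each iteration at least halves the distance to the true preimage, which is why controlling that constant during training matters for invertibility.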

artificial intelligence, lemma 2, machine learning, (18 more...)

arXiv.org Artificial Intelligence

arXiv:2308.09367

Country:

Asia > China > Hong Kong (0.04)
North America > United States > New York (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

How to Use Dropout Correctly on Residual Networks with Batch Normalization

Kim, Bum Jun, Choi, Hyeyeon, Jang, Hyeonah, Lee, Donggeon, Kim, Sang Woo

arXiv.org Artificial Intelligence | Feb-13-2023

For the stable optimization of deep neural networks, regularization methods such as dropout and batch normalization have been used in various tasks. Nevertheless, the correct position to apply dropout has rarely been discussed, and different positions have been employed depending on the practitioners. In this study, we investigate the correct position to apply dropout. We demonstrate that for a residual network with batch normalization, applying dropout at certain positions increases the performance, whereas applying dropout at other positions decreases the performance. Based on theoretical analysis, we provide the following guideline for the correct position to apply dropout: apply one dropout after the last batch normalization but before the last weight layer in the residual branch. We provide detailed theoretical explanations to support this claim and demonstrate them through module tests. In addition, we investigate the correct position of dropout in the head that produces the final prediction. Although the current consensus is to apply dropout after global average pooling, we prove that applying dropout before global average pooling leads to a more stable output. The proposed guidelines are validated through experiments using different datasets and models.
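
The placement guideline in this abstract (one dropout after the last batch normalization but before the last weight layer in the residual branch) can be sketched with a small NumPy forward pass. This is an illustrative toy under several assumptions that the abstract does not fix: dense layers instead of convolutions, a pre-activation ordering, dropout applied after the ReLU that follows the last normalization, and batch normalization without learnable scale and shift. All function names are hypothetical.

```python
import numpy as np


def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch (no learnable scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)


def dropout(x, p, rng, training):
    """Inverted dropout with drop probability p; identity at inference."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)


def residual_block(x, w1, w2, p, rng, training=True):
    """Residual branch: BN, ReLU, weight, BN, ReLU, dropout, weight."""
    h = np.maximum(batch_norm(x), 0.0)   # first BN followed by ReLU
    h = h @ w1                           # first weight layer
    h = np.maximum(batch_norm(h), 0.0)   # last BN followed by ReLU
    h = dropout(h, p, rng, training)     # the single dropout in the branch
    h = h @ w2                           # last weight layer
    return x + h                         # skip connection
```

The point of the sketch is only the ordering inside `residual_block`: no normalization layer sits between the dropout and the end of the branch.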

artificial intelligence, dropout, machine learning, (18 more...)

arXiv.org Artificial Intelligence

arXiv:2302.06112

Country: Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Chen, Guangyong, Chen, Pengfei, Shi, Yujun, Hsieh, Chang-Yu, Liao, Benben, Zhang, Shengyu

arXiv.org Machine Learning | May-14-2019

In this work, we propose a novel technique to boost the training efficiency of a neural network. Our work is based on the idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent. However, determining independent components is a computationally intensive task. To overcome this challenge, we propose to implement an IC layer by combining two popular techniques, Batch Normalization and Dropout, in a new manner. We rigorously prove that Dropout can quadratically reduce the mutual information and linearly reduce the correlation between any pair of neurons with respect to the dropout layer parameter $p$. As demonstrated experimentally, the IC layer consistently outperforms the baseline approaches with a more stable training process, faster convergence speed and a better convergence limit on the CIFAR10/100 and ILSVRC2012 datasets. The implementation of our IC layer makes us rethink the common practices in the design of neural networks. For example, we should not place Batch Normalization before ReLU, since the non-negative responses of ReLU will cause the weight layer to be updated in a suboptimal way, and we can achieve better performance by combining Batch Normalization and Dropout together as an IC layer.
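
The IC layer described here (Batch Normalization followed by Dropout, placed before each weight layer and after the activation) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the names `ic_layer` and `dense_with_ic` are hypothetical, batch normalization has no learnable parameters, and `p` is treated as the drop probability, which is an assumption about the paper's convention.

```python
import numpy as np


def ic_layer(x, p, rng, training=True, eps=1e-5):
    """Batch normalization followed by inverted dropout."""
    normed = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    if training and p > 0.0:
        mask = rng.random(normed.shape) >= p
        normed = normed * mask / (1.0 - p)
    return normed


def dense_with_ic(x, weight, p, rng, training=True):
    """ReLU first, then the IC layer, then the weight layer."""
    activated = np.maximum(x, 0.0)
    return ic_layer(activated, p, rng, training) @ weight
```

In this ordering the weight layer receives zero-mean inputs rather than the non-negative outputs of ReLU, which is the motivation the abstract gives for not placing Batch Normalization before ReLU.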

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Machine Learning

arXiv:1905.05928

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Building Efficient Deep Neural Networks with Unitary Group Convolutions

Zhao, Ritchie, Hu, Yuwei, Dotzel, Jordan, De Sa, Christopher, Zhang, Zhiru

arXiv.org Machine Learning | Nov-19-2018

We propose unitary group convolutions (UGConvs), a building block for CNNs that composes a group convolution with unitary transforms in feature space to learn a richer set of representations than group convolution alone. UGConvs generalize two disparate ideas in CNN architecture, channel shuffling (e.g., ShuffleNet) and block-circulant networks (e.g., CirCNN), and provide unifying insights that lead to a deeper understanding of each technique. We experimentally demonstrate that dense unitary transforms can outperform channel shuffling in DNN accuracy. On the other hand, different dense transforms exhibit comparable accuracy. Based on these observations, we propose HadaNet, a UGConv network using Hadamard transforms. HadaNets achieve similar accuracy to circulant networks with lower computational complexity, and better accuracy than ShuffleNets with the same number of parameters and floating-point multiplies.
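
The Hadamard transform that HadaNet uses as its unitary channel-mixing step can be computed with additions and subtractions only, plus one final scaling. The sketch below is a standard fast Walsh-Hadamard transform over a list of channel values. It shows only the transform and not the group convolution it would be composed with, and the function name is a hypothetical choice.

```python
import math


def hadamard_transform(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list.

    The butterfly stages use only additions and subtractions. The final
    1/sqrt(n) scaling makes the transform unitary and its own inverse.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]
```

Applying `hadamard_transform` twice returns the original channel values, and the Euclidean norm is preserved, which are the two properties that make it a drop-in unitary mixing step in this toy setting.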

artificial intelligence, convolution, machine learning, (17 more...)

arXiv.org Machine Learning

arXiv:1811.07755

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

On the Decision Boundary of Deep Neural Networks

Li, Yu, Ding, Lizhong, Gao, Xin

arXiv.org Artificial Intelligence | Aug-23-2018

While deep learning models and techniques have achieved great empirical success, our understanding of the source of success in many aspects remains very limited. In an attempt to bridge the gap, we investigate the decision boundary of a production deep learning architecture with weak assumptions on both the training data and the model. We demonstrate, both theoretically and empirically, that the last weight layer of a neural network converges to a linear SVM trained on the output of the last hidden layer, for both the binary case and the multi-class case with the commonly used cross-entropy loss. Furthermore, we show empirically that training a neural network as a whole, instead of only fine-tuning the last weight layer, may result in a better bias constant for the last weight layer, which is important for generalization. In addition to facilitating the understanding of deep learning, our result can be helpful for solving a broad range of practical problems in deep learning, such as catastrophic forgetting and adversarial attacks. The experiment code is available at https://github.com/lykaust15/NN_decision_boundary
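
The convergence claim in this abstract can be illustrated in the simplest possible setting: a single linear layer without bias, trained by gradient descent with the logistic (binary cross-entropy) loss on fixed, linearly separable features. The toy below is not the authors' experiment. The four data points are hypothetical, and the hard-margin SVM direction (1, 1) was worked out by hand for this data set: minimizing the norm of w subject to w·(1, 1) >= 1 and w·(3, 0) >= 1 gives w = (0.5, 0.5).

```python
import math

# Hypothetical "last hidden layer" features with labels +1 / -1.
DATA = [((1.0, 1.0), 1), ((3.0, 0.0), 1), ((-1.0, -1.0), -1), ((-3.0, 0.0), -1)]
SVM_DIRECTION = (1.0, 1.0)  # hand-derived max-margin direction for DATA


def train_logistic(data, steps=20000, lr=0.1):
    """Plain gradient descent on the summed logistic loss, starting from zero."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for (x1, x2), y in data:
            margin = y * (w[0] * x1 + w[1] * x2)
            # Gradient of log(1 + exp(-margin)) with respect to w.
            coeff = -y / (1.0 + math.exp(margin))
            grad[0] += coeff * x1
            grad[1] += coeff * x2
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w


def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))
```

Comparing `cosine(train_logistic(DATA), SVM_DIRECTION)` after few and after many steps shows the learned direction drifting toward the max-margin one. The drift is slow, so the norm of the weights keeps growing while the direction stabilizes.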

artificial intelligence, machine learning, neural network, (15 more...)

arXiv.org Artificial Intelligence

arXiv:1808.05385

Country: Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (0.89)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
