AITopics | optimizer adam

6 Appendix

Neural Information Processing SystemsFeb-10-2026, 07:16:48 GMT

We observe that for the self-attention layers, the correlation of weights for the same head is stronger. Additionally, the best grouping might depend on the type of the layer (e.g., key, query, value, or To simplify the implementation, we treat all the different kernels in the self-attention as a type of fully-connected layer. We down-sample along each dimension to make the computation feasible. To relate with the Frobenius norm, we compute the square of each element and normalize the value. In Figure 5, we show the approximation error comparison for different approximation methods.

artificial intelligence, dimension, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.55)
Information Technology > Artificial Intelligence > Machine Learning (0.49)

Add feedback

a1d4c20b182ad7137ab3606f0e3fc8a4-Supplemental.pdf

Neural Information Processing SystemsFeb-9-2026, 15:16:07 GMT

optimizer adam, padding, stride, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.90)

Add feedback

8d9a6e908ed2b731fb96151d9bb94d49-Supplemental.pdf

Neural Information Processing SystemsNov-15-2025, 01:24:06 GMT

agent, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.31)

Add feedback

6 Appendix

Neural Information Processing SystemsOct-8-2025, 11:18:30 GMT

We observe that for the self-attention layers, the correlation of weights for the same head is stronger. Additionally, the best grouping might depend on the type of the layer (e.g., key, query, value, or To simplify the implementation, we treat all the different kernels in the self-attention as a type of fully-connected layer. We down-sample along each dimension to make the computation feasible. To relate with the Frobenius norm, we compute the square of each element and normalize the value. In Figure 5, we show the approximation error comparison for different approximation methods.

artificial intelligence, dimension, machine learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.55)
Information Technology > Artificial Intelligence > Machine Learning (0.49)

Add feedback

8d9a6e908ed2b731fb96151d9bb94d49-Supplemental.pdf

Neural Information Processing SystemsAug-15-2025, 21:08:13 GMT

agent, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.31)

Add feedback

Federated clustering (FC) is an essential extension of centralized clustering designed for the federated setting, wherein the challenge lies in constructing a global similarity measure without the need to share private data. Conventional approaches to FC typically adopt extensions of centralized methods, like K-means and fuzzy c-means. However, these methods are susceptible to non-independent-and-identically-distributed (non-IID) data among clients, leading to suboptimal performance, particularly with high-dimensional data. In this paper, we present a novel approach to address these limitations by proposing a Privacy-Preserving Federated Deep Clustering based on Generative Adversarial Networks (GANs). Each client trains a local generative adversarial network (GAN) locally and uploads the synthetic data to the server. The server applies a deep clustering network on the synthetic data to establish $k$ cluster centroids, which are then downloaded to the clients for cluster assignment. Theoretical analysis demonstrates that the GAN-generated samples, shared among clients, inherently uphold certain privacy guarantees, safeguarding the confidentiality of individual data. Furthermore, extensive experimental evaluations showcase the effectiveness and utility of our proposed method in achieving accurate and privacy-preserving federated clustering.

centroid, dataset, ppfc-gan, (12 more...)

arXiv.org Artificial Intelligence

2211.16965

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.87)
Information Technology > Data Science > Data Mining > Big Data (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.47)

Add feedback

Understanding and Improving Layer Normalization

Xu, Jingjing, Sun, Xu, Zhang, Zhiyuan, Zhao, Guangxiang, Lin, Junyang

arXiv.org Machine LearningNov-16-2019

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many of previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.

derivative, layernorm, normalization, (15 more...)

arXiv.org Machine Learning

1911.07013

Country: