AITopics | spectral norm

Collaborating Authors

spectral norm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm, and Applications

Hopkins, Samuel B., Tiegel, Stefan

arXiv.org Machine LearningMay-26-2026

The $2 \rightarrow q$ norm of a matrix $X \in \mathbb{R}^{n \times d}$ is defined as $\lVert X \rVert_{2 \rightarrow q} = \sup_{\lVert v \rVert_2 = 1} \lVert Xv \rVert_q$. We give polynomial-time multiplicative approximation algorithms for this norm when $q > 2$ (i.e. in the hypercontractive setting). This problem either directly captures or is closely related to long-standing open problems in combinatorial optimization and hardness of approximation (e.g. Small Set Expansion), quantum information (e.g. Best Separable State), and algorithmic statistics. Very little is known about what approximation factors we can achieve for this problem in polynomial time, even though such approximations have significant downstream consequences. Barak, Brandão, Harrow, Kelner, Steurer, and Zhou showed that no polynomial-time algorithm can achieve an approximation factor better than $2^{\sqrt{\log n}}$, assuming the Exponential Time Hypothesis (FOCS'12). On the other hand, a simple spectral algorithm gives a $d^{1/4}$-approximation as a baseline. We give, to the best of our knowledge, the first polynomial-time approximation algorithm beating this baseline by polynomial factors. For the important special case of $q = 4$ it achieves a $d^{1/8}$-approximation. All previous algorithms required additional assumptions on $X$, or only surpassed the baseline for small values of $n$. Moreover, we construct sum-of-squares certificates for the $2 \rightarrow q$ norm. This directly implies improved algorithms for robust mean and covariance estimation, robust regression, and clustering, when the data only satisfies a bound on its $q$-th moment.

algorithm, artificial intelligence, machine learning, (14 more...)

arXiv.org Machine Learning

2605.25303

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.34)

Add feedback

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Shi, Kexuan, Li, Hanxuan, Qiu, Zeju, Wen, Yandong, Buchholz, Simon, Liu, Weiyang

arXiv.org Machine LearningMay-13-2026

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.

arxiv preprint arxiv, large language model, machine learning, (20 more...)

arXiv.org Machine Learning

2605.12492

Country: Asia (0.28)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training

Neural Information Processing SystemsApr-27-2026, 19:21:51 GMT

Normalization techniques have become a basic component in modern convolutional neural networks (ConvNets). In particular, many recent works demonstrate that promoting the orthogonality of the weights helps train deep models and improve robustness. For ConvNets, most existing methods are based on penalizing or normalizing weight matrices derived from concatenating or flattening the convolutional kernels. These methods often destroy or ignore the benign convolutional structure of the kernels; therefore, they are often expensive or impractical for deep ConvNets. In contrast, we introduce a simple and efficient "Convolutional Normalization" (ConvNorm) method that can fully exploit the convolutional structure in the Fourier domain and serve as a simple plug-and-play module to be conveniently incorporated into any ConvNets. Our method is inspired by recent work on preconditioning methods for convolutional sparse coding and can effectively promote each layer's channel-wise isometry. Furthermore, we show that our ConvNorm can reduce the layerwise spectral norm of the weight matrices and hence improve the Lipschitzness of the network, leading to easier training and improved robustness for deep ConvNets. Applied to classification under noise corruptions and generative adversarial network (GAN), we show that the ConvNorm improves the robustness of common ConvNets such as ResNet and the performance of GAN. We verify our findings via numerical experiments on CIFAR and ImageNet.

artificial intelligence, convnet, machine learning, (19 more...)

Neural Information Processing Systems

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

The proposition makes use of the following observation: For the discriminator defined in (1), the norm of gradient for wt is upper bounded by k wtDθ(x)k F kxk LY

Neural Information Processing SystemsApr-25-2026, 21:09:05 GMT

The upper bound of gradient's Frobenius norm for spectrally-normalized discriminators follows directly. As lw(x) is a linear transformation, we have lcw(x) = c lw(x), and lw(cx) = c lw(x). Moreover, since ReLU and leaky ReLU is linear in R+ and R region, we have ai(cx) = c ai(x). In this section we discuss the gradients with respect the actual parameter wi. From Eq. (12) in [30] we know wtDθ(x) = A, we know that w0tDθ(x) F, otl(x)Dθ(x), and kotl (x)k have upper bounds. From Theorem 1.1 in [44] we know that if wt is initialized with i.i.d random variables from uniform or Gaussian distribution, E kwtkspis lower bounded away from zero at initialization. So k wtDθ(x)kF is upper bounded at initialization. Moreover, we observe empirically that kwtksp is usually increasing during training. Therefore, k wtDθ(x)kF is typically upper bounded during training as well. The following proposition states that spectral normalization also gives an upper bound on kHwi(Dθ)(x)ksp for networks with ReLU or leaky ReLU internal activations.

artificial intelligence, experiment, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

4ffb0d2ba92f664c2281970110a2e071-Paper.pdf

Neural Information Processing SystemsApr-25-2026, 21:09:00 GMT

artificial intelligence, deep learning, machine learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Genre: Research Report (0.46)

Industry: Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

Add feedback

Supplementary Material 7 Elements of Group and Representation Theory

Neural Information Processing SystemsApr-25-2026, 03:11:32 GMT

In this section, we provide a brief introduction to the concepts from Group Theory which we need in our derivations. A group is a pair (G,)containing a set Gand a binary operation: G G! G,(h,g) 7! h g which satisfies the group axioms: Associativity: 8a,b,c 2 Ga (b c)=( a b) c Identity: 9e 2 G: 8g 2 Gg e = e g = g Inverse: 8g 2 G 9g 1 2 G: g g 1 = g 1 g = e The operation is the group law of G. The inverse elements g 1 of an element g, and the identity element e are unique. In addition, if the group law is also commutative, the group G is an abelian group. To simplify the notation, we commonly write ab instead of a b. It is also common to denote the group (G,) just with the name of its underlying set G. The order of a group G is the cardinality of its set and is indicated by |G|. A group G is finite when |G|2 N, i.e., when it has a finite number of elements. A compact group is a group that is also a compact topological space with continuous group operation. Given a group G, its action on a set X is a map . A simple example of group action is the group law itself: G G! Gwhich defines an action of G on its own elements (X = G). Another important action is the one defined on signals overs the group G. Given a signal x: G! R, the action of an element g 2 G maps x 7! g.x, [g.x](h):= x(g 1h).

artificial intelligence, dataset, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Appendix

Neural Information Processing SystemsApr-25-2026, 02:24:05 GMT

We extra define the following notations for the proof. In Assumption 3.2, we assume the Lipschitz continuity and smoothness for all the activation functions. In the proof of lemmas, e.g., Lemma B.1 and B.2, we only use the fact that they are Lipschitz continuous and smooth, as well as bounded by a constant 0 > 0 at point 0, hence we use () to denote all the activation functions like what we do in Assumption 3.2 for simplicity. Additionally, in the following we introduce notations of the derivatives, mainly used in the proof of Lemma B.1 and Lemma B.2. By definition of feedforward neural networks in Section 2, different from the standard neural networks such as FCNs and CNNs in which the connection between neurons are generally only in adjacent layers, the neurons in feedforward neural networks can be arbitrarily connected as long as there is no loop.

artificial intelligence, machine learning, probability, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

Neural Information Processing SystemsApr-25-2026, 02:24:01 GMT

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these general networks is characterized by the minimum indegree of their neurons, except for the input and first layers. Our results identify the mathematical structure underlying transition to linearity and generalize a number of recent works aimed at characterizing transition to linearity or constancy of the Neural Tangent Kernel for standard architectures.

artificial intelligence, machine learning, neural network, (18 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Filters

Collaborating Authors

spectral norm

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm, and Applications

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Convolutional Normalization: Improving Deep Convolutional Network Robustness and Training

The proposition makes use of the following observation: For the discriminator defined in (1), the norm of gradient for wt is upper bounded by k wtDθ(x)k F kxk LY

4ffb0d2ba92f664c2281970110a2e071-Paper.pdf

18210aa6209b9adfc97b8c17c3741d95-Supplemental-Conference.pdf

18210aa6209b9adfc97b8c17c3741d95-Paper-Conference.pdf

Supplementary Material 7 Elements of Group and Representation Theory

Appendix

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture