norm constraint
- Europe > Germany > Lower Saxony > Göttingen (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Texas > Harris County > Houston (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area > Neurology (0.96)
- Government > Regional Government (0.67)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.04)
- North America > United States > Washington > King County > Bellevue (0.04)
- (6 more...)
The Effect of Depth on the Expressivity of Deep Linear State-Space Models
Bao, Zeyu, Yu, Penghao, Jiang, Haotian, Li, Qianxiao
Deep state-space models (SSMs) have gained increasing popularity in sequence modelling. While there are numerous theoretical investigations of shallow SSMs, how the depth of the SSM affects its expressiveness remains a crucial problem. In this paper, we systematically investigate the role of depth and width in deep linear SSMs, aiming to characterize how they influence the expressive capacity of the architecture. First, we rigorously prove that in the absence of parameter constraints, increasing depth and increasing width are generally equivalent, provided that the parameter count remains within the same order of magnitude. However, under the assumption that the parameter norms are constrained, the effects of depth and width differ significantly. We show by a constructive method that a shallow linear SSM with large parameter norms can be represented by a deep linear SSM with smaller norms. In particular, this demonstrates that under norm constraints, deep SSMs are more capable than shallow SSMs of representing targets with large norms. Finally, we derive upper bounds on the minimal depth required for a deep linear SSM to represent a given shallow linear SSM under constrained parameter norms. We also validate our theoretical results with numerical experiments.
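To make the depth-width equivalence concrete, here is a minimal numpy sketch (our illustration, not code from the paper): stacking two linear SSM layers produces the same input-output map as a single linear SSM of twice the state width, with a block-triangular state matrix.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Linear SSM: h_t = A h_{t-1} + B x_t, y_t = C h_t (scalar in/out)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d = 3
A1 = 0.5 * rng.standard_normal((d, d)); B1 = rng.standard_normal(d); C1 = rng.standard_normal(d)
A2 = 0.5 * rng.standard_normal((d, d)); B2 = rng.standard_normal(d); C2 = rng.standard_normal(d)
x = rng.standard_normal(20)

# Depth 2: feed the first layer's output into the second.
y_deep = ssm_scan(A2, B2, C2, ssm_scan(A1, B1, C1, x))

# Width 2d: one SSM with a block-triangular state matrix realizes the same map,
# since h2_t = A2 h2_{t-1} + B2 (C1 h1_t) and h1_t = A1 h1_{t-1} + B1 x_t.
A_wide = np.block([[A1, np.zeros((d, d))],
                   [np.outer(B2, C1) @ A1, A2]])
B_wide = np.concatenate([B1, np.outer(B2, C1) @ B1])
C_wide = np.concatenate([np.zeros(d), C2])

assert np.allclose(y_deep, ssm_scan(A_wide, B_wide, C_wide, x))
```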
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.93)
TNCSE: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Sentence Embeddings
Zong, Tianyu, Shi, Bingkang, Yi, Hongzhu, Xu, Jungang
Unsupervised sentence embedding representation has become a hot research topic in natural language processing. As a tensor, a sentence embedding has two critical properties: direction and norm. Existing works have been limited to constraining only the direction of the samples' representations while ignoring their module lengths (norms). To address this issue, we propose a new training objective that optimizes unsupervised contrastive learning by constraining the module-length features between positive samples. We combine the training objective of Tensor's Norm Constraints with ensemble learning to propose a new sentence embedding representation framework, TNCSE. We evaluate on seven semantic textual similarity tasks, and the results show that TNCSE and its derived models are the current state-of-the-art approach; in addition, we conduct extensive zero-shot evaluations, and the results show that TNCSE outperforms other baselines.
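As a rough illustration of the idea (a hypothetical sketch, not the authors' exact objective): a norm-mismatch penalty can be added alongside a standard InfoNCE loss so that positive pairs agree in magnitude as well as direction. The function name tnc_loss and the weight lam below are our own choices.

```python
import torch
import torch.nn.functional as F

def tnc_loss(z1, z2, tau=0.05, lam=0.1):
    """InfoNCE on directions plus a penalty on norm mismatch between positives.

    z1, z2: (batch, dim) embeddings of two views of the same sentences.
    The norm term is a stand-in for the paper's tensor-norm constraint.
    """
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(z1.size(0))                    # positives on the diagonal
    contrastive = F.cross_entropy(sim, labels)           # constrains direction only
    norm_gap = (z1.norm(dim=-1) - z2.norm(dim=-1)).abs().mean()  # constrains module length
    return contrastive + lam * norm_gap
```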
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (14 more...)
- Research Report > New Finding (0.54)
- Research Report > Promising Solution (0.34)
- Overview > Innovation (0.34)
Muon is Scalable for LLM Training
Liu, Jingyuan, Su, Jianlin, Yao, Xingcheng, Jiang, Zhejun, Lai, Guokun, Du, Yulun, Qin, Yidao, Xu, Weixin, Lu, Enzhe, Yan, Junjie, Chen, Yanru, Zheng, Huabin, Liu, Yibo, Liu, Shaowei, Yin, Bohong, He, Weiran, Zhu, Han, Wang, Yuzhi, Wang, Jianzhou, Dong, Mengnan, Zhang, Zheng, Kang, Yongsheng, Zhang, Hao, Xu, Xinran, Zhang, Yutao, Wu, Yuxin, Zhou, Xinyu, Yang, Zhilin
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjusting the per-parameter update scale. These techniques allow Muon to work out-of-the-box on large-scale training without the need for hyper-parameter tuning. Scaling law experiments indicate that Muon achieves $\sim\!2\times$ computational efficiency compared to AdamW with compute-optimal training. Based on these improvements, we introduce Moonlight, a 3B/16B-parameter Mixture-of-Experts (MoE) model trained with 5.7T tokens using Muon. Our model improves the current Pareto frontier, achieving better performance with far fewer training FLOPs than prior models. We open-source our distributed Muon implementation that is memory-optimal and communication-efficient. We also release the pretrained, instruction-tuned, and intermediate checkpoints to support future research.
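The two techniques are easy to state in code. The sketch below is our reading, not the paper's implementation: the quintic Newton-Schulz coefficients follow the public Muon code, and the 0.2 * sqrt(max(m, n)) factor is one plausible interpretation of "per-parameter update scale"; the paper's exact rule may differ.

```python
import torch

def newton_schulz(G, steps=5):
    """Approximate semi-orthogonalization of a 2D tensor via a quintic
    Newton-Schulz iteration (coefficients as in the public Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_style_step(p, buf, lr=1e-3, beta=0.95, wd=0.1):
    """One hedged Muon-style update: momentum -> orthogonalize -> scale."""
    buf.mul_(beta).add_(p.grad)
    update = newton_schulz(buf)
    scale = 0.2 * max(p.size(0), p.size(1)) ** 0.5  # assumed RMS-matching rule
    p.mul_(1 - lr * wd)                              # (1) decoupled weight decay
    p.add_(update, alpha=-lr * scale)                # (2) adjusted update scale
```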
- Asia > Middle East > Jordan (0.05)
- North America > United States > California > San Diego County > San Diego (0.04)
Learning Kernels with Radiuses of Minimum Enclosing Balls
Gai, Kun, Chen, Guangyun, Zhang, Chang-shui
In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself cannot well describe how good a kernel is, because it neglects the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning a linear combination of basis kernels it is also invariant to scalings of the basis kernels and to the types (e.g., $L_1$ or $L_2$) of norm constraints on the combination coefficients.
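The scale-invariance claim admits a one-line check: rescaling the kernel rescales the squared margin and the squared radius by the same factor, so their ratio is unchanged.

```latex
% Under K -> cK (c > 0), feature-space distances scale by sqrt(c), hence
\gamma^2(cK) = c\,\gamma^2(K), \qquad R^2(cK) = c\,R^2(K)
\quad\Longrightarrow\quad
\frac{R^2(cK)}{\gamma^2(cK)} = \frac{R^2(K)}{\gamma^2(K)} .
```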
- Asia > Middle East > Jordan (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
- Asia > China > Beijing > Beijing (0.04)
Lifelong Sequential Knowledge Editing without Model Degradation
Gupta, Akshat, Prateepamornkul, Phudish, Lu, Maochuan, Alaa, Ahmed, Hartvigsen, Thomas, Anumanchipalli, Gopala
Prior work in parameter-modifying knowledge editing has shown that large-scale sequential editing leads to significant model degradation. In this paper, we study the reasons behind this and scale sequential knowledge editing to 10,000 sequential edits while maintaining the downstream performance of the original model. We first show that locate-then-edit knowledge editing methods lead to overfitting on the edited facts. We also show that continuous knowledge editing using these methods leads to disproportionate growth in the norm of the edited matrix. We then provide a crucial insight into the inner workings of locate-then-edit methods: norm growth is a hidden trick employed by these methods that gives larger importance to the output activations produced by the edited layers. With this "importance hacking", the edited layers make a much larger contribution to the model's output. To mitigate these issues, we present ENCORE - Early stopping and Norm-Constrained Robust knowledge Editing. ENCORE controls for overfitting and the disproportionate norm growth to enable long-term sequential editing, where we are able to perform up to 10,000 sequential edits without loss of downstream performance. ENCORE is also 61% faster than MEMIT and 64% faster than AlphaEdit on Llama3-8B.
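A minimal sketch of the norm-constraint idea (our hypothetical stand-in, not ENCORE's exact rule): after each edit, if the edited matrix's Frobenius norm has grown past a tolerance relative to the original, rescale it back onto the constraint ball.

```python
import torch

def norm_constrained_update(W_orig, W_edited, max_growth=1.1):
    """Project an edited weight matrix back toward the original's norm.

    If ||W_edited||_F exceeds max_growth * ||W_orig||_F, rescale it onto
    the constraint ball; otherwise leave it unchanged. max_growth is a
    hypothetical tolerance, not a value from the paper.
    """
    limit = max_growth * W_orig.norm()
    norm = W_edited.norm()
    if norm > limit:
        W_edited = W_edited * (limit / norm)
    return W_edited
```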
- Asia > Middle East > Republic of Türkiye (0.14)
- Asia > Singapore (0.04)
- Asia > Malaysia (0.04)
- (5 more...)
- Law (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Health & Medicine > Therapeutic Area (0.68)
A GPU-Accelerated Bi-linear ADMM Algorithm for Distributed Sparse Machine Learning
Olama, Alireza, Lundell, Andreas, Kronqvist, Jan, Ahmadi, Elham, Camponogara, Eduardo
This paper introduces the Bi-linear consensus Alternating Direction Method of Multipliers (Bi-cADMM), aimed at solving large-scale regularized Sparse Machine Learning (SML) problems defined over a network of computational nodes. Mathematically, these are stated as minimization problems with convex local loss functions over a global decision vector, subject to an explicit $\ell_0$ norm constraint to enforce the desired sparsity. The considered SML problem generalizes different sparse regression and classification models, such as sparse linear and logistic regression, sparse softmax regression, and sparse support vector machines. Bi-cADMM leverages a bi-linear consensus reformulation of the original non-convex SML problem and a hierarchical decomposition strategy that divides the problem into smaller sub-problems amenable to parallel computing. In Bi-cADMM, this decomposition strategy is based on a two-phase approach. Initially, it performs a sample decomposition of the data and distributes local datasets across computational nodes. Subsequently, a delayed feature decomposition of the data is conducted on Graphics Processing Units (GPUs) available to each node. This methodology allows Bi-cADMM to undertake computationally intensive data-centric computations on GPUs, while CPUs handle more cost-effective computations. The proposed algorithm is implemented within an open-source Python package called Parallel Sparse Fitting Toolbox (PsFiT), which is publicly available. Finally, computational experiments demonstrate the efficiency and scalability of our algorithm through numerical benchmarks across various SML problems featuring distributed datasets.
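The key primitive for enforcing an explicit $\ell_0$ norm constraint inside an ADMM-type splitting is the Euclidean projection onto the $\ell_0$ ball, i.e., hard thresholding to the $k$ largest-magnitude entries. A minimal sketch follows (an assumed helper for illustration, not PsFiT's API):

```python
import numpy as np

def project_l0(v, k):
    """Euclidean projection onto {x : ||x||_0 <= k}: keep the k entries of
    largest magnitude, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out

print(project_l0(np.array([0.1, -3.0, 2.0, 0.5]), k=2))  # [ 0. -3.  2.  0.]
```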
- Europe > Finland > Ostrobothnia > Vaasa (0.05)
- South America > Brazil > Santa Catarina > Florianópolis (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- Asia > China (0.04)