AITopics

2510.17621

Country: Europe (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Arafat, Jahidul, Tasmin, Fariha, Poudel, Sanjaya

Feature Selection and Regularization in Multi-Class Classification: An Empirical Study of One-vs-Rest Logistic Regression with Gradient Descent Optimization and L1 Sparsity Constraints

arXiv.org Artificial IntelligenceOct-24-2025

Multi-class wine classification presents fundamental trade-offs between model accuracy, feature dimensionality, and interpretability - critical factors for production deployment in analytical chemistry. This paper presents a comprehensive empirical study of One-vs-Rest logistic regression on the UCI Wine dataset (178 samples, 3 cultivars, 13 chemical features), comparing from-scratch gradient descent implementation against scikit-learn's optimized solvers and quantifying L1 regularization effects on feature sparsity. Manual gradient descent achieves 92.59 percent mean test accuracy with smooth convergence, validating theoretical foundations, though scikit-learn provides 24x training speedup and 98.15 percent accuracy. Class-specific analysis reveals distinct chemical signatures with heterogeneous patterns where color intensity varies dramatically (0.31 to 16.50) across cultivars. L1 regularization produces 54-69 percent feature reduction with only 4.63 percent accuracy decrease, demonstrating favorable interpretability-performance trade-offs. We propose an optimal 5-feature subset achieving 62 percent complexity reduction with estimated 92-94 percent accuracy, enabling cost-effective deployment with 80 dollars savings per sample and 56 percent time reduction. Statistical validation confirms robust generalization with sub-2ms prediction latency suitable for real-time quality control. Our findings provide actionable guidelines for practitioners balancing comprehensive chemical analysis against targeted feature measurement in resource-constrained environments.

accuracy, artificial intelligence, machine learning, (16 more...)

2510.14449

Country: North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Consumer Products & Services > Food, Beverage, Tobacco & Cannabis > Beverages (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.45)

arXiv.org Artificial IntelligenceOct-24-2025

EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms

Liu, WenTao, Song, Siyu, Hao, Hao, Zhou, Aimin

In recent years, large language models (LLMs) have made remarkable progress, with model optimization primarily relying on gradient-based optimizers such as Adam. However, these gradient-based methods impose stringent hardware requirements, demanding high-concurrency, high-memory GPUs. Moreover, they require all neural network operations to be differentiable, thereby excluding many promising non-differentiable architectures from practical use. To address these limitations, we propose EA4LLM, an evolutionary algorithm for optimizing LLMs, and, for the first time, empirically verify full-parameter optimization from the pretraining stage across model sizes ranging from 0.5B to 32B. We conduct extensive experiments and provide key insights into how evolutionary algorithms can effectively optimize neural networks. Our work challenges the prevailing assumption that gradient-based optimization is the only viable approach for training neural networks. It also holds significant potential to reduce the computational cost of training large language models, thereby enabling groups with limited computational resources to participate in deep learning research.

evolutionary algorithm, large language model, machine learning, (16 more...)

2510.10603

Country: Asia > China (0.16)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Agrawalla, Bhavya, Balasubramanian, Krishnakumar, Ghosal, Promit

Statistical Inference for Linear Functionals of Online Least-squares SGD when $t \gtrsim d^{1+δ}$

arXiv.org Machine LearningOct-23-2025

In this work, we establish non-asymptotic Berry-Esseen bounds for linear functionals of online least-squares SGD, thereby providing a Gaussian Central Limit Theorem (CL T) in a growing-dimensional regime. To render the theory practically applicable, we further develop an online variance estimator for the asymptotic variance appearing in the CL T and establish high-probability deviation bounds for this estimator. Stochastic gradient descent [56] is a popular optimization algorithm widely used in data science. It is a stochastic iterative method for minimizing the expected loss function by updating model parameters based on the (stochastic) gradient of the loss with respect to the parameters obtained from a random sample. SGD is widely used for training linear and logistic regression models, support vector machines, deep neural networks, and other such machine learning models on large-scale datasets. Because of its simplicity and effectiveness, SGD has become a staple of modern data science and machine learning, and has been continuously improved and extended to handle more complex scenarios. Despite its wide-spread applicability for prediction and point estimation, quantifying the uncertainty associated with SGD is not well-understood. Indeed, uncertainty quantification is a key component of decision making systems, ensuring the credibility and validity of data-driven findings; see, for e.g., [17], for a concrete medical application where it is not enough to just optimize SGD to obtain prediction performance but is more important to quantify the associated uncertainty.

artificial intelligence, assumption 2, machine learning, (11 more...)

2510.19734

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)

Tsikouras, Nikos, Pantis, Yorgos, Mitliagkas, Ioannis, Tzamos, Christos

A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

arXiv.org Machine LearningOct-23-2025

Understanding the dynamics of feature learning in neural networks (NNs) remains a significant challenge. The work of (Mousavi-Hosseini et al., 2023) analyzes a multiple index teacher-student setting and shows that a two-layer student attains a low-rank structure in its first-layer weights when trained with stochastic gradient descent (SGD) and a strong regularizer. This structural property is known to reduce sample complexity of generalization. Indeed, in a second step, the same authors establish algorithm-specific learning guarantees under additional assumptions. In this paper, we focus exclusively on the structure discovery aspect and study it under weaker assumptions, more specifically: we allow (a) NNs of arbitrary size and depth, (b) with all parameters trainable, (c) under any smooth loss function, (d) tiny regularization, and (e) trained by any method that attains a second-order stationary point (SOSP), e.g.\ perturbed gradient descent (PGD). At the core of our approach is a key $\textit{derandomization}$ lemma, which states that optimizing the function $\mathbb{E}_{\mathbf{x}} \left[g_θ(\mathbf{W}\mathbf{x} + \mathbf{b})\right]$ converges to a point where $\mathbf{W} = \mathbf{0}$, under mild conditions. The fundamental nature of this lemma directly explains structure discovery and has immediate applications in other domains including an end-to-end approximation for MAXCUT, and computing Johnson-Lindenstrauss embeddings.

artificial intelligence, machine learning, neural network, (13 more...)

2510.19382

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)
North America > Canada > Quebec > Montreal (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

arXiv.org Artificial IntelligenceOct-23-2025

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

Yu, Kuai, Wu, Xiaoyu, Yan, Peishen, Yang, Qingqian, Jiang, Linshan, Wang, Hao, Hua, Yang, Song, Tao, Guan, Haibing

Federated Learning (FL) enables decentralized model training across multiple clients without exposing local data, but its distributed feature makes it vulnerable to backdoor attacks. Despite early FL backdoor attacks modifying entire models, recent studies have explored the concept of backdoor-critical (BC) layers, which poison the chosen influential layers to maintain stealthiness while achieving high effectiveness. However, existing BC layers approaches rely on rule-based selection without consideration of the interrelations between layers, making them ineffective and prone to detection by advanced defenses. In this paper, we propose POLAR (POlicy-based LAyerwise Reinforcement learning), the first pipeline to creatively adopt RL to solve the BC layer selection problem in layer-wise backdoor attack. Different from other commonly used RL paradigm, POLAR is lightweight with Bernoulli sampling. POLAR dynamically learns an attack strategy, optimizing layer selection using policy gradient updates based on backdoor success rate (BSR) improvements. To ensure stealthiness, we introduce a regularization constraint that limits the number of modified layers by penalizing large attack footprints. Extensive experiments demonstrate that POLAR outperforms the latest attack methods by up to 40% against six state-of-the-art (SOTA) defenses.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2510.19056

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Bacvanski, Marc Gong, Ziyin, Liu, Poggio, Tomaso

On Biologically Plausible Learning in Continuous Time

arXiv.org Artificial IntelligenceOct-22-2025

Biological learning unfolds continuously in time, yet most algorithmic models rely on discrete updates and separate inference and learning phases. We study a continuous-time neural model that unifies several biologically plausible learning algorithms and removes the need for phase separation. Rules including stochastic gradient descent (SGD), feedback alignment (FA), direct feedback alignment (DFA), and Kolen-Pollack (KP) emerge naturally as limiting cases of the dynamics. Simulations show that these continuous-time networks stably learn at biological timescales, even under temporal mismatches and integration noise. Through analysis and simulation, we show that learning depends on temporal overlap: a synapse updates correctly only when its input and the corresponding error signal coincide in time. When inputs are held constant, learning strength declines linearly as the delay between input and error approaches the stimulus duration, explaining observed robustness and failure across network depths. Critically, robust learning requires the synaptic plasticity timescale to exceed the stimulus duration by one to two orders of magnitude. For typical cortical stimuli (tens of milliseconds), this places the functional plasticity window in the few-second range, a testable prediction that identifies seconds-scale eligibility traces as necessary for error-driven learning in biological circuits.

artificial intelligence, machine learning, timescale, (17 more...)

2510.18808

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Sobotka, Jan, Šimánek, Petr, Kordík, Pavel

Enhancing Fractional Gradient Descent with Learned Optimizers

arXiv.org Machine LearningOct-22-2025

Fractional Gradient Descent (FGD) offers a novel and promising way to accelerate optimization by incorporating fractional calculus into machine learning. Although FGD has shown encouraging initial results across various optimization tasks, it faces significant challenges with convergence behavior and hyperparameter selection. Moreover, the impact of its hyperparameters is not fully understood, and scheduling them is particularly difficult in non-convex settings such as neural network training. To address these issues, we propose a novel approach called Learning to Optimize Caputo Fractional Gradient Descent (L2O-CFGD), which meta-learns how to dynamically tune the hyperparameters of Caputo FGD (CFGD). Our method's meta-learned schedule outperforms CFGD with static hyperparameters found through an extensive search and, in some tasks, achieves performance comparable to a fully black-box meta-learned optimizer. L2O-CFGD can thus serve as a powerful tool for researchers to identify high-performing hyperparameters and gain insights on how to leverage the history-dependence of the fractional differential in optimization.

artificial intelligence, l2o-cfgd, machine learning, (13 more...)

2510.18783

Country: North America > United States (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Machine LearningOct-22-2025

Learning under Quantization for High-Dimensional Linear Regression

Zhang, Dechen, Su, Junwei, Zou, Difan

The use of low-bit quantization has emerged as an indispensable technique for enabling the efficient training of large-scale models. Despite its widespread empirical success, a rigorous theoretical understanding of its impact on learning performance remains notably absent, even in the simplest linear regression setting. We present the first systematic theoretical study of this fundamental question, analyzing finite-step stochastic gradient descent (SGD) for high-dimensional linear regression under a comprehensive range of quantization targets: data, labels, parameters, activations, and gradients. Our novel analytical framework establishes precise algorithm-dependent and data-dependent excess risk bounds that characterize how different quantization affects learning: parameter, activation, and gradient quantization amplify noise during training; data quantization distorts the data spectrum; and data and label quantization introduce additional approximation and quantized error. Crucially, we prove that for multiplicative quantization (with input-dependent quantization step), this spectral distortion can be eliminated, and for additive quantization (with constant quantization step), a beneficial scaling effect with batch size emerges. Furthermore, for common polynomial-decay data spectra, we quantitatively compare the risks of multiplicative and additive quantization, drawing a parallel to the comparison between FP and integer quantization methods. Our theory provides a powerful lens to characterize how quantization shapes the learning dynamics of optimization algorithms, paving the way to further explore learning theory under practical hardware constraints.

artificial intelligence, machine learning, quantization, (15 more...)

2510.18259

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Sonthalia, Rishi, Nadakuditi, Raj Rao

Matricial Free Energy as a Gaussianizing Regularizer: Enhancing Autoencoders for Gaussian Code Generation

arXiv.org Machine LearningOct-21-2025

We introduce a novel regularization scheme for autoencoders based on matricial free energy. Our approach defines a differentiable loss function in terms of the singular values of the code matrix (code dimension x batch size). From the standpoint of free probability an d random matrix theory, this loss achieves its minimum when the singular value distribution of the code matrix coincides with that of an appropriately sculpted random metric with i.i.d. Gaussian entries. Empirical simulations demonstrate that minimizing the negative matricial free energy through standard stochastic gradient-based training yields Gaussian-like codes that generalize across training and test sets. Building on this foundation, we propose a matricidal free energy maximizing autoencoder that reliably produces Gaussian codes and show its application to underdetermined inverse problems.

artificial intelligence, autoencoder, machine learning, (16 more...)

2510.1712

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)