Goto

Collaborating Authors

 Gradient Descent


Federated Loss Exploration for Improved Convergence on Non-IID Data

arXiv.org Artificial Intelligence

Federated learning (FL) has emerged as a groundbreaking paradigm in machine learning (ML), offering privacy-preserving collaborative model training across diverse datasets. Despite its promise, FL faces significant hurdles in non-identically and independently distributed (non-IID) data scenarios, where most existing methods often struggle with data heterogeneity and lack robustness in performance. This paper introduces Federated Loss Exploration (FedLEx), an innovative approach specifically designed to tackle these challenges. FedLEx distinctively addresses the shortcomings of existing FL methods in non-IID settings by optimizing its learning behavior for scenarios in which assumptions about data heterogeneity are impractical or unknown. It employs a federated loss exploration technique, where clients contribute to a global guidance matrix by calculating gradient deviations for model parameters. This matrix serves as a strategic compass to guide clients' gradient updates in subsequent FL rounds, thereby fostering optimal parameter updates for the global model. FedLEx effectively navigates the complex loss surfaces inherent in non-IID data, enhancing knowledge transfer in an efficient manner, since only a small number of epochs and small amount of data are required to build a strong global guidance matrix that can achieve model convergence without the need for additional data sharing or data distribution statics in a large client scenario. Our extensive experiments with state-of-the art FL algorithms demonstrate significant improvements in performance, particularly under realistic non-IID conditions, thus highlighting FedLEx's potential to overcome critical barriers in diverse FL applications.


Fast and Accurate Power Load Data Completion via Regularization-optimized Low-Rank Factorization

arXiv.org Artificial Intelligence

Low - rank representat i on learn ing ha s emerged as a powerful tool for recoverin g missing values i n power load data due to its ability to exploit the inherent low - dimensional structures of spatiotemporal measurements. Among various techniques, low - rank factorization models are f a vou red f o r t he ir efficiency and interpretability . Howeve r, their performan ce is highly sensitive to the choice of regularization parameter s, which are typically fixed or manually tuned, resulting in limited generalization capability or slow convergenc e in pra ctica l sc en arios. In this paper, we propo se a Regular ization - optimized Low - Rank Factorization, which introduces a Proportional - Integral - Derivative controller to adaptively adju st the regularization coefficient . Furthe rmore, we provide a detailed algori t hmi c com plex i t y analysis, showing that our method preser ves the computatio nal efficiency of stochastic gradient descent while improving ad aptivity. Experimental results on real - world power load datasets validate the superiority of our method in both imput a tio n acc urac y and training efficiency compared to existi ng baselines.


From Easy to Hard: Building a Shortcut for Differentially Private Image Synthesis

arXiv.org Artificial Intelligence

Differentially private (DP) image synthesis aims to generate synthetic images from a sensitive dataset, alleviating the privacy leakage concerns of organizations sharing and utilizing synthetic images. Although previous methods have significantly progressed, especially in training diffusion models on sensitive images with DP Stochastic Gradient Descent (DP-SGD), they still suffer from unsatisfactory performance. In this work, inspired by curriculum learning, we propose a two-stage DP image synthesis framework, where diffusion models learn to generate DP synthetic images from easy to hard. Unlike existing methods that directly use DP-SGD to train diffusion models, we propose an easy stage in the beginning, where diffusion models learn simple features of the sensitive images. To facilitate this easy stage, we propose to use `central images', simply aggregations of random samples of the sensitive dataset. Intuitively, although those central images do not show details, they demonstrate useful characteristics of all images and only incur minimal privacy costs, thus helping early-phase model training. We conduct experiments to present that on the average of four investigated image datasets, the fidelity and utility metrics of our synthetic images are 33.1% and 2.1% better than the state-of-the-art method.


Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression

arXiv.org Artificial Intelligence

We consider resource allocation problems in multi-user wireless networks, where the goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users. We demonstrate how a state-augmented graph neural network (GNN) parametrization for the resource allocation policy circumvents the drawbacks of the ubiquitous dual subgradient methods by representing the network configurations (or states) as graphs and viewing dual variables as dynamic inputs to the model, viewed as graph signals supported over the graphs. Lagrangian maximizing state-augmented policies are learned during the offline training phase, and the dual variables evolve through gradient updates while executing the learned state-augmented policies during the inference phase. Our main contributions are to illustrate how near-optimal initialization of dual multipliers for faster inference can be accomplished with dual variable regression, leveraging a secondary GNN parametrization, and how maximization of the Lagrangian over the multipliers sampled from the dual descent dynamics substantially improves the training of state-augmented models. We demonstrate the superior performance of the proposed algorithm with extensive numerical experiments in a case study of transmit power control. Finally, we prove a convergence result and an exponential probability bound on the excursions of the dual function (iterate) optimality gaps.


Trustworthy Efficient Communication for Distributed Learning using LQ-SGD Algorithm

arXiv.org Artificial Intelligence

We propose LQ-SGD (Low-Rank Quantized Stochastic Gradient Descent), an efficient communication gradient compression algorithm designed for distributed training. LQ-SGD further develops on the basis of PowerSGD by incorporating the low-rank approximation and log-quantization techniques, which drastically reduce the communication overhead, while still ensuring the convergence speed of training and model accuracy. In addition, LQ-SGD and other compression-based methods show stronger resistance to gradient inversion than traditional SGD, providing a more robust and efficient optimization path for distributed learning systems. With the rapid development of learning models, distributed training has become a fundamental approach to improving model performance and scalability. However, these distributed training systems typically rely on numerous compute nodes working collaboratively, where the synchronization of model parameters and gradients introduces significant communication overhead.


Solving Zero-Sum Convex Markov Games

arXiv.org Artificial Intelligence

We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) by using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. We follow a two-step approach. First, leveraging properties of hidden-convex--hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex-proximal Polyak-Lojasiewicz (NC-pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address the general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.


Random feature approximation for general spectral methods

arXiv.org Machine Learning

Random feature approximation is arguably one of the most widely used techniques for kernel methods in large-scale learning algorithms. In this work, we analyze the generalization properties of random feature methods, extending previous results for Tikhonov regularization to a broad class of spectral regularization techniques. This includes not only explicit methods but also implicit schemes such as gradient descent and accelerated algorithms like the Heavy-Ball and Nesterov method. Through this framework, we enable a theoretical analysis of neural networks and neural operators through the lens of the Neural Tangent Kernel (NTK) approach trained via gradient descent. For our estimators we obtain optimal learning rates over regularity classes (even for classes that are not included in the reproducing kernel Hilbert space), which are defined through appropriate source conditions. This improves or completes previous results obtained in related settings for specific kernel algorithms.


A Study of Hybrid and Evolutionary Metaheuristics for Single Hidden Layer Feedforward Neural Network Architecture

arXiv.org Artificial Intelligence

Training Artificial Neural Networks (ANNs) with Stochastic Gradient Descent (SGD) frequently encounters difficulties, including substantial computing expense and the risk of converging to local optima, attributable to its dependence on partial weight gradients. Therefore, this work investigates Particle Swarm Optimization (PSO) and Genetic Algorithms (GAs) - two population-based Metaheuristic Optimizers (MHOs) - as alternatives to SGD to mitigate these constraints. A hybrid PSO-SGD strategy is developed to improve local search efficiency. The findings indicate that the hybrid PSO-SGD technique decreases the median training MSE by 90 to 95 percent relative to conventional GA and PSO across various network sizes (e.g., from around 0.02 to approximately 0.001 in the Sphere function). RMHC attains substantial enhancements, reducing MSE by roughly 85 to 90 percent compared to GA. Simultaneously, RS consistently exhibits errors exceeding 0.3, signifying subpar performance. These findings underscore that hybrid and evolutionary procedures significantly improve training efficiency and accuracy compared to conventional optimization methods and imply that the Building Block Hypothesis (BBH) may still be valid, indicating that advantageous weight structures are retained during evolutionary search.


EF21 with Bells & Whistles: Six Algorithmic Extensions of Modern Error Feedback

arXiv.org Artificial Intelligence

First proposed by Seide (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime, and when full gradients are compressed, is $O(1/T^{2/3})$, the rate of gradient descent in the same regime is $O(1/T)$). Recently, Richtárik et al. (2021) proposed a new error feedback mechanism, EF21, based on the construction of a Markov compressor induced by a contractive compressor. EF21 removes the aforementioned theoretical deficiencies of EF and at the same time works better in practice. In this work we propose six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum, and bidirectional compression. To the best of our knowledge, several of these techniques have not been previously analyzed in combination with EF, and in cases where prior analysis exists -- such as for bidirectional compression -- our theoretical convergence guarantees significantly improve upon existing results.


PNCS:Power-Norm Cosine Similarity for Diverse Client Selection in Federated Learning

arXiv.org Artificial Intelligence

Federated Learning (FL) has emerged as a powerful paradigm for leveraging diverse datasets from multiple sources while preserving data privacy by avoiding centralized storage. However, many existing approaches fail to account for the intricate gradient correlations between remote clients, a limitation that becomes especially problematic in data heterogeneity scenarios. In this work, we propose a novel FL framework utilizing Power-Norm Cosine Similarity (PNCS) to improve client selection for model aggregation. By capturing higher-order gradient moments, PNCS addresses non-IID data challenges, enhancing convergence speed and accuracy. Additionally, we introduce a simple algorithm ensuring diverse client selection through a selection history queue. Experiments with a VGG16 model across varied data partitions demonstrate consistent improvements over state-of-the-art methods.