On Optimal Hyperparameters for Differentially Private Deep Transfer Learning
Rehn, Aki, Zhao, Linzh, Heikkilä, Mikko A., Honkela, Antti
Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound $C$ and batch size $B$. We show a clear mismatch between the current theoretical understanding of how to choose an optimal $C$ (stronger privacy requires smaller $C$) and empirical outcomes (larger $C$ performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning $B$ do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single $(C,B)$ setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.
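The interplay of the two hyperparameters can be made concrete with a minimal DP-SGD-style step, where the clipping bound $C$ caps each per-example gradient and the batch size $B$ divides the injected Gaussian noise. This is an illustrative sketch of the standard mechanism, not the paper's code; all names are ours.

```python
import numpy as np

def dp_sgd_step(per_example_grads, C, sigma, rng):
    """One DP-SGD update direction: clip each per-example gradient to
    L2 norm at most C, average over the batch, and add Gaussian noise
    with standard deviation sigma * C / B (Gaussian-mechanism calibration)."""
    B = len(per_example_grads)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Re-weighting view of clipping: each example's contribution is
        # scaled by min(1, C / ||g||), so large-gradient examples keep
        # their direction but are down-weighted in magnitude.
        clipped.append(g * min(1.0, C / max(norm, 1e-12)))
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * C / B, size=mean.shape)
    return mean + noise

rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(8)]
update = dp_sgd_step(grads, C=1.0, sigma=1.0, rng=rng)
```

The `min(1, C / ||g||)` factor is exactly the gradient re-weighting the abstract refers to, and the `sigma * C / B` noise scale shows why $C$ and $B$ trade off against each other under a fixed privacy budget.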
Gradient Shaping Beyond Clipping: A Functional Perspective on Update Magnitude Control
Gradient clipping is widely used to stabilize deep network training, but its formulation as a hard, fixed threshold limits flexibility and ignores gradient distribution dynamics. We propose SPAMP (Statistical Per-layer Adaptive Modulation and Projection), a unified framework that generalizes clipping into smooth, per-layer gradient shaping. SPAMP tracks local gradient statistics, dynamically estimates thresholds, and applies power-based transformations to modulate update magnitudes in a differentiable manner. This perspective recasts clipping and warmup as dual mechanisms for controlling the effective update scale $\eta_t \|g_t\|$, offering a principled alternative to rigid heuristics. Extensive experiments across image and language tasks demonstrate that SPAMP improves stability, convergence, and robustness over existing methods.
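The idea of replacing a hard threshold with statistics-driven, smooth per-layer shaping can be sketched as follows. This is an illustrative reading of the abstract, not SPAMP itself: the EMA tracking, the `mean + k * std` threshold rule, and the power-law attenuation are all our own assumptions.

```python
import numpy as np

class SoftLayerShaper:
    """Illustrative per-layer gradient shaper in the spirit of the abstract:
    track running norm statistics, derive a threshold from them, and smoothly
    attenuate (rather than hard-clip) oversized updates."""

    def __init__(self, beta=0.9, k=2.0, p=0.5):
        self.beta = beta  # EMA decay for the norm statistics
        self.k = k        # threshold = mean + k * std
        self.p = p        # exponent of the smooth attenuation
        self.mean = 0.0
        self.var = 0.0

    def shape(self, g):
        norm = float(np.linalg.norm(g))
        # Update exponential moving statistics of the gradient norm.
        self.mean = self.beta * self.mean + (1 - self.beta) * norm
        self.var = self.beta * self.var + (1 - self.beta) * (norm - self.mean) ** 2
        tau = self.mean + self.k * self.var ** 0.5
        if norm <= tau:
            return g
        # Smooth power-law attenuation instead of a hard cut: the output
        # norm is tau * (norm / tau) ** p, which is continuous at norm = tau
        # and grows sublinearly for p < 1 instead of saturating at tau.
        return g * (tau / norm) * (norm / tau) ** self.p
```

Unlike hard clipping, the shaped norm still depends (sublinearly) on the raw norm, so ordering information among large gradients is preserved.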
A Proofs for Fat-Tailed Federated Learning
A.1 Proof of FAT-Clipping-PR. For notational clarity, we have the following update: Local update: x ...
The first inequality follows from the strongly-convex property, i.e., Assumption 4 (Bounded Stochastic Gradient Variance): there exists a constant ... Assumption 5 (Bounded Gradient): there exists a constant ... We remark that the above inequalities hold for any stochastic estimator that satisfies these conditions. The proof is exactly the same as the original proof [18].
Theorem 6. Suppose f is ...
We run a convolutional neural network (CNN) model on the CIFAR-10 dataset using FedAvg; the CNN architecture is shown in Table 2. To simulate data heterogeneity across clients, we manually ... The dataset and model are taken from [45]. This implies that the gradient noise is fat-tailed.
Taming Fat-Tailed ("Heavier-Tailed" with Potentially Infinite Variance) Noise in Federated Learning
In recent years, federated learning (FL) has emerged as an important distributed machine learning paradigm to collaboratively learn a global model with multiple clients, while keeping data local and private. However, a key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., ``heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that both variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds.
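A per-round clipped FedAvg update of the kind analyzed in the appendix (FAT-Clipping-PR) can be sketched as follows. This is a heavy simplification with one local step per client; the function name, the threshold parameter `lam`, and the fixed (unscheduled) clipping are our assumptions, not the paper's algorithm.

```python
import numpy as np

def fat_clipping_pr_round(x, client_grad_fns, lr, lam):
    """One communication round in the spirit of per-round clipping:
    each client computes a local update, and the whole accumulated
    update is clipped to L2 norm at most lam before server averaging."""
    updates = []
    for grad_fn in client_grad_fns:
        g = grad_fn(x)      # stochastic gradient, possibly fat-tailed
        delta = -lr * g     # a single local step, for simplicity
        norm = np.linalg.norm(delta)
        # Clipping the accumulated update bounds the influence of any
        # single heavy-tailed sample on the averaged server model.
        updates.append(delta * min(1.0, lam / max(norm, 1e-12)))
    return x + np.mean(updates, axis=0)
```

The point of the construction is that the server step per round moves at most `lam`, regardless of how heavy the noise tail is, which is what makes a convergence analysis possible without a finite-variance assumption.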
Private and Communication-Efficient Federated Learning based on Differentially Private Sketches
Zhang, Meifan, Xie, Zhanhong, Yin, Lihua
Federated learning (FL) faces two primary challenges: the risk of privacy leakage due to parameter sharing and communication inefficiencies. To address these challenges, we propose DPSFL, a federated learning method that utilizes differentially private sketches. DPSFL compresses the local gradients of each client using a count sketch, thereby improving communication efficiency, while adding noise to the sketches to ensure differential privacy (DP). We provide a theoretical analysis of privacy and convergence for the proposed method. Gradient clipping is essential in DP learning to limit sensitivity and constrain the addition of noise. However, clipping introduces bias into the gradients, negatively impacting FL performance. To mitigate the impact of clipping, we propose an enhanced method, DPSFL-AC, which employs an adaptive clipping strategy. Experimental comparisons with existing techniques demonstrate the superiority of our methods concerning privacy preservation, communication efficiency, and model accuracy.
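The count-sketch compression at the heart of this approach can be sketched as follows. The hash construction, dimensions, and the place where noise is injected are illustrative assumptions rather than the paper's implementation; only the generic count-sketch encode/decode is standard.

```python
import numpy as np

def make_hashes(r, w, d, seed=0):
    """Per-row bucket indices and random signs for a d-dim vector."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, w, size=(r, d))        # bucket per (row, coord)
    sign = rng.choice([-1.0, 1.0], size=(r, d))  # random sign per (row, coord)
    return idx, sign

def sketch(g, idx, sign, w, noise_std=0.0, rng=None):
    """Compress gradient g into an r x w count sketch; optionally add
    Gaussian noise to the sketch itself, as the DP variant would."""
    r, d = idx.shape
    S = np.zeros((r, w))
    for i in range(r):
        np.add.at(S[i], idx[i], sign[i] * g)     # scatter-add into buckets
    if noise_std > 0:
        S += (rng or np.random.default_rng()).normal(0.0, noise_std, S.shape)
    return S

def unsketch(S, idx, sign):
    """Median-of-rows estimate of each coordinate from the sketch."""
    r, d = idx.shape
    est = sign * S[np.arange(r)[:, None], idx]   # r estimates per coordinate
    return np.median(est, axis=0)
```

Communication cost drops from `d` to `r * w` values per client, and because the noise is added to the low-dimensional sketch rather than the full gradient, sensitivity analysis is done in sketch space.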
Random Gradient Masking as a Defensive Measure to Deep Leakage in Federated Learning
Federated Learning (FL)[1][2] emerged as an artificial intelligence training method that does not require sending data from peripheral devices (clients) to a central server. Rather, each client downloads the central model from the server, trains it over their private data, and sends the resulting gradients of the private training back to the server, all of which are aggregated by a server-side algorithm to produce the next iteration of the central model. Ideally, mutually distrustful clients never communicate their private data, yet they produce a central model that encompasses all clients' data. Extensive research is being conducted on optimizing the learning efficiency of FL in various aspects such as incentive mechanisms[3], communication speed[4], non-IID training[5], and client selection[6]. However, recent research reveals that sending the gradients of private training does not ensure complete data privacy, especially in a wide cross-device environment[7]. Moreover, as a federated system, FL has to protect itself against Byzantine failure[8], backdoor injection[9], model poisoning[10], and data poisoning[11].
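A minimal form of random gradient masking can be sketched as follows. The paper's exact masking and aggregation scheme may differ; in particular, the unbiased `1/keep_prob` rescaling shown here is our own assumption, added so the server-side average still estimates the true gradient in expectation.

```python
import numpy as np

def mask_gradient(g, keep_prob, rng):
    """Zero out each gradient entry independently with probability
    1 - keep_prob before upload, rescaling kept entries by 1/keep_prob.
    A leakage attacker then sees only a random subset of coordinates,
    while the masked gradient remains unbiased: E[masked] = g."""
    mask = rng.random(g.shape) < keep_prob
    return np.where(mask, g / keep_prob, 0.0)
```

Averaging over many clients (or rounds) washes out the masking noise, which is why such a defense can trade reconstruction-attack difficulty against only a modest hit to convergence.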
Boosting Soft Q-Learning by Bounding
Adamczyk, Jacob, Makarenko, Volodymyr, Tiomkin, Stas, Kulkarni, Rahul V.
An agent's ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.
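How double-sided bounds can be enforced during soft Q-learning updates can be sketched as follows. How the paper derives its bounds `lo`/`hi` from a prior value estimate is its actual contribution and is not reproduced here; clamping the soft Bellman target to given bounds is our illustrative use of such bounds.

```python
import numpy as np

def soft_backup(Q, r, s_next, gamma, beta):
    """Soft Bellman target: r + gamma * (1/beta) * log sum_a exp(beta * Q[s',a])."""
    v = np.log(np.sum(np.exp(beta * Q[s_next]))) / beta
    return r + gamma * v

def bounded_update(Q, s, a, r, s_next, alpha, gamma, beta, lo, hi):
    """Tabular soft Q-learning step whose target is clipped into
    [lo[s,a], hi[s,a]] before the usual TD update, so the learned
    Q-values can never leave the trusted interval."""
    target = np.clip(soft_backup(Q, r, s_next, gamma, beta), lo[s, a], hi[s, a])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

If the bounds are valid (i.e., they really do bracket the optimal soft value function), clipping can only reduce target error, which is one intuition for the boosted training performance reported.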