Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning
Liu, Bing, Kong, Boao, Lu, Limin, Yuan, Kun, Zhao, Chengcheng
Decentralized learning often involves a weighted global loss with heterogeneous node weights $\lambda$. We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses so as to retain uniform weights (and thus a doubly stochastic mixing matrix), and (ii) keeping the original losses while employing a $\lambda$-induced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework $L^2(\lambda;\mathbb{R}^d)$ and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix is self-adjoint whereas the doubly stochastic one is not; the lack of self-adjointness creates additional penalty terms that amplify consensus error and slow convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.
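To make the weighted geometry concrete, the sketch below (an illustration, not the paper's construction) builds a $\lambda$-induced row-stochastic matrix $R = \Lambda^{-1} S$ from a symmetric nonnegative $S$ with $S\mathbf{1} = \lambda$, and numerically checks that $R$ is row-stochastic, has $\lambda$ as its left Perron vector, and is self-adjoint under the weighted inner product $\langle x, y\rangle_\lambda = x^\top \Lambda y$. The ring topology, the Metropolis-style edge weights, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 6
lam = rng.random(n) + 0.1
lam /= lam.sum()                           # heterogeneous node weights, summing to 1
Lam = np.diag(lam)

# Assumed undirected topology: a ring (self-loops enter through S's diagonal).
edges = [(i, (i + 1) % n) for i in range(n)]

# Symmetric nonnegative S with S @ 1 = lam, using Metropolis-style off-diagonal weights.
S = np.zeros((n, n))
for i, j in edges:
    S[i, j] = S[j, i] = min(lam[i], lam[j]) / n
np.fill_diagonal(S, lam - S.sum(axis=1))   # diagonal absorbs the remaining row mass

# lambda-induced row-stochastic matrix.
R = np.linalg.inv(Lam) @ S

ones = np.ones(n)
print(np.allclose(R @ ones, ones))         # row-stochastic: R 1 = 1
print(np.allclose(lam @ R, lam))           # lambda is the left Perron vector of R
print(np.allclose(Lam @ R, R.T @ Lam))     # self-adjoint w.r.t. <x, y>_lambda = x^T Lam y
```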
Universality in Transfer Learning for Linear Models
We study the problem of transfer learning and fine-tuning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution.
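A minimal sketch of this setting, under assumed specifics (synthetic Gaussian data, squared loss, constant step size): a linear model is initialized at pretrained source weights and fine-tuned with plain SGD on a small sample from the target distribution. The names `w_src` and `w_tgt` and all hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_target = 50, 20                      # dimension, small target training set

# Hypothetical source and target ground-truth models (related but not identical).
w_src = rng.normal(size=d)
w_tgt = w_src + 0.1 * rng.normal(size=d)

# Small target data set: y = x^T w_tgt + noise.
X = rng.normal(size=(n_target, d))
y = X @ w_tgt + 0.01 * rng.normal(size=n_target)

# SGD on the squared loss, initialized at the pretrained weights.
w = w_src.copy()
step = 0.01
for epoch in range(50):
    for i in rng.permutation(n_target):
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i^T w - y_i)^2
        w -= step * grad

print("distance to target model before:", np.linalg.norm(w_src - w_tgt))
print("distance to target model after: ", np.linalg.norm(w - w_tgt))
```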
- North America > United States > California > Los Angeles County > Pasadena (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Greece (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
A Convergence on Two-Layer Nonlinear Networks
We consider the family of neural networks $f(x) = \frac{1}{\sqrt{p}} \sum_{r=1}^{p} \beta_r\, \sigma(w_r^\top x + b_r)$.
Lemma A.2. Assume $W^{(0)}$, $\beta^{(0)}$ and $b$ have i.i.d. entries ...
The proof for (A.5) is similar since $\mathrm{Var}(\cdots)$ ... To prove (A.6), since $|y \cdots|$ ..., with a union bound argument we can show (A.6). Finally, (A.7) follows from standard Gaussian tail bounds and a union bound argument, yielding $P(\max \cdots)$ ...
Under the conditions of Theorem 3.2, we define matrices $G^{(0)}, H^{(0)} \in \mathbb{R}^{\cdots}$ ...
Under the conditions of Theorem 3.2, if the error bound (3.1) holds for all $t = 1, 2, \dots$ ...
From the feedback alignment updates (A.3), we have for all $t \le T$, $|\beta \cdots|$ ...
Lemma A.5. Assume all the inequalities from Lemma A.2 hold. Under the conditions of Theorem 3.2, if the bound for the weights difference (3.2) holds for all $t$ ...
We prove the inequality (3.1) by induction. Suppose (3.1) and (3.2) hold for all $t = 1, 2, \dots$, and assume all the inequalities from Lemma A.2 hold.
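The extracted proofs above reference the feedback alignment updates (A.3), whose formulas did not survive extraction. As a generic illustration only, and not the paper's exact update rule, the sketch below trains the two-layer network from this section with feedback alignment: the backward pass replaces the output weights $\beta$ with a fixed random vector $B$ that is never trained. The ReLU activation, squared loss, and all hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, p, n = 10, 100, 32                        # input dimension, width, batch size

# Two-layer network f(x) = (1/sqrt(p)) * sum_r beta_r * relu(w_r^T x + b_r), i.i.d. Gaussian init.
W = rng.normal(size=(p, d))
beta = rng.normal(size=p)
b = rng.normal(size=p)
B = rng.normal(size=p)                       # fixed random backward weights (never updated)

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lr = 0.1

for step in range(200):
    Z = X @ W.T + b                          # pre-activations, shape (n, p)
    H = np.maximum(Z, 0.0)                   # ReLU features
    err = H @ beta / np.sqrt(p) - y          # residuals of the squared loss

    # Backpropagation would use `beta` here; feedback alignment substitutes the fixed `B`.
    delta = (err[:, None] * B[None, :] / np.sqrt(p)) * (Z > 0)
    W -= lr * (delta.T @ X) / n
    beta -= lr * (H.T @ err) / (n * np.sqrt(p))

f_final = np.maximum(X @ W.T + b, 0.0) @ beta / np.sqrt(p)
print("final mean squared error:", np.mean((f_final - y) ** 2))
```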
A Training Configurations
We summarize the data statistics in our experiments in Table 1. For both fully and semi-supervised node classification tasks on the citation networks Cora, Citeseer, and Pubmed, we train our DGC following the hyperparameters in SGC [5]. Specifically, we train DGC for 100 epochs using Adam [2] with learning rate 0.2. For weight decay, as in SGC, we tune this hyperparameter on each dataset using hyperopt [1] for 10,000 trials. For the large-scale inductive learning task on the Reddit network, we also follow the protocols of SGC [5], using the L-BFGS [3] optimizer for 2 epochs with no weight decay.
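A condensed sketch of this configuration, assuming a PyTorch-style setup; the stand-in linear classifier and synthetic tensors below are placeholders for the DGC pipeline, and only the stated choices (Adam with learning rate 0.2, 100 epochs, hyperopt-tuned weight decay) come from the text. The search space and number of evaluations are assumptions.

```python
import torch
from hyperopt import fmin, tpe, hp

# Synthetic placeholders standing in for the propagated node features and labels.
torch.manual_seed(0)
features = torch.randn(200, 16)
labels = torch.randint(0, 7, (200,))
train_idx, val_idx = torch.arange(0, 140), torch.arange(140, 200)

def train_and_evaluate(weight_decay):
    """Train a stand-in linear classifier for 100 epochs with Adam (lr 0.2); return val loss."""
    model = torch.nn.Linear(16, 7)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.2, weight_decay=weight_decay)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(100):
        optimizer.zero_grad()
        loss = loss_fn(model(features[train_idx]), labels[train_idx])
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return loss_fn(model(features[val_idx]), labels[val_idx]).item()

# Weight decay tuned per dataset with hyperopt; the text uses 10,000 trials, fewer shown here.
best = fmin(
    fn=train_and_evaluate,
    space=hp.loguniform("weight_decay", -10, -4),
    algo=tpe.suggest,
    max_evals=50,
)
print(best)
```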
- North America > United States > Illinois (0.04)
- North America > Canada (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Weak Form Scientific Machine Learning: Test Function Construction for System Identification
Weak form Scientific Machine Learning (WSciML) is a recently developed framework for data-driven modeling and scientific discovery. It leverages the weak form of the equation error residual: by convolving model equations with test functions, the problem is reformulated to avoid direct differentiation of the data, which provides enhanced noise robustness in system identification. Performance, however, relies on wisely choosing a set of compactly supported test functions. In this work, we mathematically motivate a novel data-driven method for constructing Single-scale-Local reference functions from which the set of test functions is created. Our approach numerically approximates the integration error introduced by the quadrature and identifies the support size for which this error is minimal, without requiring access to the model parameter values. Through numerical experiments across various models, noise levels, and temporal resolutions, we demonstrate that the selected supports consistently align with regions of minimal parameter estimation error. We also compare the proposed method against the strategy for constructing Multi-scale-Global (and orthogonal) test functions introduced in our prior work, demonstrating improved computational efficiency.
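To make the weak-form idea concrete, here is a small illustrative sketch, not the paper's method or its test-function construction: for the model $\dot{u} = \theta u$, multiplying by a compactly supported test function $\varphi$ and integrating by parts gives $-\int \dot{\varphi}\, u\, dt = \theta \int \varphi\, u\, dt$, so $\theta$ can be estimated by least squares from quadrature approximations of both integrals without differentiating the noisy data. The bump test function, the support size, the centers, and the rectangle-rule quadrature are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of u(t) = exp(theta * t) on a uniform grid.
theta_true = -0.7
t = np.linspace(0.0, 4.0, 401)
dt = t[1] - t[0]
u = np.exp(theta_true * t) + 0.01 * rng.normal(size=t.size)

def bump(s):
    """Smooth bump supported on (-1, 1): exp(-1 / (1 - s^2)), zero outside."""
    out = np.zeros_like(s)
    inside = np.abs(s) < 1
    out[inside] = np.exp(-1.0 / (1.0 - s[inside] ** 2))
    return out

def bump_deriv(s):
    """Derivative of the bump function (also compactly supported)."""
    out = np.zeros_like(s)
    inside = np.abs(s) < 1
    out[inside] = bump(s[inside]) * (-2.0 * s[inside] / (1.0 - s[inside] ** 2) ** 2)
    return out

# Test functions phi_k(t) = bump((t - c_k) / r); the weak form -∫ phi_k' u dt = theta ∫ phi_k u dt
# turns parameter estimation into least squares on quadrature approximations of both integrals.
r = 0.5
centers = np.linspace(0.6, 3.4, 15)
A = np.array([dt * np.sum(bump((t - c) / r) * u) for c in centers])             # ∫ phi_k u dt
b = np.array([-dt * np.sum(bump_deriv((t - c) / r) / r * u) for c in centers])  # -∫ phi_k' u dt

theta_hat = float(np.dot(A, b) / np.dot(A, A))   # one-dimensional least squares
print("true:", theta_true, "estimated:", theta_hat)
```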
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- North America > United States > Virginia > Hampton (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
- Overview (0.67)
- Research Report (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.88)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Model-Based Reasoning (0.60)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)