

Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning

Liu, Bing, Kong, Boao, Lu, Limin, Yuan, Kun, Zhao, Chengcheng

arXiv.org Artificial Intelligence

Decentralized learning often involves a weighted global loss with heterogeneous node weights $λ$. We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses to retain a uniform weight (and thus a doubly stochastic matrix), and (ii) keeping the original losses while employing a $λ$-induced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework $L^2(λ;\mathbb{R}^d)$ and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix becomes self-adjoint whereas the doubly stochastic one does not, creating additional penalty terms that amplify consensus error, thereby slowing convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, by using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we further obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.
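The self-adjointness claim is easy to check numerically. The sketch below is not from the paper; it assumes a generic λ-reversible construction (a symmetric kernel normalized by its row sums) and verifies that the resulting row-stochastic W is self-adjoint in the weighted inner product ⟨x, y⟩_λ = xᵀΛy, i.e. that ΛW is symmetric:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
K = rng.random((n, n))
K = K + K.T                  # symmetric nonnegative kernel
d = K.sum(axis=1)
W = K / d[:, None]           # row-stochastic: each row sums to 1
lam = d / d.sum()            # node weights lambda (stationary distribution of W)
L = np.diag(lam)

assert np.allclose(W.sum(axis=1), 1.0)
# Self-adjointness in L^2(lam): <Wx, y>_lam == <x, Wy>_lam  iff  L @ W is symmetric.
# Here L @ W = K / d.sum(), which is symmetric by construction.
assert np.allclose(L @ W, (L @ W).T)
```

With nonuniform λ a doubly stochastic matrix generally fails this symmetry test, which is the source of the extra penalty terms the abstract describes.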


Universality in Transfer Learning for Linear Models

Neural Information Processing Systems

We study the problem of transfer learning and fine-tuning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution.
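As a toy illustration of this setup (illustrative dimensions, step size, and noise level, not the paper's), one can run SGD on the squared loss from a pretrained initialization and compare it with training from scratch on the same small target set:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_target = 50, 20                            # small target set: n_target << d
w_source = rng.normal(size=d)                   # pretrained (source-task) weights
w_true = w_source + 0.1 * rng.normal(size=d)    # target model is near the source model

X = rng.normal(size=(n_target, d))
y = X @ w_true + 0.01 * rng.normal(size=n_target)   # target regression data

def sgd(w0, X, y, lr=0.01, epochs=200):
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            g = (X[i] @ w - y[i]) * X[i]        # per-sample squared-loss gradient
            w -= lr * g
    return w

w_finetuned = sgd(w_source, X, y)
w_scratch = sgd(np.zeros(d), X, y)

# In the underdetermined regime SGD stays close to its initialization, so the
# pretrained start lands much nearer the target model than training from scratch.
print(np.linalg.norm(w_finetuned - w_true), np.linalg.norm(w_scratch - w_true))
```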


A Convergence on Two-Layer Nonlinear Networks — We consider the family of neural networks f(x) = (1/√p) · …

Neural Information Processing Systems

Lemma A.2. Assume W(0), β(0), and b have i.i.d. entries. The proof of (A.5) is similar, since Var(…). To prove (A.6), since |y…|, a union-bound argument shows (A.6). Finally, (A.7) follows from standard Gaussian tail bounds and a union-bound argument, yielding P(max …). Under the conditions of Theorem 3.2, we define matrices G(0), H(0) ∈ R^{…}. Under the conditions of Theorem 3.2, if the error bound (3.1) holds for all t = 1, 2, …, t…, then from the feedback alignment updates (A.3) we have, for all t ≤ T, |β…|. Lemma A.5. Assume all the inequalities from Lemma A.2 hold. Under the conditions of Theorem 3.2, if the bound on the weight difference (3.2) holds for all t ≤ t…. We prove inequality (3.1) by induction: suppose (3.1) and (3.2) hold for all t = 1, 2, …, t…, and assume all the inequalities from Lemma A.2 hold.


A Training Configurations

Neural Information Processing Systems

We summarize the data statistics of our experiments in Table 1. For both the fully and semi-supervised node classification tasks on the citation networks Cora, Citeseer, and Pubmed, we train our DGC following the hyperparameters of SGC [5]. Specifically, we train DGC for 100 epochs using Adam [2] with learning rate 0.2. For weight decay, as in SGC, we tune this hyperparameter on each dataset using hyperopt [1] for 10,000 trials. For the large-scale inductive learning task on the Reddit network, we also follow the protocol of SGC [5], using the L-BFGS [3] optimizer for 2 epochs with no weight decay.
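The weight-decay search can be sketched as follows. Since the actual run uses hyperopt's TPE sampler over a trained model's validation loss, the snippet substitutes a plain log-uniform random search and a hypothetical surrogate objective (with an assumed minimum near 1e-5) so it stays self-contained:

```python
import math
import random

random.seed(0)

def val_loss(weight_decay):
    # Hypothetical stand-in for "train DGC, return validation loss";
    # the surrogate's minimum at wd = 1e-5 is an assumption for illustration.
    return (math.log10(weight_decay) + 5.0) ** 2

# hyperopt [1] would use hp.loguniform + TPE; plain log-uniform random
# search over the same range conveys the idea without the dependency.
best_wd, best_loss = None, float("inf")
for _ in range(10_000):                    # "10,000 trials" as in the text
    wd = 10 ** random.uniform(-8, -2)      # log-uniform over [1e-8, 1e-2]
    loss = val_loss(wd)
    if loss < best_loss:
        best_wd, best_loss = wd, loss

print(best_wd)   # close to 1e-5 for this surrogate objective
```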


On the Tightness of Semidefinite Relaxations for Certifying Robustness to Adversarial Examples

Neural Information Processing Systems

If the relaxation is loose, however, then the resulting certificate can be too conservative to be practically useful. Recently, a less conservative robustness certificate was proposed, based on a semidefinite programming (SDP) relaxation of the ReLU activation function.
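As a sanity check on the direction of conservatism (a generic illustration, not the paper's certificate), every exact ReLU input/output pair satisfies the relaxation's constraints, so the SDP's feasible set can only over-approximate the true set and any certified bound errs on the safe side:

```python
import numpy as np

def relu_sdp_feasible(x, tol=1e-9):
    """Check that the exact pair (x, z = ReLU(x)) satisfies the SDP relaxation's
    constraints: z >= 0, z >= x, the quadratic z^2 = z*x on lifted moments,
    and positive semidefiniteness of the moment matrix."""
    z = max(x, 0.0)
    M = np.outer([1.0, x, z], [1.0, x, z])   # rank-one lifted moment matrix
    linear_ok = z >= 0 and z >= x
    complementarity_ok = abs(M[2, 2] - M[1, 2]) < tol   # z^2 == z*x
    psd_ok = np.all(np.linalg.eigvalsh(M) >= -tol)
    return linear_ok and complementarity_ok and psd_ok

# True ReLU pairs are always feasible, for any sign of the input.
assert all(relu_sdp_feasible(x) for x in np.linspace(-3, 3, 101))
```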





Weak Form Scientific Machine Learning: Test Function Construction for System Identification

Tran, April, Bortz, David

arXiv.org Artificial Intelligence

Weak form Scientific Machine Learning (WSciML) is a recently developed framework for data-driven modeling and scientific discovery. It leverages the weak form of equation error residuals to provide enhanced noise robustness in system identification: by convolving model equations with test functions, the problem is reformulated to avoid direct differentiation of the data. The performance, however, relies on a wise choice of a set of compactly supported test functions. In this work, we mathematically motivate a novel data-driven method for constructing Single-scale-Local reference functions from which the set of test functions is created. Our approach numerically approximates the integration error introduced by the quadrature and identifies the support size for which this error is minimal, without requiring access to the model parameter values. Through numerical experiments across various models, noise levels, and temporal resolutions, we demonstrate that the selected supports consistently align with regions of minimal parameter estimation error. We also compare the proposed method against the strategy for constructing Multi-scale-Global (and orthogonal) test functions introduced in our prior work, demonstrating improved computational efficiency.
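A minimal weak-form sketch (a toy model u' = -λu with an assumed test-function support, not the paper's data-driven support selection) shows how convolving with a compactly supported φ avoids differentiating noisy data: integrating by parts moves the derivative onto φ, giving ∫φ'u = λ∫φu.

```python
import numpy as np

# Recover lambda in u' = -lambda * u from noisy samples without differentiating
# the data: since phi vanishes at the ends of its support, integration by parts
# gives  integral(phi' * u) = lambda * integral(phi * u).
lam_true = 2.0
t = np.linspace(0.0, 3.0, 301)
rng = np.random.default_rng(0)
u = np.exp(-lam_true * t) + 0.01 * rng.normal(size=t.size)   # noisy data

a, b = 0.2, 1.2                     # assumed (not optimized) support of phi
inside = (t > a) & (t < b)
phi = np.where(inside, ((t - a) * (b - t)) ** 2, 0.0)
dphi = np.where(inside, 2 * (t - a) * (b - t) * ((b - t) - (t - a)), 0.0)

# Uniform grid, so the quadrature spacing cancels in the ratio.
lam_hat = np.sum(dphi * u) / np.sum(phi * u)
print(lam_hat)   # close to 2.0 despite the noise
```

The paper's contribution is choosing the support [a, b] automatically to minimize the quadrature-induced error; here it is fixed by hand for illustration.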