condition hold
A Corrections to the main paper
B Problem setup
In the course of preparing the supplementary materials we identified the following two mistakes. For the convenience of the reader we provide the full, corrected table below, where C is an appropriately chosen constant.

Table: corrected conditions on n, m, and γ under Frei et al. (2022), Xu & Gu (2023), and Theorems 3.1, 3.6, and 3.8.

The same mistake also affects the sentence starting on line 188 that begins "Comparing …".

In order to provide a convenient reference for the reader, we summarize our notation as follows. We typically resort to using a generically large enough constant C.

For the reader's convenience we recap the data model studied in this work. We assume test data are drawn mutually i.i.d. In regard to the initialization of the network weights, for convenience we place a simplifying assumption on each neuron's initialization.

To this end, we introduce the following notation, where $p \in \{-1, 1\}$. We have
$$\mathbb{P}\big((B < \kappa T) \cap (T > 0) \mid \langle w, v\rangle > 0\big) \;\ge\; 1 - \mathbb{P}\big(T = 0 \mid \langle w, v\rangle > 0\big) - \mathbb{P}\big(B \ge \kappa T \mid \langle w, v\rangle > 0\big),$$
therefore it suffices to upper bound the two probabilities on the right-hand side.

Using a variant of Hoeffding's bound for sampling without replacement (see Proposition …), … Based on Lemma B.2, the following lemma bounds the probability that … In terms of the counting functions, we write in particular that the two counting probabilities sum to $1/2$, and hence we conclude $p + q = 1/2$. Observe, by the data model described in Section B.2, that … We will often make use of the following similar but more pessimistic bounds on the activations.
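The probability decomposition displayed above is De Morgan's law followed by a union bound. A minimal derivation, under the reading (suggested by the surrounding counting argument) that $T$ is a nonnegative count, so that the complement of $\{T > 0\}$ is $\{T = 0\}$:

$$\mathbb{P}\big((B < \kappa T) \cap (T > 0) \mid \langle w, v\rangle > 0\big) = 1 - \mathbb{P}\big(\{B \ge \kappa T\} \cup \{T = 0\} \mid \langle w, v\rangle > 0\big) \ge 1 - \mathbb{P}\big(T = 0 \mid \langle w, v\rangle > 0\big) - \mathbb{P}\big(B \ge \kappa T \mid \langle w, v\rangle > 0\big).$$

The equality uses $\{(B < \kappa T) \cap (T > 0)\}^{c} = \{B \ge \kappa T\} \cup \{T = 0\}$, and the inequality is the union bound applied to the two events in that union.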
Appendix of " Complex-valued Neurons Can Learn More but Slower than Real-valued Neurons via Gradient Descent " A Preliminaries
In this section, we first summarize frequently used notations in the following table.

Table 4: Frequently used notations (columns: Notation, Description; entries include the constant C, …).

Lemma 7. Let d = 1. … Combining the cases above completes the proof.

Subsection B.2 proves several convergence rate lemmas. Subsection B.3 gives some technical lemmas. We are now ready to prove Theorem 1.

Proof of Theorem 1. …
Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks
Xuan Tang, Han Zhang, Yuan Cao, Difan Zou
Adam is a popular and widely used adaptive gradient method in deep learning, and it has received considerable attention in theoretical research. However, most existing theoretical work analyzes its full-batch version, which differs fundamentally from the stochastic variant used in practice: unlike SGD, stochastic Adam does not converge to its full-batch counterpart even with infinitesimal learning rates. We present the first theoretical characterization of how batch size affects Adam's generalization, analyzing two-layer over-parameterized CNNs on image data. Our results reveal that while both full-batch Adam and AdamW with proper weight decay $λ$ converge to solutions with poor test error, their mini-batch variants can achieve near-zero test error. We further prove that Adam has a strictly smaller effective weight-decay bound than AdamW, which theoretically explains why Adam requires more sensitive tuning of $λ$. Extensive experiments validate our findings, demonstrating the critical role of batch size and weight decay in Adam's generalization performance.
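To make the Adam versus AdamW distinction concrete, the sketch below contrasts the standard single-step update rules of the two optimizers and shows where the weight decay $λ$ enters each. This is a minimal illustration of the usual optimizers under common defaults, not the paper's exact analyzed variant (the analyzed setting of mini-batch sampling and parameterization may differ); the function names adam_step/adamw_step and the toy quadratic objective are assumptions made here for illustration.

```python
# Illustrative sketch: Adam (coupled weight decay) vs. AdamW (decoupled weight decay).
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.0):
    """One Adam step with coupled weight decay: lam * w is added to the gradient."""
    g = grad + lam * w                       # decay term enters the moment estimates
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, lam=0.0):
    """One AdamW step with decoupled weight decay: weights shrink by lr * lam * w."""
    m = beta1 * m + (1 - beta1) * grad       # moments see only the raw gradient
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + lam * w)
    return w, m, v

# Toy usage (assumed objective): both minimize 0.5 * ||w||^2, whose gradient is w.
rng = np.random.default_rng(0)
w0 = rng.normal(size=5)
w_a, m_a, v_a = w0.copy(), np.zeros(5), np.zeros(5)
w_w, m_w, v_w = w0.copy(), np.zeros(5), np.zeros(5)
for t in range(1, 201):
    w_a, m_a, v_a = adam_step(w_a, w_a, m_a, v_a, t, lr=0.01, lam=0.1)
    w_w, m_w, v_w = adamw_step(w_w, w_w, m_w, v_w, t, lr=0.01, lam=0.1)
print("Adam  :", np.round(w_a, 4))
print("AdamW :", np.round(w_w, 4))
```

In Adam the decay term $λ w$ is folded into the gradient and therefore also passes through the $1/\sqrt{v}$ normalization, whereas in AdamW the weights shrink directly by the learning rate times $λ w$. This coupling is why the same nominal $λ$ can act very differently in the two methods, which is the distinction behind the effective weight decay comparison above.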