Gradient Descent
Checklist
1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's [...]
Did you discuss any potential negative societal impacts of your work?
Did you state the full set of assumptions of all theoretical results?
If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The code will [...]
Did you specify all the training details (e.g., data splits, hyperparameters, how they [...]
Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
Did you include the total amount of compute and the type of resources used (e.g., type [...]
Did you include any new assets either in the supplemental material or as a URL? [N/A]
Did you discuss whether and how consent was obtained from people whose data you're [...]
If you used crowdsourcing or conducted research with human subjects...
(a) [...]

We trained the backdoored models for 100 epochs using Stochastic Gradient Descent (SGD) with an initial learning rate of 0.1 on CIFAR-10 and the ImageNet subset (0.01 on GTSRB) and a weight decay of [...]. The learning rate was divided by 10 at the 20th and the 70th epochs. The details of the backdoor triggers are summarized in Table 5. ASR: attack success rate; CA: clean accuracy.
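The step schedule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' training code: the function name `step_lr`, the zero-based epoch counter, and the convention that the drop takes effect from epoch 20 (and 70) onward are assumptions. Weight decay is left out because its value is not given in the text.

```python
def step_lr(epoch, base_lr=0.1, milestones=(20, 70), factor=0.1):
    """Learning rate for a given epoch under a step schedule.

    Assumptions (not stated in the text): epochs are counted from 0 and
    the rate is divided by 10 starting at each milestone epoch.
    """
    # Count how many milestones have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Schedule for the 100-epoch runs: 0.1 on CIFAR-10 and the ImageNet subset,
# 0.01 on GTSRB.
cifar_schedule = [step_lr(e, base_lr=0.1) for e in range(100)]
gtsrb_schedule = [step_lr(e, base_lr=0.01) for e in range(100)]
```

With these assumptions the rate takes three distinct values over the run: the initial value, one tenth of it after the first milestone, and one hundredth after the second.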
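And for any $\alpha > 0$, the Laplacian satisfies
$$\Delta G_\lambda \le \frac{c_1}{2\alpha} G_\lambda + 2Md, \quad \text{where } M = c_3^2 \alpha / c_1 + c_2^2.$$
2. If $G(x) = \|g(x)\|^2$ is $C^1$, then
$$\|\nabla G(x)\| \le c_2 \|g(x)\|, \qquad g(x)^\top \nabla G(x) \ge c_1 G(x).$$
Proof. Claim 1. Note $\nabla G_\lambda = 2 \nabla^2 F_\lambda\, g_\lambda$ and that $\tfrac{1}{2} c_1 I \preceq \nabla^2 F_\lambda = \sum^{m}$ [...]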
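The two inequalities in item 2 can be checked in one line each under assumptions that the proof fragment suggests but does not state in full here: that $g = \nabla F$, and that the Hessian is bounded as $\tfrac{1}{2} c_1 I \preceq \nabla^2 F \preceq \tfrac{1}{2} c_2 I$ (the upper bound is an assumption made for this sketch). The subscript $\lambda$ is dropped for readability.

```latex
\begin{align*}
\nabla G(x) &= 2\,\nabla^2 F(x)\, g(x), \\
\|\nabla G(x)\| &\le 2\,\|\nabla^2 F(x)\|\,\|g(x)\| \le c_2\,\|g(x)\|, \\
g(x)^\top \nabla G(x) &= 2\, g(x)^\top \nabla^2 F(x)\, g(x)
  \ge c_1\,\|g(x)\|^2 = c_1\, G(x).
\end{align*}
```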
As a remark, using the Krylov-Bogoliubov existence theorem (see Corollary 11.8 of [6]), fixed points of (4) exist as long as one can show that $\{\rho_t,\ t \ge 0\}$ is tight.

The learning rate is set differently for each task.

Obviously, the HV indicator (Eq. (10)) can also be used as an objective function for optimizing solution sets. For example, [25, 7] greedily add new points to obtain the highest expected HV improvement. However, the landscape of the HV indicator is piecewise constant (similar to the 0-1 loss in classification) and is difficult to optimize with gradient descent. In particular, for all the dominated points in the solution set, the gradient is zero.
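The zero-gradient behaviour for dominated points can be seen in a small numerical sketch. The code below is an illustration under stated assumptions, not the definition from Eq. (10): it assumes two objectives, both minimized, and a reference point `ref` that bounds the dominated region from above. The helper names `hypervolume_2d` and `finite_difference` are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (2D, minimization)."""
    # Keep only points that are strictly better than the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv = 0.0
    best_f2 = ref[1]
    # Sweep in increasing order of the first objective; a point adds area
    # only if it improves on the best second objective seen so far.
    for f1, f2 in pts:
        if f2 < best_f2:
            hv += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return hv


def finite_difference(points, index, ref, eps=1e-3):
    """Forward difference of HV when shifting one point by eps in both objectives."""
    shifted = list(points)
    f1, f2 = shifted[index]
    shifted[index] = (f1 + eps, f2 + eps)
    return (hypervolume_2d(shifted, ref) - hypervolume_2d(points, ref)) / eps


ref = (4.0, 4.0)
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
solution_set = front + [(3.0, 3.0)]  # the last point is dominated by (2.0, 2.0)

hv = hypervolume_2d(solution_set, ref)
grad_dominated = finite_difference(solution_set, 3, ref)
grad_nondominated = finite_difference(solution_set, 1, ref)
```

In this sketch, moving the dominated point leaves the hypervolume unchanged, so its finite difference is exactly zero, while moving a non-dominated point changes the value. A gradient-based optimizer therefore receives no signal for the dominated point.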
Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes
Sharma et al. (2022) provide [...] Yang et al. (2022a) integrate Local SGDA with stochastic gradient estimators to eliminate the [...] More recently, Zhang et al. (2023) adopt compressed momentum methods with Local SGD to increase the communication efficiency of the algorithm. For centralized nonconvex minimax problems, Yang et al. (2022b) show that, even in deterministic settings, GDA-based methods require timescale separation between the stepsizes for the primal and dual updates.
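To make the notion of separate primal and dual stepsizes concrete, here is a minimal sketch of simultaneous gradient descent ascent (GDA) with two stepsizes. The objective f(x, y) = 0.5 x^2 + 2 x y - 0.5 y^2 is a toy chosen for this illustration; it is strongly convex in x and strongly concave in y, so it does not reproduce the nonconvex setting or the necessity result cited above. The function name `gda` and the stepsize values are assumptions.

```python
def gda(x, y, eta_x, eta_y, steps):
    """Simultaneous GDA on the toy objective f(x, y) = 0.5*x**2 + 2*x*y - 0.5*y**2."""
    for _ in range(steps):
        grad_x = x + 2.0 * y   # partial derivative of f with respect to x
        grad_y = 2.0 * x - y   # partial derivative of f with respect to y
        # Descend in the primal variable, ascend in the dual variable,
        # each with its own stepsize.
        x, y = x - eta_x * grad_x, y + eta_y * grad_y
    return x, y


# The dual stepsize is ten times the primal one (an illustrative ratio).
x_final, y_final = gda(x=1.0, y=1.0, eta_x=0.01, eta_y=0.1, steps=1000)
```

On this toy problem the iterates contract toward the saddle point at the origin. The point of the sketch is only the form of the update, where `eta_x` and `eta_y` are tuned independently.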