
Contents of the Appendix

Neural Information Processing Systems

A.1 CIFAR-10 dataset. Figure 6 displays test accuracy curves for all six backbone algorithms under three distinct imbalance parameters, with values in {0.3, 1, 10}. The results clearly demonstrate that FedNAR outperforms the baselines, particularly in scenarios with imbalanced data. A.2 Shakespeare dataset. Figures 7 and 8 present the results of experiments on the Shakespeare dataset. Six backbone algorithms were used, with initial weight decay values selected from $\{10^{-3}, 10^{-4}\}$. These findings indicate that FedNAR, as an adaptive weight decay scheduling algorithm, is effective across various initial weight decay values.


A single gradient step finds adversarial examples on random two-layers neural networks

Neural Information Processing Systems

Daniely and Schacham [2020] recently showed that gradient descent finds adversarial examples on random undercomplete two-layers ReLU neural networks. The term "undercomplete" refers to the fact that their proof only holds when the number of neurons is a vanishing fraction of the ambient dimension. We extend their result to the overcomplete case, where the number of neurons is larger than the dimension (yet also subexponential in the dimension). In fact we prove that a single step of gradient descent suffices. We also show this result for any subexponential width random neural network with smooth activation function.
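The claim above can be illustrated numerically. The following is a minimal sketch, not the construction from the paper: the scaling of the weights, the input norm, the width and dimension, and in particular the step size `eta` (chosen so that a linearization of the network would flip its sign) are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def two_layer_relu(x, W, a):
    """f(x) = (1/sqrt(k)) * sum_i a_i * relu(w_i . x) for a width-k network."""
    return float(a @ np.maximum(W @ x, 0.0)) / np.sqrt(len(a))

def grad_two_layer_relu(x, W, a):
    """Gradient of f with respect to the input x."""
    active = (W @ x > 0.0).astype(float)  # ReLU derivative per neuron
    return (W.T @ (a * active)) / np.sqrt(len(a))

def single_gradient_step(x, W, a):
    """One gradient step with an assumed step size eta = 2 f(x) / ||grad||^2.

    Under a first-order approximation this maps f(x) to roughly -f(x).
    """
    f_x = two_layer_relu(x, W, a)
    g = grad_two_layer_relu(x, W, a)
    eta = 2.0 * f_x / (g @ g)
    return x - eta * g

rng = np.random.default_rng(0)
d, k = 500, 1000                                  # overcomplete: k > d (assumed sizes)
W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(k, d))  # random first-layer weights
a = rng.choice([-1.0, 1.0], size=k)               # random sign second layer
x = rng.normal(size=d)
x *= np.sqrt(d) / np.linalg.norm(x)               # input on the sphere of radius sqrt(d)

x_adv = single_gradient_step(x, W, a)
f_before = two_layer_relu(x, W, a)
f_after = two_layer_relu(x_adv, W, a)
relative_perturbation = np.linalg.norm(x_adv - x) / np.linalg.norm(x)

print("f(x) =", f_before, " f(x_adv) =", f_after)
print("relative perturbation size =", relative_perturbation)
```

In this sketch the perturbation is small relative to the input norm while the sign of the output changes, which is the qualitative behaviour the abstract describes.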



002262941c9edfd472a79298b2ac5e17-Supplemental-Conference.pdf

Neural Information Processing Systems

A.1 Proof Sketch. We first introduce the following lemma: Lemma 1. Lemma 2. For matrices $A, B \in M_n$, if $A \preceq B$, then we have $\lambda_{\min}(A) \le \lambda_{\min}(B)$ and $\lambda_{\max}(A) \le \lambda_{\max}(B)$, where $\lambda_{\max}(\cdot)$ (resp., $\lambda_{\min}(\cdot)$) denotes taking the maximum (resp., minimum) eigenvalue. Proof of Lemma 2. For any matrix $P \in M_n$ with $P^\top = P$, we have $\lambda_{\max}(P) = \max$ ... We first consider the condition number of $\hat{H}$ when $X$ is in a locally convex area. By equations 3 and 4, we have $M_1 \preceq H \preceq M_2$. Rearranging the terms yields $H - M_1 \succeq 0$ and $M_2 - H \succeq 0$. Therefore, for any vector $x \in \mathbb{R}^M$, we have ... We next consider the minimum singular value of $H$ and $\hat{H}$, with $\sigma_{\min}(H) = \sqrt{\lambda_{\min}(H^2)}$ and $\sigma_{\min}(\hat{H}) = \sqrt{\lambda_{\min}(\hat{H}^2)}$ in any case. Under Assumption 1 and equation 4, we have $H \preceq M_2$. Similarly, we can obtain an analogous relation between $H$ and $M_2$. By Lemma 2, we further have $\lambda_{\max}(H) \le \lambda_{\max}(M_2) = n_{\max}$ ...

C.1 $\|\nabla \hat{f}(\hat{X})\|^2$ vs. $\|\nabla f(X)\|^2$. In this section, we explain why we use $\|\nabla \hat{f}(\hat{X})\|^2$ rather than $\|\nabla f(X)\|^2$ to characterize the convergence rate. In general, it is hard to develop a convergence rate for objective values. However, when the global model is in a locally convex area of $f$, we can obtain the relationship between the gradient and the local optimum.
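Lemma 2 and the singular-value identity used above can be checked numerically. The sketch below is an illustration only: the matrices are random examples rather than the $H$, $M_1$, $M_2$ of the paper, and the names `A`, `B`, `lemma2_holds`, and `identity_holds` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric matrix A.
S = rng.normal(size=(n, n))
A = (S + S.T) / 2.0

# B = A + (positive semidefinite matrix), so that A ⪯ B in the Loewner order.
R = rng.normal(size=(n, n))
B = A + R @ R.T

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eig_A = np.linalg.eigvalsh(A)
eig_B = np.linalg.eigvalsh(B)

tol = 1e-12
lemma2_holds = bool(eig_A[0] <= eig_B[0] + tol and eig_A[-1] <= eig_B[-1] + tol)

# For a symmetric matrix, sigma_min(A) = sqrt(lambda_min(A^2)).
sigma_min = np.linalg.svd(A, compute_uv=False).min()
sqrt_lam_min_sq = np.sqrt(max(np.linalg.eigvalsh(A @ A)[0], 0.0))
identity_holds = bool(np.isclose(sigma_min, sqrt_lam_min_sq))

print("lambda_min(A) <= lambda_min(B) and lambda_max(A) <= lambda_max(B):", lemma2_holds)
print("sigma_min(A) == sqrt(lambda_min(A^2)):", identity_holds)
```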





A Theory of Formalisms for Representing Knowledge

arXiv.org Artificial Intelligence

There has been a longstanding dispute over which formalism is the best for representing knowledge in AI. The well-known "declarative vs. procedural controversy" is concerned with the choice of utilizing declarations or procedures as the primary mode of knowledge representation. The ongoing debate between symbolic AI and connectionist AI also revolves around the question of whether knowledge should be represented implicitly (e.g., as parametric knowledge in deep learning and large language models) or explicitly (e.g., as logical theories in traditional knowledge representation and reasoning). To address these issues, we propose a general framework to capture various knowledge representation formalisms in which we are interested. Within the framework, we find a family of universal knowledge representation formalisms, and prove that all universal formalisms are recursively isomorphic. Moreover, we show that all pairwise intertranslatable formalisms that admit the padding property are also recursively isomorphic. These imply that, up to an offline compilation, all universal (or natural and equally expressive) representation formalisms are in fact the same, which thus provides a partial answer to the aforementioned dispute.


Supplementary Material Outline

Neural Information Processing Systems

Such independent samples can be obtained by querying the SO at (x, y) three times. A.2 Technical Lemmas for Lipschitz Properties and Hessian Inverse Estimation. We first restate Lemma 2.2 of (Ghadimi and Wang, 2018) to characterize the smoothness properties of y ... Lemma A.1 Suppose Assumptions 3.3 and 3.4 hold. Throughout this section, we assume Assumptions 3.1, 3.2, 3.3, and 3.4 hold and the step-sizes follow (5) ... Therefore, under Assumption 3.3, for all t ≤ T, for all 1 ≤ j ≤ b, we have E[‖u ... B.2 Lemma B.2 and Its Proof. We quantify the convergence behavior of consensus errors under the choices of step-sizes (5) and (6) as follows. Lemma B.2 Suppose Assumptions 3.1, 3.2, 3.3, and 3.4 hold and the step-sizes satisfy ... Lemma B.3 Suppose Assumptions 3.1, 3.2, 3.3, and 3.4 hold. B.7 Proof of Theorem 5.1. Proof: We start our analysis by considering the term ‖ȳ ... Throughout this subsection, we assume Assumptions 3.1, 3.2, 3.3, 3.4, and 5.2 hold. C.1 Lemma C.1 and Its Proof. Lemma C.1 Suppose Assumptions 3.1, 3.2, 3.3, 3.4, and 5.2 hold and the objective F satisfies µ-PL Assumption 5.2 in addition.


A Proof of Theorem 1, A2, B1, B

Neural Information Processing Systems

A.1 Proof Sketch We first introduce the following lemma: Lemma 1. We first consider the condition number of Ĥ when X is in a locally convex area. In general, it is hard to develop a convergence rate for objective values. However, when the global model is in a locally convex area of f, we can obtain the relationship between the gradient and the local optimum. Theorem 4. When there is no parameter heat dispersion, and X is in a µ-strongly convex area of f We note that there is a difference between equation 18 and 21: for each client i, equation 18 involves all the parameters of the full model while equation 21 involves only partial parameters of the submodel, which causes a change in the lower bound of T (Y) and further leads to a change of conclusion.
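The relationship between the gradient and the local optimum mentioned above can be sketched with standard strong-convexity inequalities. This derivation is not taken from the paper: it assumes that $f$ is differentiable and $\mu$-strongly convex on a convex region containing both $X$ and a local minimizer, written here with the hypothetical symbol $X^\ast$, so that $\nabla f(X^\ast) = 0$.

```latex
\begin{align*}
\langle \nabla f(X) - \nabla f(X^\ast),\; X - X^\ast \rangle
  &\ge \mu \,\|X - X^\ast\|^2
  && \text{(strong convexity)} \\
\Longrightarrow \quad \|\nabla f(X)\|
  &\ge \mu \,\|X - X^\ast\|
  && \text{(Cauchy--Schwarz, } \nabla f(X^\ast) = 0\text{)} \\
f(X) - f(X^\ast)
  &\le \frac{1}{2\mu}\,\|\nabla f(X)\|^2
  && \text{(Polyak--\L{}ojasiewicz inequality)}
\end{align*}
```

Under these assumptions, a bound on the gradient norm therefore translates into a bound on both the distance to the local optimum and the objective gap.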