




In this section, we present detailed proofs for the theoretical derivation of Thm. 1, which aims to solve the following optimization problem: min

Neural Information Processing Systems

These assumptions are not strong and can be satisfied in most environments, including MuJoCo, Atari games, and so on. Let f be a Lebesgue integrable function, and let P and Q be two probability distributions. If |f| <= C, then

|E_{P(x)}[f(x)] - E_{Q(x)}[f(x)]| <= C * D_TV(P, Q).  (5)

Proof. Suppose there are two actions a1, a2 under state s, and let Q1(s, a1) = u and Q1(s, a2) = v. In this way, we can derive the upper bound of E_{a~pi2}[Q1(s, a)] - E_{a~pi1}[Q1(s, a)] as above. Since both sides of the above equation have the same minimum (here the minima are given by Qk = Q), we can replace the objective in Problem 2 with the upper bound in Eq. (10) and solve the relaxed optimization problem.
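The bound in Eq. (5) can be checked numerically. The sketch below uses random discrete distributions and the convention D_TV(P, Q) = sum_x |p(x) - q(x)| (some texts include a factor of 1/2, in which case the bound becomes 2C * D_TV); the specific distributions and f are arbitrary illustrations, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random discrete distributions P, Q on the same support,
# and a bounded test function f with |f| <= C.
p = rng.random(10); p /= p.sum()
q = rng.random(10); q /= q.sum()
C = 3.0
f = rng.uniform(-C, C, size=10)

gap = abs(np.dot(p, f) - np.dot(q, f))   # |E_P f - E_Q f|
d_tv = np.abs(p - q).sum()               # TV distance, convention sum |p - q|
assert gap <= C * d_tv
```

The inequality follows from |sum f (p - q)| <= sum |f| |p - q| <= C * sum |p - q|, so it holds for any choice of p, q, and bounded f.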


Theory-Inspired Path-Regularized Differential Network Architecture Search (Supplementary File)

Neural Information Processing Systems

Next, we also report the average gate activation probability in the normal and reduction cells in Figure 1 (b). At the beginning of the search, we initialize the activation probability of each gate to one. As in DARTS, we alternately update the network parameter W and the architecture parameter β via gradient descent, as detailed in Algorithm 1. When we compute the gradient ∇_β F_{B_train}(W, β), we ignore the second-order Hessian term to accelerate the computation, the same as in first-order DARTS. For brevity, we usually omit the notation (k) and i, and use X(l) to denote the output X_i(l) of any sample X_i (i = 1, ..., n) in the l-th layer at any iteration.
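The alternating first-order update can be sketched on a toy problem. The quadratic losses below are illustrative stand-ins for the paper's training and validation objectives, assumed only for this sketch: W is updated on the training loss, β on the validation loss, and the second-order term involving dW/dβ is simply dropped, as in first-order DARTS.

```python
import numpy as np

def train_loss(w, beta):
    # Toy inner objective: W is pulled toward beta.
    return (w - beta) ** 2

def val_loss(w, beta):
    # Toy outer objective: depends on beta directly, as the
    # validation loss depends on the architecture parameters.
    return (w * beta - 1.0) ** 2

def grad(fn, x, eps=1e-6):
    # Central finite difference, enough for this scalar sketch.
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

w, beta, lr = 0.0, 0.5, 0.1
for _ in range(1000):
    w    -= lr * grad(lambda v: train_loss(v, beta), w)    # step on W (train)
    beta -= lr * grad(lambda b: val_loss(w, b), beta)      # first-order step on beta (val)
```

Treating W as a constant inside the β step is exactly where the Hessian term is ignored; here the iterates still reach a point where both losses are near zero.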



Appendix to "Auxiliary Task Reweighting for Minimum-data Learning"

Anonymous Author(s)

Neural Information Processing Systems

First we remove the dependency on the integral by taking its lower bound and upper bound. This is the case when KL_α is large (see Figure 1a). This assumption holds as long as there is at least one task that is related to the main task (having a small KL_α), which is reasonable because if all the tasks are unrelated, then reweighting is also meaningless. Specifically, we find the results insensitive to the choice of β. Only 1000 out of 65392 images are labeled.
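One way to see why a related task can dominate for a range of β is a softmax weighting over negative divergences. This rule, the `kl` values, and the temperature `beta` are purely illustrative assumptions, not the paper's actual update.

```python
import numpy as np

def task_weights(kl, beta):
    """Hypothetical weighting: softmax over -KL / beta.

    kl   : per-task divergence estimates from the main task
    beta : temperature controlling how peaked the weights are
    """
    logits = -np.asarray(kl, dtype=float) / beta
    w = np.exp(logits - logits.max())  # subtract max for stability
    return w / w.sum()

kl = [0.2, 3.0, 5.0]  # one related task (small KL), two unrelated ones
for beta in (0.5, 1.0, 2.0):
    w = task_weights(kl, beta)
    # The related task keeps the largest weight across this range of beta,
    # consistent with results being insensitive to the exact choice of beta.
    assert w[0] == w.max()
```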



Visualizing the Emergence of Intermediate Visual Patterns in DNNs: Supplementary Material

Neural Information Processing Systems

The visualization results revealed the semantic similarity between categories. Furthermore, Figure 2 shows the projected sample feature g at different iterations of training. Therefore, the probability density of f not only depends on its orientation but also on its strength. In this way, {π, µ} were updated via the following E-step and M-step. This section provides more discussions on the quantification of knowledge points.
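The E-step/M-step pattern for updating {π, µ} can be sketched with a generic mixture model. The two-component 1-D Gaussian mixture with fixed unit variance below is an assumed stand-in for illustration only, not the paper's orientation-and-strength density model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two well-separated components.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

pi = np.array([0.5, 0.5])   # mixing weights
mu = np.array([-1.0, 1.0])  # component means

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2) * pi
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi and mu from the responsibilities.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
```

After a few dozen iterations the means land near the true cluster centers and the mixing weights stay normalized, which is the behavior the alternating E/M updates are designed to produce.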