Theory-Inspired Path-Regularized Differential Network Architecture Search (Supplementary File)

Neural Information Processing Systems 

Next, we also report the average gate activation probability in the normal and reduction cells in Figure 1 (b). At the beginning of the search, we initialize the activation probability of each gate to one. As in DARTS, we alternately update the network parameter W and the architecture parameter β via gradient descent, as detailed in Algorithm 1. When computing the gradient ∇_β F_{B_train}(W, β), we ignore the second-order Hessian term to accelerate computation, as in first-order DARTS. For brevity, we usually omit the superscript (k) and the subscript i, and use X^(l) to denote the output X_i^(l) of any sample X_i (i = 1, ..., n) at the l-th layer at any iteration.
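The alternating first-order update above can be sketched on a toy problem. This is a minimal illustration, not the paper's implementation: the quadratic losses, learning rates, and variable names below are all hypothetical stand-ins for F_{B_train} and the network/architecture parameters, and the key point is that the β update treats W as a constant, dropping the second-order Hessian term as in first-order DARTS.

```python
import numpy as np

# Hypothetical toy loss gradients standing in for F_{B_train}(W, beta).
def grad_W(W, beta):
    # d/dW of (W - beta)^2, holding beta fixed.
    return 2.0 * (W - beta)

def grad_beta(W, beta):
    # First-order approximation: treat W as a constant when
    # differentiating w.r.t. beta, i.e. ignore the Hessian term
    # that would come from W's dependence on beta.
    # Includes a small illustrative penalty on beta itself.
    return 2.0 * (beta - W) + 0.1 * beta

W, beta = 5.0, 1.0          # arbitrary initial values
lr_w, lr_beta = 0.1, 0.05   # separate step sizes for the two updates

for _ in range(2000):
    # Alternately update W and beta via gradient descent.
    W = W - lr_w * grad_W(W, beta)
    beta = beta - lr_beta * grad_beta(W, beta)
```

Under these toy losses both variables drift toward the joint stationary point (here the origin, since the penalty pulls β toward zero and the coupling pulls W toward β); in the actual algorithm each step would instead be a minibatch stochastic-gradient update of the supernet weights and the gate/architecture parameters.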
