TNASP: A Transformer-based NAS Predictor with a Self-evolution Framework - Supplementary Materials
When replacing our Transformer with GCN, we obtain models almost the same as the ones applied in NP (GCN) [14], which are obviously worse than our method. Moreover, our method did not explicitly put the validation data into the training dataset as the pseudo label technique did. The comparisons with other methods are summarized in Tab. 8 and we visualize our searched architectures in Sec. C.2. In the MobileNet-like search space, we retrain the searched architecture for 240 epochs with batch size 1024 on 8 NVIDIA V100 GPUs.
Our gains are indeed large. EvoNorm-S0 is the state-of-the-art in the small batch size regime (Table 4), outperforming BN-ReLU by 7.8% on ResNet-50 and by 7.3% on MobileNetV2. We achieve clear gains over other influential works such as GroupNorm (GN). We'd also like to emphasize that EvoNorms beat BN-ReLU on 12 (out of 14) different classification models/training ... These are significant considering the predominance of BN-ReLU in ML models. R3: "the overall search algorithm lacks some novelty." ... "yet another AutoML paper" (with the expectation that some fancy search algorithms must be proposed), but rather under ... R2, R4: Can EvoNorms generalize to deeper variants (e.g., ResNet-101) and architecture families not included ... MnasNet, EfficientNet-B5, Mask R-CNN + FPN/SpineNet and BigGAN; none of them was used during search.
Achieving Near-Optimal Convergence for Distributed Minimax Optimization with Adaptive Stepsizes
Sharma et al. (2022) provide ... Yang et al. (2022a) integrate Local SGDA with stochastic gradient estimators to eliminate the ... More recently, Zhang et al. (2023) adopt compressed momentum methods with Local SGD to increase the communication efficiency of the algorithm. For centralized nonconvex minimax problems, Yang et al. (2022b) show that, even in deterministic settings, GDA-based methods necessitate the timescale separation of the stepsizes for primal and dual updates.
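The timescale-separation requirement can be illustrated with a minimal sketch. The objective below is a toy quadratic chosen for illustration and is an assumption, not taken from the cited works; the function name `run_gda`, the constant `L`, and the stepsize values are likewise hypothetical. For f(x, y) = -y^2/2 + L*x*y - L^2*x^2/2, the residual u = y - L*x is multiplied by (1 - eta_y + eta_x*L^2) at every simultaneous GDA step, so the iterates approach a stationary point only when the primal stepsize eta_x is sufficiently smaller than the dual stepsize eta_y (roughly eta_x < eta_y / L^2).

```python
def run_gda(eta_x, eta_y, L=2.0, steps=200, x0=1.0, y0=0.0):
    """Simultaneous gradient descent-ascent on a toy quadratic (illustrative only).

    f(x, y) = -0.5*y**2 + L*x*y - 0.5*L**2*x**2
    x is the primal (minimizing) variable, y is the dual (maximizing) variable.
    Returns |y - L*x|, which equals |df/dy| and is zero exactly at stationary points.
    """
    x, y = x0, y0
    for _ in range(steps):
        grad_x = L * y - L**2 * x   # partial derivative of f with respect to x
        grad_y = L * x - y          # partial derivative of f with respect to y
        # Descent step on x and ascent step on y, both using the old iterate.
        x, y = x - eta_x * grad_x, y + eta_y * grad_y
    return abs(y - L * x)


# Separated timescales: eta_x is well below eta_y / L**2, so the residual contracts.
separated = run_gda(eta_x=0.01, eta_y=0.1)

# Equal stepsizes: the contraction factor exceeds 1, so the residual grows.
equal = run_gda(eta_x=0.1, eta_y=0.1)

print(f"residual with separated stepsizes: {separated:.3e}")
print(f"residual with equal stepsizes:     {equal:.3e}")
```

With these hypothetical values, the separated run contracts the residual by a factor of 0.94 per step while the equal-stepsize run expands it by 1.3 per step. This shows only the deterministic, single-machine effect; it does not model the distributed or adaptive-stepsize aspects discussed in the snippet.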