On the Ideal Number of Groups for Isometric Gradient Propagation
Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Sang Woo Kim
arXiv.org Artificial Intelligence
Recently, various normalization layers have been proposed to stabilize the training of deep neural networks. Among them, group normalization is a generalization of layer normalization and instance normalization by allowing a degree of freedom in the number of groups it uses. However, to determine the optimal number of groups, trial-and-error-based hyperparameter tuning is required, and such experiments are time-consuming. In this study, we discuss a reasonable method for setting the number of groups.

These normalization layers behave similarly in that they apply mean and standard deviation (std) normalization and an affine transform. The difference lies in the units used for computing the mean and std. For example, for n features, layer normalization computes a single mean and std for normalization, whereas instance normalization computes n means and stds. Meanwhile, group normalization partitions the n features into G groups to compute G means and stds. From this perspective, layer normalization is a special case of group normalization for G = 1, and instance normalization is a special case of group normalization for G = n. Thus, group normalization is more comprehensive and has a degree of freedom from the setting of the number of groups.
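The relationship described above can be sketched in code. The following is a minimal NumPy illustration (not the paper's implementation): it omits the learned affine transform, and the function name `group_norm` and the `(batch, channels, spatial)` layout are assumptions for the example. Setting the number of groups to 1 reproduces layer normalization, and setting it to the channel count reproduces instance normalization.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # Illustrative sketch: x has shape (batch, channels, spatial).
    # The channels are partitioned into num_groups groups, and one
    # mean/std pair is computed per (sample, group).
    b, c, s = x.shape
    assert c % num_groups == 0, "num_groups must divide the channel count"
    g = x.reshape(b, num_groups, (c // num_groups) * s)
    mean = g.mean(axis=2, keepdims=True)
    std = g.std(axis=2, keepdims=True)
    return ((g - mean) / (std + eps)).reshape(b, c, s)

x = np.random.randn(2, 8, 16)
ln = group_norm(x, 1)      # G = 1: one mean/std per sample  -> layer norm
inorm = group_norm(x, 8)   # G = n: one mean/std per channel -> instance norm
gn = group_norm(x, 4)      # 1 < G < n: intermediate grouping
```

Intermediate values of G interpolate between the two extremes, which is the degree of freedom whose ideal setting the paper studies.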
Feb-6-2023