Large-batchOptimizationforDenseVisualPredictions
–Neural Information Processing Systems
At thet-th backward propagation step, we can derive the gradient il(wt)toupdatei-th module inM. The number in the bracket represents the batch size. We see that when the batch size is small (i.e.,32), the gradientvariancesaresimilar. N and K indicate the number of FPN levels and region proposals fed into the detection head. To evaluate this assumption, as shown in Figure 1, we have three observations. As illustrated by the second figure in Figure 1, the gradient misalignment phenomenon between detection head and backbone has been reduced.
Neural Information Processing Systems
Feb-19-2026, 06:21:58 GMT
- Technology: