Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Neural Information Processing Systems

In modern deep learning, it is common to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments with SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing the network to more well-conditioned areas of the loss landscape. The ability to handle larger target learning rates in turn makes hyperparameter tuning more robust while improving the final performance of the network. We uncover different regimes of operation during the warmup period, depending on whether the network training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam, which provides benefits similar to warmup.
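The linear warmup schedule the abstract refers to can be sketched in a few lines; this is a generic illustration of the standard schedule (interpolating from $\eta_{\text{init}}$ to $\eta_{\text{trgt}}$ over a fixed number of steps, then holding), not code from the paper:

```python
def warmup_lr(step, warmup_steps, eta_trgt, eta_init=0.0):
    """Linear learning-rate warmup: interpolate from eta_init to eta_trgt
    over warmup_steps optimizer steps, then hold at eta_trgt."""
    if step >= warmup_steps:
        return eta_trgt
    frac = step / warmup_steps  # fraction of warmup completed
    return eta_init + frac * (eta_trgt - eta_init)

# Example: 100-step warmup to a target learning rate of 0.1
schedule = [warmup_lr(t, warmup_steps=100, eta_trgt=0.1) for t in range(200)]
```

In practice the returned value would be written into the optimizer's learning-rate field at each step; the paper's contribution concerns how $\eta_{\text{init}}$ and the warmup length should be chosen, not the interpolation itself.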


Adaptive Slimming for Scalable and Efficient Speech Enhancement

Miccini, Riccardo, Kim, Minje, Laroche, Clément, Pezzarossa, Luca, Smaragdis, Paris

arXiv.org Artificial Intelligence

Speech enhancement (SE) enables robust speech recognition, real-time communication, hearing aids, and other applications where speech quality is crucial. However, deploying such systems on resource-constrained devices involves choosing a static trade-off between performance and computational efficiency. In this paper, we introduce dynamic slimming to DEMUCS, a popular SE architecture, making it scalable and input-adaptive. Slimming lets the model operate at different utilization factors (UF), each corresponding to a different performance/efficiency trade-off, effectively mimicking multiple model sizes without the extra storage costs. In addition, a router subnet, trained end-to-end with the backbone, determines the optimal UF for the current input. Thus, the system saves resources by adaptively selecting smaller UFs when additional complexity is unnecessary. We show that our solution is Pareto-optimal against individual UFs, confirming the benefits of dynamic routing. When training the proposed dynamically-slimmable model to use 10% of its capacity on average, we obtain the same or better speech quality as the equivalent static 25% utilization while reducing MACs by 29%.
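The core slimming idea, that one stored weight matrix can mimic several smaller models by using only a leading fraction of its units, can be illustrated with a toy NumPy linear layer. This is an assumption-laden sketch of the general mechanism, not the DEMUCS-based architecture or the router subnet from the paper:

```python
import numpy as np

def slimmable_linear(x, W, b, uf):
    """Apply a linear layer at utilization factor uf in (0, 1]:
    only the first ceil(uf * out_features) output units are computed,
    so a single stored weight matrix serves every model size."""
    k = max(1, int(np.ceil(uf * W.shape[0])))
    return x @ W[:k].T + b[:k]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # full-width weights: 8 outputs, 4 inputs
b = np.zeros(8)
x = rng.normal(size=(2, 4))   # batch of 2 inputs

y_full = slimmable_linear(x, W, b, uf=1.0)   # shape (2, 8): full capacity
y_slim = slimmable_linear(x, W, b, uf=0.25)  # shape (2, 2): 25% utilization
```

In the paper, a learned router chooses the utilization factor per input so that easy inputs take the cheap path; here the slim output is by construction just the leading slice of the full output.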



Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Neural Information Processing Systems

Proposition 1. Suppose that any signal can be uniquely reconstructed from the set of all its possible observations (its total observation). The total observation loss is defined in Equation 4 below. After introducing some notation, we formalize the assumptions made in the proposition.

Definition 2. We define the scattering map as the (measurable) map sending a signal to its total observation. In other words, given all possible observations of a signal, we can uniquely reconstruct the signal (for the class of signals under consideration).

Observations generated by our model are slices of total observations. Thus, our model is limited to modeling distributions over observations that are members of the total-observations set. The predicted distribution over signals can then be recovered from the distribution over observations.




Why Warmup the Learning Rate? Underlying Mechanisms and Improvements

Kalra, Dayal Singh, Barkeshli, Maissam

arXiv.org Machine Learning

It is common in deep learning to warm up the learning rate $\eta$, often by a linear schedule between $\eta_{\text{init}} = 0$ and a predetermined target $\eta_{\text{trgt}}$. In this paper, we show through systematic experiments using SGD and Adam that the overwhelming benefit of warmup arises from allowing the network to tolerate larger $\eta_{\text{trgt}}$ by forcing the network to more well-conditioned areas of the loss landscape. The ability to handle larger $\eta_{\text{trgt}}$ makes hyperparameter tuning more robust while improving the final performance. We uncover different regimes of operation during the warmup period, depending on whether training starts off in a progressive sharpening or sharpness reduction phase, which in turn depends on the initialization and parameterization. Using these insights, we show how $\eta_{\text{init}}$ can be properly chosen by utilizing the loss catapult mechanism, which saves on the number of warmup steps, in some cases completely eliminating the need for warmup. We also suggest an initialization for the variance in Adam which provides benefits similar to warmup.