Penalising the biases in norm regularisation enforces sparsity
Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, the relation between regularising the parameters' norm and the obtained estimators remains theoretically misunderstood. For networks with one hidden ReLU layer and unidimensional data, this work shows that the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of the bias terms is not regularised. The presence of this additional weighting factor is of utmost significance, as it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the bias' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.
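Schematically, the weighted representational cost described in the abstract can be written as follows (this is a sketch of the stated result, omitting boundary terms and writing $|f''|$ for the total-variation measure of the distributional second derivative):

```latex
% Minimal parameter norm needed to represent f with a one-hidden-layer
% ReLU network on unidimensional data (schematic form):
R(f) \;=\; \int_{\mathbb{R}} \sqrt{1+x^2}\,\mathrm{d}|f''|(x).
```

When the biases' norm is left unregularised, the weight $\sqrt{1+x^2}$ is replaced by a constant, leaving the plain total variation of $f'$, which is why sparsity of the minimal norm interpolator is no longer enforced.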
Parameter tuning and model selection in Optimal Transport with semi-dual Brenier formulation
Over the past few years, numerous computational models have been developed to solve Optimal Transport (OT) in a stochastic setting, where distributions are represented by samples and where the goal is to find the closest map to the ground truth OT map, unknown in practical settings. So far, no quantitative criterion has been put forward to tune the parameters of these models and select the maps that best approximate the ground truth. To perform this task, we propose to leverage the Brenier formulation of OT. Theoretically, we show that this formulation guarantees that, up to a sharp distortion parameter depending on the smoothness/strong convexity constants and a statistical deviation term, the selected map achieves the lowest quadratic error to the ground truth. This criterion, estimated via convex optimization, enables parameter tuning and model selection among entropic regularization of OT, input convex neural networks, and smooth and strongly convex nearest-Brenier (SSNB) models. We also use this criterion to question the use of OT in Domain Adaptation (DA). In a standard DA experiment, it enables us to identify the potential that is closest to the true OT map between the source and the target. Yet, we observe that this selected potential is far from being the one that performs best for the downstream transfer classification task.
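The semi-dual Brenier criterion can be sketched as follows. This is a toy illustration, not the paper's estimator: the convex conjugate is approximated by a max over the source samples, and the candidate potentials are simple quadratics chosen for the example.

```python
import numpy as np

def semi_dual_value(f_vals, x_src, y_tgt):
    """Monte-Carlo estimate of the semi-dual objective
    E_mu[f(X)] + E_nu[f*(Y)], with the convex conjugate f*
    approximated by a max over the source samples."""
    # f*(y) ~ max_i <x_i, y> - f(x_i)
    conj = np.max(y_tgt @ x_src.T - f_vals[None, :], axis=1)
    return f_vals.mean() + conj.mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))                    # source samples
y = 2.0 * x + 0.1 * rng.normal(size=(200, 2))    # target ~ pushforward by T(x) = 2x

# Candidate potentials f_s(x) = (s/2)|x|^2, whose Brenier maps are x -> s*x.
candidates = {s: 0.5 * s * (x ** 2).sum(axis=1) for s in [0.5, 2.0, 8.0]}

# Model selection: keep the potential with the smallest semi-dual value.
best = min(candidates, key=lambda s: semi_dual_value(candidates[s], x, y))
```

Since the true map here is (close to) `x -> 2x`, the criterion selects `s = 2.0`, illustrating how the semi-dual value ranks candidate maps by closeness to the ground truth.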
Make Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning
Parameter-efficient fine-tuning (PEFT) of pre-trained language models (PLMs) has emerged as a highly successful approach, training only a small number of parameters without sacrificing performance and becoming the de facto learning paradigm as PLMs grow in size. However, existing PEFT methods are not memory-efficient, because they still require caching most of the intermediate activations for the gradient calculation, akin to full fine-tuning. One effective way to reduce the activation memory is to apply a reversible model, so that the intermediate activations need not be cached and can instead be recomputed. Nevertheless, modifying a PLM into its reversible variant is not straightforward, since a reversible model has a distinct architecture from the currently released PLMs. In this paper, we first investigate what is a key factor in the success of existing PEFT methods, and find that it is essential to preserve the PLM's starting point when initializing a PEFT method.
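The memory saving from reversibility comes from being able to reconstruct a block's inputs from its outputs. A minimal sketch of a reversible coupling block (in the style of RevNets; the functions `F` and `G` are arbitrary stand-ins, not the paper's layers):

```python
import numpy as np

def F(x): return np.tanh(x)     # stand-in sub-network
def G(x): return 0.5 * x        # stand-in sub-network

def forward(x1, x2):
    # Coupling structure: each output mixes in a function of the other stream.
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    # Inputs are reconstructed exactly from the outputs, so intermediate
    # activations need not be cached for the backward pass.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
r1, r2 = inverse(*forward(x1, x2))
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```

The inverse works for any `F` and `G`, which is what lets a reversible network trade activation memory for recomputation time.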
TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning
Efficient on-device learning requires a small memory footprint at training time to fit within tight memory constraints. Existing work solves this problem by reducing the number of trainable parameters. However, this does not directly translate into memory savings, since the major bottleneck is the activations, not the parameters.
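A back-of-envelope count makes the activation bottleneck concrete (illustrative numbers, not from the paper): for a single 3x3 convolution, the cached input activations dwarf the trainable weights.

```python
# One 3x3 convolution, 64 -> 64 channels, on a 224x224 feature map, batch 8.
batch, c_in, c_out, k, h, w = 8, 64, 64, 3, 224, 224

param_floats = k * k * c_in * c_out        # trainable weights: 3*3*64*64
activation_floats = batch * c_in * h * w   # input activations cached for backprop

print(param_floats)        # 36864
print(activation_floats)   # 25690112
```

Here the activations are roughly 700x larger than the weights, so shrinking the trainable parameter count barely moves the training-time memory footprint.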
Reviews: A Unified Approach for Learning the Parameters of Sum-Product Networks
The single contribution of the paper that is relevant in practice is an alternative derivation of an existing method (Expectation Maximization for learning SPN weights). While this is an interesting result, I think that it alone does not warrant publication in NIPS, since it is hard to imagine how it could contribute to a better theoretical understanding or to practical applications of SPNs. The interpretation of SPNs as mixtures of tree-structured SPNs, which is reported as a novelty by the authors, was actually first derived in [Dennis and Ventura, Greedy Structure Search for Sum-Product Networks, 2015]. The paper is overall well written and clearly structured, and the derivation of the results is really interesting. My main concern, as detailed above, is that in my opinion the potential impact of this paper is low, and the novelty is also somewhat limited: the interpretation of SPNs as mixtures of trees was already given in [Dennis and Ventura, 2015], and this is basically just an alternative derivation of EM.
Reviews: Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks
If so, I am confused why this is highlighted as a virtue of adding noise, since the purely deterministic dynamics of GD also evince this behavior. Numerical experiments: these are slightly hard to interpret. First, which plots show SGD dynamics, and which are for GD? Second, I am puzzled by how to interpret the dotted lines in each plot. In the case of RBF, how are we to make sense of the empirical $n^{-2}$ decay? Is this somehow predicted by the analysis of GD, or is it an empirical phenomenon that is not theoretically addressed in this work?
Tuning a Random Forest Model
A month back, I participated in a Kaggle competition called TFI. I started with my first submission at the 50th percentile. Having worked relentlessly on feature engineering for more than two weeks, I managed to reach the 20th percentile. To my surprise, right after tuning the parameters of the machine learning algorithm I was using, I was able to breach the top 10th percentile. That is how important tuning these machine learning algorithms is.
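The tuning step described above can be sketched with scikit-learn's `GridSearchCV` (a minimal example on synthetic data, not the competition code; the parameter grid shown is just a common starting point):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in for the competition dataset.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Cross-validated search over a few parameters that usually matter most.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [50, 200],
        "max_depth": [None, 10],
        "max_features": ["sqrt", 1.0],
    },
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_params_` then gives the winning combination, and `grid.best_estimator_` is refit on the full training set and ready for predictions.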
Machine learning has been successfully applied to demand planning, and leading suppliers of supply chain planning software are now beginning to use machine learning to improve production planning. But architecturally and culturally, this is a much tougher problem than applying machine learning to demand planning. In the $2 billion-plus supply chain planning market, ARC Advisory Group's latest market study shows production planning to be a critical SCP application, representing over 25 percent of the total market. Production planning applications are used for everything from planning daily production at a single factory to creating weekly or monthly plans that divvy up the production tasks to be accomplished across multiple factories. Machine learning is a form of continuous improvement.
An overview of gradient descent optimization algorithms
Note: if you are looking for a review paper, this blog post is also available as an article on arXiv. Update: added derivations of AdaMax and Nadam. Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent. These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.
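As a baseline for the variants the post surveys (momentum, AdaMax, Nadam, ...), vanilla gradient descent on a least-squares objective can be written in a few lines (a self-contained sketch on synthetic data):

```python
import numpy as np

# Vanilla gradient descent on f(w) = ||Xw - y||^2 / (2n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1                      # step size (learning rate)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the objective
    w -= lr * grad                       # the update all variants build on
```

Every algorithm in the post modifies this single update line: momentum accumulates past gradients, while adaptive methods such as Adam rescale the step per coordinate.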