# hyperparameter

### How to version control your production machine learning models | Algorithmia Blog

Machine learning is about rapid experimentation and iteration, and without keeping track of your modeling history you won't be able to learn much. Versioning lets you keep track of all of your models, how well they've done, and what hyperparameters you used to get there. This post will walk through why data versioning is important, the tools for doing it, and how to version the models that go into production. If you've spent time working with machine learning, one thing is clear: it's an iterative process. There are so many different parts of your model--how you use your data, hyperparameters, parameters, algorithm choice, architecture--and the optimal combination of all of those is the holy grail of machine learning.
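As a rough illustration of tracking models together with their hyperparameters and results, the sketch below derives a version id from a hash of the settings and stores it next to the metrics. The function names, the hyperparameter names, and the accuracy values are all hypothetical placeholders, not taken from the post or from any particular tool.

```python
import hashlib
import json


def version_record(model_name, hyperparams, metrics):
    """Build a record whose id is derived from the model name and hyperparameters."""
    # Sort keys so the same settings always serialize to the same string.
    payload = json.dumps({"model": model_name, "hyperparams": hyperparams}, sort_keys=True)
    version_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return {
        "version": version_id,
        "model": model_name,
        "hyperparams": hyperparams,
        "metrics": metrics,
    }


def best_version(records, metric):
    """Return the record with the highest value of the given metric."""
    return max(records, key=lambda r: r["metrics"][metric])


# Placeholder history: the metric values are made up for illustration only.
history = [
    version_record("classifier", {"learning_rate": 0.1, "max_depth": 3}, {"accuracy": 0.81}),
    version_record("classifier", {"learning_rate": 0.01, "max_depth": 5}, {"accuracy": 0.86}),
]
best = best_version(history, "accuracy")
print(best["version"], best["hyperparams"])
```

Because the id depends only on the model name and settings, rerunning the same configuration maps to the same version, which makes repeated experiments easy to spot.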

### Why you should do Feature Engineering first, Hyperparameter Tuning second as a Data Scientist

In fact, the realization that feature engineering is more important than hyperparameter tuning came to me as a lesson -- an awakening and vital lesson -- that drastically changed how I approached problems and handled data even before building any machine learning models. When I started my first full-time job as a research engineer in machine learning, I was so excited and obsessed with building fancy machine learning models that I paid little attention to the data I had. As a matter of fact, I was impatient. I wanted results so badly that I only cared about squeezing every single percent of performance out of my model. Needless to say, I failed after many attempts and wondered why.

### Deeper into Deep Neural Networks

In the previous blog, we talked about how Autoencoders in Keras can help us innovate and solve problems that do not appear solvable even after increasing hidden layers and training time. In that blog, we learned that different model architectures, conjured through different insights, can help us solve a problem statement elegantly and with much less complexity than blindly stacking up layer after layer. We saw this with a Computer Vision research example, an Automated De-Blurring problem. Starting with a vanilla Convolutional Neural Network, we ended up witnessing how, by being smart and innovative and by using our insights and instincts, we could solve what seemed like an unsolvable problem, which was made possible by the flexibility of the Functional API in Keras. We called adding layers and increasing training time the 'go to' thing to do.

### How to Implement Bayesian Optimization from Scratch in Python

Many methods exist for function optimization, such as randomly sampling the variable search space, called random search, or systematically evaluating samples in a grid across the search space, called grid search. More principled methods are able to learn from sampling the space so that future samples are directed toward the parts of the search space that are most likely to contain the extrema. A directed approach to global optimization that uses probability is called Bayesian Optimization.
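To make the two baseline strategies concrete, here is a minimal sketch of grid search and random search on a one-dimensional toy function. The objective, the search bounds, and the function names are assumptions chosen for illustration; this is not the tutorial's own code, and it does not implement Bayesian Optimization.

```python
import random


def objective(x):
    """Toy function to maximize; its peak is at x = 0.3."""
    return -(x - 0.3) ** 2


def grid_search(func, low, high, num_points):
    """Evaluate evenly spaced points in [low, high] and return the best one."""
    step = (high - low) / (num_points - 1)
    candidates = [low + i * step for i in range(num_points)]
    return max(candidates, key=func)


def random_search(func, low, high, num_samples, seed=0):
    """Evaluate uniformly random points in [low, high] and return the best one."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(num_samples)]
    return max(candidates, key=func)


best_grid = grid_search(objective, 0.0, 1.0, num_points=11)
best_random = random_search(objective, 0.0, 1.0, num_samples=11)
print(best_grid, best_random)
```

Neither routine uses the results of earlier evaluations to choose later ones, which is the gap that directed methods such as Bayesian Optimization aim to close.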

### Key-point detection in flower images using deep learning

You might ask: why 3 convolutional layers? Or why 2 convolutional blocks? We included these numbers as hyperparameters in a hyperparameter search. Together with other hyperparameters, such as the number of dense layers, the dropout level, batch normalization and the number of convolutional filters, we ran a randomized search to find the optimal combination.
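A randomized search over architecture choices like these can be sketched as follows. The search space values and the `dummy_evaluate` scoring function are hypothetical stand-ins: the excerpt does not give the authors' ranges, and a real run would train and validate a network for each sampled configuration.

```python
import random

# Hypothetical search space; the value ranges are placeholders, not the authors' settings.
SEARCH_SPACE = {
    "conv_blocks": [1, 2, 3],
    "conv_layers_per_block": [1, 2, 3],
    "conv_filters": [16, 32, 64],
    "dense_layers": [1, 2],
    "dropout": [0.0, 0.25, 0.5],
    "batch_norm": [True, False],
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


def randomized_search(space, evaluate, num_trials, seed=0):
    """Score num_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def dummy_evaluate(config):
    """Stand-in for training and validating a model; not a real metric."""
    return -abs(config["conv_blocks"] - 2) - config["dropout"]


best_config, best_score = randomized_search(SEARCH_SPACE, dummy_evaluate, num_trials=20)
print(best_config, best_score)
```

Treating the number of layers and blocks as entries in the same dictionary as dropout or filter count is what lets one search loop cover both architecture and regularization choices.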

### #012A Building a Deep Neural Network | Master Data Science

In this post we will look at the building blocks of a deep neural network. We will pick one layer, for example layer $$l$$, and focus on the computations for that layer. In the forward pass for layer $$l$$, the input is the activations from the previous layer, $$a^{[l-1]}$$, and the output is the activations of the current layer, $$a^{[l]}$$. It is good to cache the value of $$z^{[l]}$$ for the calculations in the backward pass. In the backward pass, the input is $$da^{[l]}$$ and the output is $$da^{[l-1]}$$, as presented in the following graph.
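The forward and backward computations for a single layer can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: a ReLU activation, examples stored as columns, and gradients averaged over the $$m$$ examples. None of these choices is specified in the excerpt, and the function names are illustrative.

```python
import numpy as np


def layer_forward(a_prev, W, b):
    """Forward pass for one layer: returns a^[l] and a cache for the backward pass."""
    z = W @ a_prev + b          # z^[l] = W^[l] a^[l-1] + b^[l]
    a = np.maximum(0.0, z)      # ReLU activation (an assumption for this sketch)
    cache = (a_prev, W, z)      # z^[l] is cached because the backward pass needs it
    return a, cache


def layer_backward(da, cache):
    """Backward pass for one layer: takes da^[l], returns da^[l-1], dW^[l], db^[l]."""
    a_prev, W, z = cache
    m = a_prev.shape[1]         # number of examples (columns)
    dz = da * (z > 0)           # ReLU derivative is 1 where z > 0, else 0
    dW = (dz @ a_prev.T) / m    # averaged over examples (assumed convention)
    db = np.sum(dz, axis=1, keepdims=True) / m
    da_prev = W.T @ dz
    return da_prev, dW, db


rng = np.random.default_rng(0)
a_prev = rng.standard_normal((4, 5))    # 4 units in layer l-1, 5 examples
W = rng.standard_normal((3, 4))         # 3 units in layer l
b = np.zeros((3, 1))

a, cache = layer_forward(a_prev, W, b)
da_prev, dW, db = layer_backward(np.ones_like(a), cache)
print(a.shape, da_prev.shape, dW.shape, db.shape)
```

Each gradient has the same shape as the quantity it corresponds to, which is a quick sanity check when chaining several layers together.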

### Supervised learning explained

Machine learning is a branch of artificial intelligence that includes algorithms for automatically creating models from data. At a high level, there are four kinds of machine learning: supervised learning, unsupervised learning, reinforcement learning, and active machine learning. Since reinforcement learning and active machine learning are relatively new, they are sometimes omitted from lists of this kind. You could also add semi-supervised learning to the list, and not be wrong. Supervised learning starts with training data that are tagged with the correct answers (target values).

### Automatic Classification of Sexual Harassment Cases

In our case, the data was provided by Safecity India, a platform launched in 2012 that crowdsources personal stories of sexual harassment and abuse in public spaces [2]. They have collected over 10,000 stories from over 50 cities in India, Kenya, Cameroon, and Nepal. More specifically, they provided us with a .csv file. In addition to the focal tasks of this project, and as part of the NLP channel, we decided to automate the category classification based on the sexual harassment case descriptions. Performing this classification task manually is time-consuming, and leaving it entirely in the hands of the victim could produce ambiguity in distinguishing between the categories.

### Exascale Deep Learning to Accelerate Cancer Research

Deep learning, through the use of neural networks, has demonstrated remarkable ability to automate many routine tasks when presented with sufficient data for training. The neural network architecture (e.g. number of layers, types of layers, connections between layers, etc.) plays a critical role in determining what, if anything, the neural network is able to learn from the training data. The trend for neural network architectures, especially those trained on ImageNet, has been to grow ever deeper and more complex. The result has been ever increasing accuracy on benchmark datasets with the cost of increased computational demands. In this paper we demonstrate that neural network architectures can be automatically generated, tailored for a specific application, with dual objectives: accuracy of prediction and speed of prediction. Using MENNDL--an HPC-enabled software stack for neural architecture search--we generate a neural network with comparable accuracy to state-of-the-art networks on a cancer pathology dataset that is also $16\times$ faster at inference. The speedup in inference is necessary because of the volume and velocity of cancer pathology data; specifically, the previous state-of-the-art networks are too slow for individual researchers without access to HPC systems to keep pace with the rate of data generation. Our new model enables researchers with modest computational resources to analyze newly generated data faster than it is collected.

### Gap Aware Mitigation of Gradient Staleness

Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly to the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA outperforms the currently acceptable gradient penalization method, in final test accuracy. We also provide convergence rate proof for GA. Despite prior beliefs, we show that if GA is applied, momentum becomes beneficial in asynchronous environments, even when the number of workers scales up.