A common problem we all face when working on deep learning projects is choosing a learning rate and an optimizer (the so-called hyper-parameters). If you’re like me, you find yourself guessing an optimizer and a learning rate, then checking whether they work (and we’re not alone).

To better understand the effect of optimizer and learning rate choice, I trained the same model 500 times. The results suggest that the correct hyper-parameters are critical to training success, yet can be difficult to find.

In this article, I will discuss a solution to this problem: using automated methods to choose optimal hyper-parameters.

**Experimental setup**

I trained a basic convolutional neural network from TensorFlow’s tutorial series, which learns to recognize MNIST digits. This is a fairly small network, consisting of two convolutional layers and two dense layers, with a total of about 3,400 weights to train. The same random seed is used for each training run.

It should be noted that the results below are for a specific model and dataset. Ideal hyper-parameters will vary for other models and datasets.

(If you would like to devote some GPU time to running a larger version of this experiment on CIFAR-10, please get in touch.)

**Which learning rate works best?**

The first thing we’ll look at is how the learning rate affects model training. In each run, the same model is trained from scratch, only the optimizer and the learning rate differ.

The model was trained with six different optimizers: Gradient Descent, Adam, AdaGrad, Adadelta, RMSProp, and Momentum. Each optimizer was trained with 48 different learning rates, spaced logarithmically from 0.000001 to 100.

In each run, the network is trained until it achieves at least 97% train accuracy. The maximum time allowed was 120 seconds. The experiments were run on an Nvidia Tesla K80 hosted by FloydHub. The source code is available for download.
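The sweep described above can be sketched in a few lines. The loop below is a minimal illustration, not the actual benchmark code: the optimizer list and the log-spaced grid follow the description, but `train_model` is a hypothetical callback standing in for the real TensorFlow training loop.

```python
import itertools
import time

OPTIMIZERS = ["gradient_descent", "adam", "adagrad",
              "adadelta", "rmsprop", "momentum"]
# 48 learning rates, log-spaced over the 8 decades from 1e-6 to 100
LEARNING_RATES = [10 ** (-6 + 8 * i / 47) for i in range(48)]

def run_sweep(train_model, target_acc=0.97, budget_s=120.0):
    """train_model(opt, lr) yields train accuracy once per epoch.

    Returns {(opt, lr): seconds to reach target_acc,
             or None if the time budget ran out}.
    """
    results = {}
    for opt, lr in itertools.product(OPTIMIZERS, LEARNING_RATES):
        start = time.monotonic()
        results[(opt, lr)] = None  # None => failed to train in time
        for acc in train_model(opt, lr):
            elapsed = time.monotonic() - start
            if acc >= target_acc:
                results[(opt, lr)] = elapsed
                break
            if elapsed > budget_s:
                break
    return results
```

With 6 optimizers and 48 learning rates, this is 288 training runs per pass of the experiment.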

For every optimizer, the majority of learning rates fail to train the model.

For each optimizer there is a valley shape: too low a learning rate never progresses, and too high a learning rate causes instability and never converges. In the middle, there is a band of “just right” learning rates that train successfully.

To summarize the above, it is important that you choose the right learning rate. Otherwise your network will either fail to train, or take longer to converge.

To illustrate how the optimal learning rate differs across optimizers, here are the fastest and slowest models to train at each learning rate, taken across all optimizers. Note that the time in the graph is capped at 120s (meaning the network failed to train), and that there is no single learning rate that works for every optimizer:
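The fastest/slowest figures can be derived from the sweep results with a simple aggregation. This is a sketch assuming the results are stored as a `{(optimizer, learning_rate): seconds}` mapping, with `None` for runs that failed within the 120s budget:

```python
def fastest_slowest_per_lr(results, cap_s=120.0):
    """For each learning rate, return (fastest, slowest) training time
    across all optimizers, counting failed runs as the time cap."""
    by_lr = {}
    for (opt, lr), seconds in results.items():
        t = cap_s if seconds is None else seconds
        by_lr.setdefault(lr, []).append(t)
    return {lr: (min(ts), max(ts)) for lr, ts in by_lr.items()}
```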

The graph above shows the wide range of learning rates (from 0.001 to 30) that achieve success with at least one optimizer.

**Which optimizer performs best?**

Now that we have identified the best learning rate for each optimizer, let’s compare the performance of training each optimizer with the best learning rate found in the previous section.

Here is the validation accuracy of each optimizer over time. This lets us see how quickly, accurately, and stably each optimizer performs:

Adam had a relatively wide range of successful learning rates in the previous experiment. Overall, Adam is the best choice out of our six optimizers for this model and dataset.

**How does model size affect training time?**

Now let’s see how the size of the model affects how it is trained.

We will change the size of the model by a linear factor. That factor will linearly scale the number of convolutional filters and the width of the first dense layer, thus almost linearly scaling the total number of weights in the model.
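As an illustration, counting the weights of such a scaled model might look like the sketch below. The architecture and base sizes (8 and 16 filters, a 32-unit dense layer) are hypothetical placeholders rather than the tutorial model's actual values, so the absolute counts will not match the article's model; the point is only that a single integer factor widens the convolutional filters and the first dense layer together.

```python
def conv_params(in_ch, out_ch, k=3):
    # k*k kernel weights per (in, out) channel pair, plus one bias per filter
    return k * k * in_ch * out_ch + out_ch

def model_params(scale=1, base_filters=(8, 16), base_dense=32, n_classes=10):
    """Approximate weight count for a two-conv, two-dense MNIST model
    whose filter counts and first dense width are multiplied by `scale`."""
    f1, f2 = base_filters[0] * scale, base_filters[1] * scale
    d1 = base_dense * scale
    params = conv_params(1, f1) + conv_params(f1, f2)
    flat = 7 * 7 * f2                      # 28x28 input after two 2x2 max-pools
    params += flat * d1 + d1               # first dense layer
    params += d1 * n_classes + n_classes   # output layer
    return params
```

A helper like this lets you check how the weight count grows as the factor increases from 1x to 10x before committing GPU time to the scaled runs.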

**How does the training time change as the model grows?**

Below is the time it took the model to achieve 96% training accuracy as its size was scaled from 1x to 10x.

This is a good result. Our choice of hyper-parameters was not invalidated by linearly scaling the model. This may indicate that hyper-parameter discovery can be performed on a scaled-down version of a network to save computation time.

It also shows that, as the network grows larger, training does not involve O(n²) work; the roughly linear increase in time may be explained by the additional operations performed to train each extra weight.

This result is even more encouraging, as it shows that our deep learning framework (here TensorFlow) scales efficiently with model size.