2.2 - Optimization algorithms
Optimization algorithms: Course 2 - Week 2
This course goes over how to optimize the algorithm for finding the lowest cost. We saw gradient descent, which finds the lowest cost by differentiating the cost function and moving toward its minimum. However, with a NN training on lots of data, each step toward the lowest cost may take a long time, so any improvement in the training algorithm helps a lot.
Several such algorithms are discussed below:
1. Mini batch gradient descent (mini gd):
We train our NN on m examples, where "m" may be in the millions. We vectorized these m examples so that we don't have to run expensive for loops. But even then, it takes a long time to run across all m examples. So, we divide our m examples into "k" mini batches with m/k examples in each mini batch. We call our original gradient descent scheme "Batch gd".
We use an inner "for" loop to run gradient descent on each mini batch, and an outer "for" loop to iterate "num" times to find the lowest cost. Each pass through all the examples is called one "epoch".
With "Batch gd", our cost function would come down with each iteration. However, with mini batch, our cost function is noisy and oscillates up and down, as it comes down to a minima. One hyper parameter to choose is the size of mini batch. 3 possibilities:
- Size of m: When each mini batch contains all "m" examples, then it becomes same as "batch gd". It takes too long when "m" is very large.
- Size of 1: When each mini batch contains only 1 example, then it it's called "stochastic gd". This is the other extreme of batch gd. Here we lose all the advantage of vectorization, as we have a for loop for each example. It's cost function is too noisy as it keeps on oscillating, as it approaches the minima.
- Size of 1<size<m: When each mini batch contains only a subset of example, then it it's called "mini batch gd". This is the best approach, provided we can tune the size of each mini batch. typical mini batch sizes are power of 2, and chosen as 64, 128, 256 and 512. Mini batch size of 1024 is also employed, though it's less common. Mini batch size should be such that it fits in CPU/GPU memory, or else performance will fall off the cliff (as we'll continuously be swapping training set data in and out of memory)
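Here is a minimal sketch of the mini batch loop described above, assuming a NumPy setup where examples are stored as columns of X and Y; the helper names in the commented loop (backward_prop, update_params) are placeholders, not the assignment's actual API:
```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the m examples (columns of X, Y) and split them into mini batches."""
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    return [(X_shuf[:, k:k + mini_batch_size], Y_shuf[:, k:k + mini_batch_size])
            for k in range(0, m, mini_batch_size)]

# Outer loop = epochs, inner loop = mini batches (placeholder helpers):
# for epoch in range(num_epochs):
#     for X_b, Y_b in random_mini_batches(X, Y, 64, seed=epoch):
#         grads = backward_prop(X_b, Y_b, params)      # gradients on this mini batch only
#         params = update_params(params, grads, alpha) # one gd step per mini batch
```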
2. Gradient Descent with momentum:
We make an observation when running gradient descent with mini batches: there are oscillations, which are due to W, b getting updated with only a small number of examples in each step. When it sees the next mini batch, W, b may get corrected to a different value in the opposite direction, resulting in oscillations. These oscillations are in the Y direction (i.e. values of weight/bias jumping around) as we approach an optimal value (of the cost function) in the X direction. These Y direction oscillations are the ones we don't want. They can be reduced by a technique known as "Exponentially weighted avg". Let's see what it is:
Exponentially weighted average:
Here, we average the new weight/bias value with its previous values. So, in effect, any dramatic update to weight/bias in the current mini batch doesn't cause too much change immediately. This smooths out the curve.
Exponentially weighted avg is defined as:
Vt = beta*V{t-1} + (1-beta)*Xt => Here Xt is sample number "t", and V{t-1} is the average computed at the previous step (with V0 = 0). t goes from 1, 2, ... n, where n is the total number of samples.
It can be shown that Vt is approximately an avg over the last 1/(1-beta) samples. So, for beta=0.9, Vt is an avg over the last 10 samples. If beta=0.98, then Vt is an avg over the last 50 samples. The higher the beta, the smoother the curve, as it averages over a larger number of samples.
It's called exponential because, if we expand Vt, we see that it contains exponentially decaying weights for previous samples, i.e. Vt = (1-beta)*[ Xt + beta*X{t-1} + beta^2*X{t-2} + beta^3*X{t-3} + ... ]
The weight decays to 1/e by the time we go back 1/(1-beta) samples, since beta^(1/(1-beta)) ≈ 1/e. So, in effect, it's taking an avg of the last 1/(1-beta) samples.
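A quick numerical check of that 1/e claim (plain arithmetic, nothing from the assignment):
```python
beta = 0.9
print(beta ** (1 / (1 - beta)))  # 0.9**10 ~ 0.349
print(1 / 2.718281828)           # 1/e     ~ 0.368
```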
However, the above eqn has an issue at startup. We choose V0=0, so the first few values of Vt are way off from the avg value of Xt, until we have collected a few samples of Xt. To fix this, we add a bias correction term as follows:
Vt (bias corrected) = [ beta*V{t-1} + (1-beta)*Xt ] / (1-beta^t) => Here we divided the whole term by (1-beta^t). For the first few values of "t", (1-beta^t) is small, so dividing by it scales Vt up and compensates for starting at V0=0. As "t" gets larger, 1/(1-beta^t) goes to 1, and the correction has no impact.
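A small sketch of the exponentially weighted avg with bias correction, written as a standalone helper (my own illustrative function, not from the course code):
```python
def ewa_with_bias_correction(samples, beta=0.9):
    """Return the bias-corrected exponentially weighted avg after each sample."""
    v = 0.0
    out = []
    for t, x in enumerate(samples, start=1):
        v = beta * v + (1 - beta) * x        # Vt = beta*V{t-1} + (1-beta)*Xt
        out.append(v / (1 - beta ** t))      # divide by (1 - beta^t) to fix the startup
    return out

# Example: a constant signal of 5.0 stays at 5.0 after correction,
# whereas the raw (uncorrected) avg would start near 0.5.
print(ewa_with_bias_correction([5.0, 5.0, 5.0]))
```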
gd with momentum:
Now we apply the above technique of "Exponentially weighted avg" to gd. Instead of using dW and db directly to update W, b, we use the weighted avgs of dW and db. This results in a much smoother path, where W, b don't oscillate too much from one iteration to the next.
Instead of doing W=W-alpha*dW, b=b-alpha*db, as in our original gd algo, we use:
W=W-alpha*Vdw, b=b-alpha*Vdb, where Vdw, Vdb are the exponentially weighted avgs (over roughly the last 1/(1-beta1) samples) of dW and db, defined as
Vdw = beta1*Vdw+ (1-beta1)*dW, Vdb = beta1*Vdb+ (1-beta1)*db
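As a sketch, a single momentum update step for one layer's parameters might look like this (an illustrative function assuming NumPy arrays or floats, not the assignment's exact signature):
```python
def momentum_step(W, b, dW, db, v_dW, v_db, beta1=0.9, alpha=0.01):
    """One gd-with-momentum step: update W, b using weighted avgs of dW, db."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW   # Vdw = beta1*Vdw + (1-beta1)*dW
    v_db = beta1 * v_db + (1 - beta1) * db   # Vdb = beta1*Vdb + (1-beta1)*db
    W = W - alpha * v_dW                     # W = W - alpha*Vdw
    b = b - alpha * v_db                     # b = b - alpha*Vdb
    return W, b, v_dW, v_db
```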
3. Gradient Descent with RMS prop:
This is a slightly different variant of gd with momentum. Here also we use the technique of exponentially weighted avg, but instead of averaging dW and db, we average the squares of dW and db. Also, note that in "gd with momentum" we never knew which components of dW and db were oscillating; we smoothed all of them equally. Here, we damp the components that oscillate more, and vice versa. We achieve this by dividing dW and db by the square root of their weighted avg squares (instead of using the weighted avgs of dW and db directly). Whichever components oscillate the most (maybe w1, w7 and w9 oscillate the most), their derivatives are going to be the largest (dw1, dw7 and dw9 have high values). So, dividing these large derivatives by larger numbers damps them more than the ones with smaller derivatives. The eqns are as follows:
W=W-alpha*dW/√Sdw, b=b-alpha*db/√Sdb, where Sdw, Sdb are the exponentially weighted avgs (over roughly the last 1/(1-beta2) samples) of (dW)^2 and (db)^2, defined as
Sdw = beta2*Sdw+ (1-beta2)*(dW)^2, Sdb = beta2*Sdb+ (1-beta2)*(db)^2
NOTE: we used beta1 in gd with momentum, and beta2 in gd with RMS prop, to distinguish that they are different betas. Also, we add a small epsilon=10^-8 to √Sdw and √Sdb so that we don't run into the numerical issue of dividing by 0 (or by a number so small that computers effectively treat it as 0). So, the modified eqns become:
W=W- alpha*dW/(√Sdw + epsilon), b=b-alpha*db/(√Sdb + epsilon)
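A corresponding sketch of one RMS prop step (again an illustrative function, not the course's exact code):
```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, beta2=0.999, alpha=0.01, eps=1e-8):
    """One RMS prop step: scale each gradient by the root of its weighted avg square."""
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2   # Sdw = beta2*Sdw + (1-beta2)*(dW)^2
    s_db = beta2 * s_db + (1 - beta2) * db ** 2   # Sdb = beta2*Sdb + (1-beta2)*(db)^2
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)    # large Sdw => smaller effective step
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```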
4. Gradient Descent with Adam (Adaptive Moment Estimation):
The Adam optimization method took both "gd with momentum" and "gd with RMS prop" and put them together. It works better than both of them, and works extremely well across a wide range of applications. Here, we modify RMS prop a little: instead of using dW and db with alpha, we use Vdw and Vdb with alpha. This reduces oscillations even more, since we are applying 2 separate oscillation reducing techniques in one. The technique is called moment estimation since we are using different moments: dW is called the 1st moment, (dW)^2 is called the 2nd moment, and so on.
So, eqn look like:
W=W-alpha*Vdw/(√Sdw + epsilon), b=b-alpha*Vdb/(√Sdb + epsilon), where Vdw, Vdb, Sdw and Sdb are defined above
Here there are 4 hyper parameters to choose: alpha needs to be tuned, but beta1, beta2 and epsilon can be chosen as follows:
beta1=0.9, beta2=0.999, epsilon=10^-8. These values work well in practice, and tuning them doesn't help much.
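Putting the two together, one Adam step per layer could be sketched as below. Standard Adam (and, as far as I recall, the course's implementation) also applies the bias correction from above to the moments, so it's included here; the function itself is illustrative, not the assignment's exact API:
```python
import numpy as np

def adam_step(W, b, dW, db, v_dW, v_db, s_dW, s_db, t,
              alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum (1st moment) + RMS prop (2nd moment) + bias correction.
    t is the step counter, starting at 1."""
    v_dW = beta1 * v_dW + (1 - beta1) * dW          # 1st moments
    v_db = beta1 * v_db + (1 - beta1) * db
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2     # 2nd moments
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    v_dW_c = v_dW / (1 - beta1 ** t)                # bias-corrected moments
    v_db_c = v_db / (1 - beta1 ** t)
    s_dW_c = s_dW / (1 - beta2 ** t)
    s_db_c = s_db / (1 - beta2 ** t)
    W = W - alpha * v_dW_c / (np.sqrt(s_dW_c) + eps)
    b = b - alpha * v_db_c / (np.sqrt(s_db_c) + eps)
    return W, b, v_dW, v_db, s_dW, s_db
```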
5. Learning rate decay:
In all the gd techniques above, we kept the learning rate "alpha" constant. However, the learning rate doesn't need to be constant. It can be kept high when we start, as we need to take big steps, and reduced as we approach the optimal cost, since smaller steps suffice as we converge to the optimal value. Large steps cause oscillations, so reducing alpha reduces these oscillations and allows us to converge smoothly. This approach is called learning rate decay, and there are various techniques to achieve it.
The simplest formula for implementing learning rate decay is:
alpha = 1/(1+decay_rate*epoch_num) * alpha0, where alpha0 is our initial learning rate and epoch_num is the current epoch number (one full pass through the training set).
So, as we do more and more epochs, we keep reducing the learning rate, until it gets close to 0. Now we have one more hyper parameter, "decay_rate", on top of alpha0, both of which need to be tuned.
Other formulas for implementing learning rate decay are:
alpha = ((decay_rate)^epoch_num) * alpha0 => This also decays the learning rate (exponential decay, with decay_rate < 1)
alpha = (k/√epoch_num) * alpha0 => This also decays the learning rate
Some people also manually reduce alpha every couple of hours or days based on run time. No matter which formula is used, they all achieve the same result of reducing alpha, and hence the oscillations, over time.
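For concreteness, the first decay formula could be implemented like this (a trivial helper of my own; the alpha0 and decay_rate values are chosen just for the example):
```python
def decayed_alpha(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0:
# epoch 0 -> 0.2, epoch 1 -> 0.1, epoch 2 -> 0.067, epoch 3 -> 0.05, ...
```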
Conclusion:
Adam and the other techniques discussed above speed up NN training. They help with the problem of plateaus in gd, where the gradient changes very slowly over a large region, resulting in very slow learning. All these gd techniques speed up this learning process by speeding up progress in the X direction. There is also the problem of getting stuck in local minima, but this turns out not to be an issue in NN with large dimensions. This is because, instead of hitting local minima (where the shape is like the bottom of a trough or valley), we hit saddle points (where the shape is like the saddle of a horse). For a local minimum, all dimensions need to have a trough-like shape, which is highly unlikely for our NN to get stuck at. At least one of the dimensions will have a slope to get us out of the saddle point, so we keep making progress without ever getting stuck at a local minimum.
Programming Assignment 1: Here we implement the different optimization techniques discussed above. We apply them to a 3 layer NN (of blue/red dots):
- batch gd: This is same as what we did in last few exercises. This is just to warm up. Nothing new here.
- mini batch gd: We implement mini batch gd by shuffling the data randomly and partitioning it into mini batches.
- Momentum gd: This implements gd with momentum
- Adam gd: This implements Adam - which combines momentum and RMS prop
- model: Now we apply the 3 optimization techniques to our 3 layer NN = mini batch gd, momentum gd and Adam gd.
Here's the link to the pgm assignment:
This project has 2 python pgms that come with it, plus my own test pgm:
A. testCases.py => There are a bunch of test cases here to test your functions as you write them. In my pgm, I have them turned off.
B. opt_utils_v1a.py => This pgm defines various functions similar to what we used in previous assignments. There is also a dataset that we use to run our model on.
C. test_cr2_wk2.py => This pgm calls functions in the above pgms. It implements all the optimization techniques discussed above. It loads the dataset, and runs the model with mini batch gd, momentum gd and Adam gd. Adam gd performs the best.
Summary:
By finishing this exercise, we learnt many faster techniques for implementing gradient descent. We would get the same accuracy with all of the gd methods discussed above; it's just that the slower gd techniques require a lot more iterations to get there. This can be verified by setting "NUM_EPOCH" to 90,000 in our program.