2.1 - Practical Aspects of Deep Learning

Practical Aspects of Deep Learning: Course 2 - Week 1

This course goes over how to choose the various parameters of your NN. Designing a NN is a very iterative process: we have to decide on the number of layers, the number of hidden units in each layer, the learning rate, which activation function to use in each layer, etc. Depending on the field or application where the NN is being applied, these choices may vary a lot. The only way to find out what works is to try many combinations and see what works best.

We looked at how a data set in ML is typically divided into a training set and a test set. We also have a dev set, which we use to try out our various NN implementations; once we narrow it down to a couple of NNs that work best, we try those on the test set to finally pick one. With large data sets, the training set is usually around 99% of all data, while the dev and test sets are each small, at 1% or less.

Bias and variance:

Underfitting (High Bias): Here the training data doesn't fit well with our ML implementation. The training set error is high, and the dev set error is equally high. To resolve underfitting, we need a bigger NN (e.g., more layers or more hidden units) so that we can fit the training data better.

Overfitting (High Variance): Here the training data fits too well with our implementation. The training set error is low, but the dev set error is high. To resolve overfitting, we use more training data or regularization schemes (discussed later).

Right fit: Here the data neither underfits nor overfits.

Ideally we want low bias and low variance, implying the training set error is low and the dev set error is also low. The worst case is high bias and high variance, implying the training set error is high and the dev set error is even higher, so our ML implementation did badly everywhere. We address high bias and high variance by selecting our ML implementation carefully and then deploying additional tactics to reduce each one.

In the small-data era, we used to trade off between bias and variance, as improving one worsened the other. In the big-data era, we can reduce both: bias can be reduced by adding more layers (a bigger network), while variance can be reduced by adding more training data.

Regularization: 

This is a technique used to reduce overfitting (high variance). The basic way we prevent overfitting is by spreading out the weights, so that the network doesn't rely too heavily on a small set of weights. This makes the training data fit a bit less accurately, and by doing that it prevents overfitting. There are many techniques used to achieve this; some of them are described below.

A. L1/L2 regularization:

This is done by lowering the overall weight values so that the weight terms are closer to 0 and have less of an impact. You can think of the new NN with lower weights as a reduced NN, where some of the weight terms have effectively vanished. Another way to see it: with weights close to 0, activation functions like sigmoid and tanh stay in the linear region of their curve, so the whole NN behaves more like a linear network where we are just adding up the linear portions of the activation functions. It then behaves much like logistic regression, which is just a single-layer NN.

To achieve regularization, we add a sum over the weights to the cost term and try to minimize this new cost (which includes the weight terms). The cost-minimizing method will then also try to keep the weights low, so that the overall sum of weights stays low. There are 2 types of regularization:

L1 Regularization: Here we add the sum of the absolute values of the weights to the cost function:

For Logistic Regression: J(w,b) = 1/m * ∑ L(...) + λ/(2m) ∑ |w_i| = 1/m * ∑ L(...) + λ/(2m).||w||1, where the weight sum runs over all inputs (i=1 to i=nx) and ||w||1 is the L1 norm of w.

For an L-layer NN: Here w is a matrix for each layer. The regularization term added is λ/(2m) ∑ ||w[l]||, where we sum over all layers (layer 1 to layer L), adding all the weight terms in the matrix of each layer, i.e.

||w[l]|| = ∑ ∑ |w[l]i,j| where i=1 to n[l], j=1 to n[l-1] => all entries of the matrix are added together (in an L-layer NN, the dim of w[l] is (n[l], n[l-1])).

L2 Regularization: Here we add the sum of the squares of the weights to the cost function:

For Logistic Regression: J(w,b) = 1/m * ∑ L(...) + λ/(2m) ∑ (w_i)^2 = 1/m * ∑ L(...) + λ/(2m) ||w||^2, where ||w||^2 = wT.w, summed over all inputs (i=1 to i=nx).

For an L-layer NN: This is the same as for L1 regularization, except that we square each weight term. The regularization term added is λ/(2m) ∑ ||w[l]||^2, where we sum over all layers (layer 1 to layer L), adding the squares of all weight terms in the matrix of each layer, i.e.

||w[l]||^2 = ∑ ∑ (w[l]i,j)^2 where i=1 to n[l], j=1 to n[l-1] => all entries of the matrix are squared and then added together. For a matrix this is known as the Frobenius norm rather than the L2 norm, for historical reasons; the name L2 norm is used for a single vector sum, as in Logistic Regression.

When calculating dw[l] (i.e. dJ/dw[l]) for an L-layer NN, we need to differentiate this extra term as well, which adds an extra term (λ/m).w[l]. Then, when updating w[l] = w[l] - α.dw[l], we carry this extra term: w[l] = w[l] - α.(dw[l] + (λ/m).w[l]), where dw[l] refers to the original dw[l] that was there before regularization.

So the new w[l] = (1 - α.λ/m).w[l] - α.dw[l] => the equation keeps the same form as before, except that w gets multiplied by a factor (1 - α.λ/m). Since this factor is less than 1, the weights are reduced from their original values. This is why L2 regularization is also called "weight decay": the weights decay by a constant factor on every update.

λ is called the regularization parameter. It's another hyperparameter that needs to be tuned to see what works best for a given NN. Since lambda is a reserved keyword in Python, we use lambd as the variable name.

NOTE: in both cases above, we don't include "b" in the sum (i.e. we don't add λ/(2m).b or λ/(2m).b^2), as it has negligible impact on reducing overfitting.
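As an illustration, here is a minimal numpy sketch (not the assignment's exact code) of the L2-regularized cost and the weight-decay update described above. It assumes the parameters and grads dictionaries use keys 'W1','b1',... and 'dW1','db1',... as in the course notebooks, with alpha as the learning rate and lambd as the regularization parameter:

    import numpy as np

    def l2_regularized_cost(cross_entropy_cost, parameters, lambd, m):
        """Add the L2 term (lambd/(2*m)) * sum_l ||W[l]||_F^2 to the unregularized cost.
        Assumes parameters holds weight matrices under keys 'W1', 'W2', ..."""
        L = len(parameters) // 2                      # number of layers (one W and one b per layer)
        l2_term = sum(np.sum(np.square(parameters['W' + str(l)])) for l in range(1, L + 1))
        return cross_entropy_cost + (lambd / (2 * m)) * l2_term

    def update_with_weight_decay(parameters, grads, lambd, alpha, m):
        """Gradient descent step with the extra (lambd/m)*W[l] term, i.e. weight decay:
        W[l] := (1 - alpha*lambd/m) * W[l] - alpha * dW[l]."""
        L = len(parameters) // 2
        for l in range(1, L + 1):
            W, b = parameters['W' + str(l)], parameters['b' + str(l)]
            dW, db = grads['dW' + str(l)], grads['db' + str(l)]
            parameters['W' + str(l)] = W - alpha * (dW + (lambd / m) * W)   # regularized update
            parameters['b' + str(l)] = b - alpha * db                       # b is not regularized
        return parameters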

B. Dropout Regularization:

Here, we achieve regularization by dropping out hidden units randomly on each iteration of cost optimization. This prevents the algorithm from depending too heavily on any one weight term or set of weight terms, since the corresponding units may disappear at any time, during any iteration of optimization. This causes the weights to be more evenly distributed, reducing overfitting. It may seem like a hacky scheme, but it works well in practice.

Inverted Dropout: A revised and more effective implementation of dropout is inverted dropout, where we scale the surviving activation values by 1/keep_prob, so that the expected activation values remain unchanged irrespective of how many hidden units we dropped.

NOTE: Dropout regularization is applied only during training, NOT at test time. This makes sense: once the weights are finalized by training with dropout, we use all of those weights (with no units dropped) on test data.
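Below is a minimal sketch of inverted dropout for one layer's activation matrix A (shape: units x examples); the function name and the keep_prob value are just illustrative:

    import numpy as np

    def inverted_dropout(A, keep_prob=0.8):
        """Apply inverted dropout to an activation matrix A.
        Each unit is kept with probability keep_prob; surviving activations are scaled
        by 1/keep_prob so the expected value of A is unchanged. Training time only."""
        D = (np.random.rand(*A.shape) < keep_prob)   # random dropout mask of 0s/1s
        A = A * D                                    # shut down the dropped units
        A = A / keep_prob                            # "inverted" step: rescale the survivors
        return A, D                                  # D is reused in back-prop to mask dA

    # toy usage: drop units of a 4x5 activation matrix
    A1 = np.random.randn(4, 5)
    A1_drop, D1 = inverted_dropout(A1, keep_prob=0.8)

At test time we simply skip this step and use A as-is; because of the division by keep_prob during training, no extra scaling is needed at test time.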

C. Other Regularization:

1. Data augmentation: We'll generally achieve better regularization (less overfitting) with more data. Instead of collecting more data, we can use the existing data to augment our training set. This can be done by using mirror images of pictures, zoomed-in crops, rotated pictures, etc.

2. Early stopping: This is another approach where we stop the cost-optimization loop once the dev set error stops improving, instead of letting it run for a very large number of iterations. This reduces overfitting (a rough sketch is shown after this list). L2 regularization is generally preferred over early stopping, as you can usually get the same or better variance with L2 regularization than with early stopping.
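Here is a rough sketch of early stopping based on dev-set error. train_one_epoch and dev_error are hypothetical helper functions (not part of the assignments), and patience is an assumed hyperparameter for how many epochs to wait without improvement:

    import copy

    def train_with_early_stopping(parameters, train_one_epoch, dev_error,
                                  max_epochs=1000, patience=10):
        """Stop the optimization loop once the dev-set error has not improved for
        `patience` epochs, and return the parameters with the best dev error so far."""
        best_err, best_params, epochs_since_best = float('inf'), copy.deepcopy(parameters), 0
        for epoch in range(max_epochs):
            parameters = train_one_epoch(parameters)      # one pass of gradient descent
            err = dev_error(parameters)                   # error on the dev (not test) set
            if err < best_err:
                best_err, best_params, epochs_since_best = err, copy.deepcopy(parameters), 0
            else:
                epochs_since_best += 1
                if epochs_since_best >= patience:         # no improvement for `patience` epochs
                    break
        return best_params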

Normalize inputs:

We normalize the input vector x by subtracting the mean and dividing by the standard deviation (the square root of the variance).

So, Xnormal = (Xorig - µ) / σ, where the mean µ = 1/m * Σ X(i)orig (summing over the m samples of each X), and the std deviation σ = √( 1/m * Σ (X(i)orig - µ)^2 ).

If there are 5 input features X1,...,X5, then we do this for each of the 5 features across all m examples. This helps because subtracting the mean centers each feature around the origin, and dividing by the std deviation scales it so that every dimension is spread over a similar range. This makes the input more symmetrical, so finding the optimal cost goes more smoothly and faster.
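A minimal numpy sketch of this normalization, assuming X is laid out as (number of features x m examples) as in the course:

    import numpy as np

    def normalize_inputs(X):
        """Normalize X (shape: n_x features x m examples) feature-wise:
        subtract the per-feature mean and divide by the per-feature std deviation."""
        mu = np.mean(X, axis=1, keepdims=True)        # mean of each feature over the m examples
        sigma = np.std(X, axis=1, keepdims=True)      # std deviation of each feature
        X_norm = (X - mu) / sigma
        return X_norm, mu, sigma                      # reuse the same mu/sigma on dev/test data

    # toy usage: 5 input features, 100 examples
    X = np.random.randn(5, 100) * 7 + 3
    X_norm, mu, sigma = normalize_inputs(X)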

Vanishing/Exploding gradients:

With a very deep NN, we have the problem of vanishing or exploding gradients, i.e. gradients become too small or too big. Prof Andrew shows this with an example of how the activations (and gradients) scale roughly like the weight values raised to the power L. So values greater than 1 in the weight matrices start exploding, while values less than 1 start vanishing (they head toward 0). One way to partially solve this is to initialize the weight matrices carefully.
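A tiny toy illustration of this effect, using the same weight matrix at every layer of a linear network (the sizes and values here are made up purely to show the scaling):

    import numpy as np

    L = 50                         # a deep (toy) network; illustrative value
    x = np.ones((2, 1))            # toy input vector

    for scale in (1.5, 0.5):       # weight values slightly >1 vs slightly <1
        W = scale * np.eye(2)      # same weight matrix reused in every layer, linear activations
        a = x
        for _ in range(L):
            a = W @ a              # after L layers, a = (scale**L) * x
        print(f"weights = {scale}: |activation| after {L} layers = {np.linalg.norm(a):.2e}")
    # prints roughly 9.0e+08 for 1.5 (exploding) and 1.3e-15 for 0.5 (vanishing)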

Initializing Weight matrix:

For any layer l, the output Z = w1.x1 + ... + wn.xn. If the number of inputs n is large, then we want the weights w1..wn to be small so that Z doesn't become too large. So we scale the randomly initialized weights down by n (in practice, by the square root of n). This keeps the weight elements from getting too big. Initializing to "0" doesn't work, as it's not able to break symmetry.

For random initialization, we multiply as follows:

1. tanh activation function: For tanh, this is called Xavier initialization and is done as follows: W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1]). We use the size of layer (l-1) instead of "l", since we divide by the input size of the layer, and the input to layer "l" has size n[l-1].

2. ReLU activation function: For ReLU, it's observed that np.sqrt(2/n[l-1]) works better (this is He initialization).

3. Others: Many other variants can be used, and we'll just have to try and see what works best. A sketch of the Xavier and He initializations above is shown below.
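A minimal sketch of the Xavier and He initializations, assuming layer_dims lists the layer sizes [n_x, n[1], ..., n[L]] as in the assignments; the function name and the "method" argument are just illustrative:

    import numpy as np

    def initialize_parameters(layer_dims, method="he"):
        """Random init for an L-layer NN. layer_dims = [n_x, n_1, ..., n_L].
        'xavier' scales by sqrt(1/n[l-1]) (good with tanh), 'he' by sqrt(2/n[l-1]) (good with ReLU)."""
        np.random.seed(3)
        parameters = {}
        for l in range(1, len(layer_dims)):
            fan_in = layer_dims[l - 1]                       # size of the previous layer, n[l-1]
            scale = np.sqrt(2. / fan_in) if method == "he" else np.sqrt(1. / fan_in)
            parameters['W' + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
            parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))   # biases can start at 0
        return parameters

    # toy usage: 3-layer network with 2 inputs
    params = initialize_parameters([2, 10, 5, 1], method="he")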

Gradient Checking:

Definition of the derivative of F at x: F'(x) = lim(e→0) [F(x+e) - F(x-e)] / (2e), where e goes to 0 in the limiting case.

We use this definition to check our gradients by comparing the value obtained from the formula above (with a small e) against the gradient computed by our back-propagation code. If the normalized difference is large (e.g. > 0.001), then we should doubt the dw and db gradients calculated by the program.
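A minimal sketch of gradient checking on a flattened parameter vector theta; the function and variable names here are illustrative, not the assignment's exact API:

    import numpy as np

    def gradient_check(J, theta, dtheta, epsilon=1e-7):
        """Compare the analytic gradient dtheta (from back-prop) against a centered-difference
        approximation of dJ/dtheta. J is a function of the flattened parameter vector theta."""
        grad_approx = np.zeros_like(theta)
        for i in range(theta.shape[0]):
            theta_plus, theta_minus = theta.copy(), theta.copy()
            theta_plus[i] += epsilon
            theta_minus[i] -= epsilon
            grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
        # normalized difference between the two gradients
        diff = np.linalg.norm(grad_approx - dtheta) / (np.linalg.norm(grad_approx) + np.linalg.norm(dtheta))
        return diff   # roughly: > 1e-3 means the back-prop gradients are suspect

    # toy usage on J(theta) = theta_0^2 + 3*theta_1, whose exact gradient is [2*theta_0, 3]
    theta = np.array([2.0, -1.0])
    dtheta = np.array([2 * theta[0], 3.0])
    print(gradient_check(lambda t: t[0]**2 + 3 * t[1], theta, dtheta))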

 

Programming Assignment 1: here we learn how different initializations of the weight matrices result in totally different training accuracy. We apply the different init mechanisms to a 3-layer NN:

  • zero initialization: doesn't work, unable to break symmetry. Gives the worst accuracy on the training set.
  • large random initialization: very large weights cause the vanishing/exploding gradient problem, so it gives poor accuracy on the training set.
  • He initialization: this works best, as the weights are scaled by sqrt(2/n[l-1]) to keep the initial weights small, resulting in very high training accuracy.

Here's the link to the pgm assignment:

Initialization(1).html

This project has 2 python pgms.

A. init_utils.py => this pgm defines various functions similar to what we used in previous assignments

init_utils.py

B. test_cr2_wk1_ex1.py => This pgm calls functions in init_utils. It does all 3 initializations discussed above. We unknowingly did He initialization in the previous week's exercise.

test_cr2_wk1_ex1.py

 

Programming Assignment 2: here we use the same 3-layer NN as above. Now we apply different regularization techniques to see which works best. These are the 3 different regularization settings applied:

  • No regularization: here test accuracy is lower than training accuracy, due to overfitting. It gives high accuracy on the training set, but lower accuracy on the test set.
  • L2 regularization: here we apply L2 regularization, which results in lower accuracy on the training set but better accuracy on the test set. The parameter lambda can be tuned for more or less smoothing of the fitted boundary; a very high lambda can result in underfitting, i.e. high bias.
  • Dropout regularization: this works best here, as we get lower training accuracy but the highest test accuracy.

Here's the link to the pgm assignment:

Regularization_v2a.html

This project has 3 python pgms.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases.py

B. reg_utils.py => this pgm defines various functions similar to what we used in previous assignments.

reg_utils.py

C. test_cr2_wk1_ex2.py => This pgm calls functions in reg_utils. It does all 3 regularization settings discussed above (including no regularization).

test_cr2_wk1_ex2.py

 

Programming Assignment 3: here we employ the technique of gradient checking to find out whether our back propagation is computing gradients correctly. This is an optional exercise that can be omitted, as it's not really needed in the later AI courses.

Here's the link to the pgm assignment:

Gradient+Checking+v1.html

This project has 3 python pgms.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases.py

B. gc_utils.py => this pgm defines various functions similar to what we used in previous assignments.

gc_utils.py

C. test_cr2_wk1_ex3.py => This pgm calls functions in gc_utils. It does the gradient checking.

test_cr2_wk1_ex3.py