1.4 - Deep Neural Network
Course 1 - week 4 - Deep Neural Network:
This is week 4 of Course 1. Here we generalize the NN from 1 hidden layer to any number of hidden layers. The math gets more complicated, but it's the same thing as in the 2 layer NN, just repeated. A 2 layer NN has 1 hidden layer and 1 output layer, while an L layer NN has (L-1) hidden layers and 1 output layer. We don't count the input layer in the number of layers.
There are a few formulas here for forward and backward propagation; these form the backbone of a DNN. These formulas are summarized here:
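For reference, here's a quick recap of those equations for a generic layer l, written in the same notation used in the rest of these notes (g is the activation: relu for hidden layers, sigmoid for the output layer):
- Forward: Z[l] = np.dot(W[l], A[l-1]) + b[l], A[l] = g(Z[l]), with A[0] = X
- Cost: J = -1/m * np.sum(Y*np.log(AL) + (1-Y)*np.log(1-AL))
- Backward: dZ[l] = dA[l] * g'(Z[l]), dW[l] = 1/m * np.dot(dZ[l], A[l-1].T), db[l] = 1/m * np.sum(dZ[l], axis=1, keepdims=True), dA[l-1] = np.dot(W[l].T, dZ[l])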
There is a very good derivation of all these equations here:
https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60
There are 2 programming assignments this week: first we build a 2 layer NN and then an L layer NN to predict cat vs non-cat in given pictures. The 2 layer NN is mostly a repeat of last week's exercise, while the L layer NN is a generalization of the 2 layer NN.
Programming Assignment 1: Here we build helper functions for a deep NN. We also build separate helper functions for a 2 layer NN.
Here's the link to the pgm assignment:
Building_your_Deep_Neural_Network_Step_by_Step_v8a.html
This project has 3 python pgms that we need to understand.
A. testCases_v4a.py => There are a bunch of test cases here to test your functions as you write them. In my pgm, I have them turned on.
B. dnn_utils_v2.py => this is a pgm that defines a couple of functions.
These functions are:
- sigmoid(): This calculates the sigmoid for a given Z (Z can be a scalar or an array). The output returned is both A (which is the sigmoid of Z) and cache (which is the same as the i/p Z)
- sigmoid_backward(): This calculates dZ given dA and Z. dZ = dA*σ(Z)*[1-σ(Z)]. We stored Z in cache (in sigmoid() above)
- relu(): This calculates the relu for a given Z (Z can be a scalar or an array). The output returned is both A (which is the relu of Z) and cache (which is the same as the i/p Z)
- relu_backward(): This calculates dZ given dA and Z. dZ = dA where Z>0, else dZ=0. We stored Z in cache (in relu() above)
We'll import this file in our main pgm.
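Here's a minimal sketch of what these four helpers look like (my own reconstruction, not the course file verbatim):

import numpy as np

def sigmoid(Z):
    # A = sigma(Z); the returned cache holds Z for use in the backward pass
    A = 1 / (1 + np.exp(-Z))
    return A, Z

def sigmoid_backward(dA, cache):
    # dZ = dA * sigma(Z) * (1 - sigma(Z))
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)

def relu(Z):
    # A = max(0, Z); the returned cache holds Z for the backward pass
    A = np.maximum(0, Z)
    return A, Z

def relu_backward(dA, cache):
    # dZ = dA where Z > 0, else 0
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0
    return dZ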
C. test_cr1_wk4_ex1.py => This pgm just defines the helper functions that we'll call in the 2 layer and L layer NN models that we define in assignment 2.
Below are the functions defined in our pgm:
- initialize_parameters() => This function is exactly the same as the previous week's function for the 2 Layer NN. Input to the func is the size of the i/p layer, hidden layer and output layer. It initializes the W1,b1 and W2,b2 arrays. W1, W2 are init with random values (very important to have random values instead of 0), while b1,b2 are init to 0. It puts these 4 arrays in the dictionary "parameters" and returns that. NOTE: To be succinct, we will use w,b to mean W1,b1,W2,b2, going forward.
- initialize_parameters_deep() => This initializes w,b for the L layer NN (same as for the 2 layer NN, but extended to L layers). The i/p is an array containing the sizes of all the layers, while the o/p is initialized W1,b1, W2,b2, .... WL,bL for the L layer NN. All weights and biases are stored in the dictionary "parameters".
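A minimal sketch of initialize_parameters_deep(), assuming the plain 0.01 scaling used in assignment 1 (assignment 2 switches to a sqrt-based scaling, shown further below):

import numpy as np

def initialize_parameters_deep(layer_dims):
    # layer_dims = [n_x, n_h1, ..., n_y]; returns {"W1": ..., "b1": ..., ..., "WL": ..., "bL": ...}
    parameters = {}
    L = len(layer_dims)                      # number of layers including the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters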
- Forward functions: These are functions for forward computation:
- linear_forward() => It computes the output Z, given i/p A, W, b: Z = np.dot(W,A)+b. This is calculated for a single layer, using i/p A (which is the o/p from the previous layer). It returns Z and linear_cache, which is a tuple containing (Aprev,W,b), where Aprev is for the previous layer, while W, b are for the current layer.
- linear_activation_forward() => This computes the activation A for the Z that we calculated above for layer "l". The reason we separate out the 2 functions for computing Z and A is that A requires one of 2 diff functions, sigmoid or relu (depending on which one we want to use for the current layer: sigmoid is used for the output layer, while relu is used for all other layers). This keeps the code clean.
- We call following functions:
- linear_forward() => returns Z, linear_cache
- sigmoid() => returns A, activation_cache
- relu() => returns A, activation_cache
- We store all relevant values in tuple cache:
- linear_cache => stores tuple (Aprev,W,b), where Aprev is for previous layer, while W, b are for current layer.
- activation_cache => stores computed Z for current layer
- cache => stores tuple (linear_cache, activation_cache) = (Aprev, W, b, Z). In the previous week's example, we used cache to store A, Z for both layers (A1, Z1, A2, Z2), but here we store W, b too for each layer, on top of A (for the previous layer) and Z (for the current layer), in the tuple cache.
- The function finally returns A for current layer and cache. So, we end up returning (Aprev, W, b, Z, A), where Aprev is for previous layer, while W, b, Z, A are for current layer.
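A rough sketch of these two forward helpers (my reconstruction; it assumes the sigmoid()/relu() helpers from dnn_utils_v2.py above):

import numpy as np
# assumes sigmoid() and relu() from dnn_utils_v2.py are already imported

def linear_forward(A, W, b):
    # Z = W . A_prev + b for one layer; linear_cache = (A_prev, W, b)
    Z = np.dot(W, A) + b
    return Z, (A, W, b)

def linear_activation_forward(A_prev, W, b, activation):
    # one full layer: linear step followed by either relu or sigmoid
    Z, linear_cache = linear_forward(A_prev, W, b)
    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    else:                                    # "relu"
        A, activation_cache = relu(Z)
    cache = (linear_cache, activation_cache) # ((A_prev, W, b), Z)
    return A, cache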
- L_model_forward() => This function does the forward computation starting from the i/p X and generating the o/p Y hat (i.e. output AL for the last layer L). This is the same as the forward_propagation() function that we used in last week's example. It's just more involved now, since it deals with L layers instead of just 2. We define a list "caches", which is just every layer's cache appended together.
- From layer 1 to layer (L-1) (hidden layers), we call function linear_activation_forward() with the "relu" function in a for loop, (L-1) times
- In each loop iteration, cache and A are returned for that layer. A is used in the next iteration, while cache is appended to the list "caches"
- For the last layer L (o/p layer), we again call function linear_activation_forward(), but this time with the "sigmoid" function
- cache and AL are returned for the last layer. AL is going to be used in the compute_cost() function (defined below), while cache is appended to the list "caches"
- compute_cost() => computes the cost (the cross-entropy log loss between AL and Y).
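And a sketch of the full forward pass plus the cost, building on the helpers sketched above:

import numpy as np
# assumes linear_activation_forward() from the sketch above

def L_model_forward(X, parameters):
    # relu for layers 1..L-1, sigmoid for the output layer L; caches collects every layer's cache
    caches = []
    A = X
    L = len(parameters) // 2                 # parameters holds 2 entries (W, b) per layer
    for l in range(1, L):
        A, cache = linear_activation_forward(A, parameters['W' + str(l)],
                                             parameters['b' + str(l)], "relu")
        caches.append(cache)
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)],
                                          parameters['b' + str(L)], "sigmoid")
    caches.append(cache)
    return AL, caches

def compute_cost(AL, Y):
    # cross-entropy (log loss) between predictions AL and labels Y
    m = Y.shape[1]
    cost = -1/m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return np.squeeze(cost)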
- Backward functions: These are the functions for backward computation. They are the counterparts of the forward functions above, just going backward from layer L to layer 1.
- linear_backward() => This is the backward counterpart of the linear_forward() func. Given i/p cache and dZ for a given layer, it computes gradients dW, db, dA_prev. Input cache stores the tuple (Aprev, W, b). NOTE: the dW computation requires A from the previous layer
- A_prev, W, b = cache
- dW = 1/m * np.dot(dZ,A_prev.T)
- db = 1/m * np.sum(dZ,axis=1,keepdims=True)
- dA_prev = np.dot(W.T,dZ)
- linear_activation_backward() => This is the backward counterpart of linear_activation_forward() func. Instead of computing A from Z, this computes dA for previous layer given dA (from which dZ is computed) for current layer.
- We call following functions (same as what used in linear_activation_forward(), but now in backward dirn):
- sigmoid_backward() => returns dZ given dA for sigmoid func
- relu_backward() => returns dZ given dA for relu func
- linear_backward()=> using dZ returned by sigmoid/relu backward func above, it computes dA_prev, which is dA for previous layer (since we are going in reverse dirn)
- The function finally returns dA for previous layer and dW, db for current layer.
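A sketch of these two backward helpers (mirror images of the forward ones; assumes sigmoid_backward()/relu_backward() from dnn_utils_v2.py):

import numpy as np
# assumes sigmoid_backward() and relu_backward() from dnn_utils_v2.py are already imported

def linear_backward(dZ, cache):
    # cache = (A_prev, W, b) that linear_forward() saved for this layer
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = 1/m * np.dot(dZ, A_prev.T)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db

def linear_activation_backward(dA, cache, activation):
    # cache = (linear_cache, activation_cache) saved by linear_activation_forward()
    linear_cache, activation_cache = cache
    if activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:                                    # "relu"
        dZ = relu_backward(dA, activation_cache)
    return linear_backward(dZ, linear_cache)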
- L_model_backward() => This is the backward counterpart of L_model_forward(). This function does the backward computation starting from the o/p Y hat (i.e. output AL for the last layer L) and going all the way back to the input X. It returns the dictionary "grads" containing dW, db, dA.
- dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
- dA{L-1}, dWL, dbL => computed using func linear_activation_backward() for layer L. Uses dAL from above as i/p to this func
- Now, we run a loop from layer L-1 to layer 1 to compute dA, dW, db for each layer "l"
- dA{l-1}, dWl, dbl => computed using func linear_activation_backward() for layer "l". Uses dAl from the prev iteration as i/p to this func to compute dA{l-1}. It uses dA{L-1} from above for l=L-1 to compute dA{L-2} and then keeps iterating backward.
- Finally it returns the dictionary grads containing dW, db, dA for each layer
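A sketch of L_model_backward(), assuming the two backward helpers sketched above:

import numpy as np
# assumes linear_activation_backward() from the sketch above

def L_model_backward(AL, Y, caches):
    # walk backward from the output layer L down to layer 1, collecting gradients in grads
    grads = {}
    L = len(caches)                          # number of layers
    Y = Y.reshape(AL.shape)
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # output layer uses sigmoid
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, caches[L-1], "sigmoid")
    # hidden layers use relu; l runs from L-2 down to 0 (i.e. layers L-1 down to 1)
    for l in reversed(range(L-1)):
        dA_prev, dW, db = linear_activation_backward(grads["dA" + str(l+1)], caches[l], "relu")
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l+1)] = dW
        grads["db" + str(l+1)] = db
    return grads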
- update_parameters() => This function is the same as that in the previous week's exercise. It computes new w,b given old w,b and dw,db, using the learning rate provided. This is done for w,b for all layers 1 to L (i.e. W1=W1-learning_rate*dW1, b1=b1-learning_rate*db1, .... , WL=WL-learning_rate*dWL, bL=bL-learning_rate*dbL)
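A sketch of update_parameters() (one plain gradient descent step per layer):

def update_parameters(parameters, grads, learning_rate):
    # W[l] = W[l] - learning_rate * dW[l]; b[l] = b[l] - learning_rate * db[l], for l = 1..L
    L = len(parameters) // 2
    for l in range(1, L + 1):
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return parameters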
- two_layer_model()/L_layer_model() => These are the main funcs, but they are not defined or called here. They are part of assignment 2.
Programming Assignment 2: Here we use the helper functions defined above in assignment 1 to help build a 2 layer shallow NN and an L layer deep NN. We find optimal weights using training data and then apply those weights on test data to predict whether the picture has a cat or not.
Here's the link to the pgm assignment:
Deep+Neural+Network+-+Application+v8.html
This project has 2 python pgms that we need to understand.
A. dnn_app_utils_v3.py => this is a pgm that defines all the functions that we defined in assignment 1 above (both from dnn_utils_v2.py and test_cr1_wk4_ex1.py). So, we can either use our functions from assignment 1 or use the functions in here. If you wrote all the functions in assignment 1 correctly, then they should match all the functions in this pgm (except for a few differences noted below).
The few differences to note in the above pgm are:
- load_data() function: This function is extra here. It is exactly the same as the load_dataset() that we used in the week 2 assignment to load the cat vs non-cat dataset. Here too we load the same cat vs non-cat dataset that's in an h5 file.
- predict(): This prints the accuracy for any i/p set X (which can have multiple pictures in it). It uses w,b and generates the output y hat for the given X. If y hat > 0.5, it predicts cat, else non cat. It then compares the results to the actual y values, and prints the accuracy. It only calls 1 function => L_model_forward(). It returns the prediction array "p" for all pictures. I added an extra var "probas" (which is the output value y hat), so that we can see how close or far off the different predictions were, whether they were correct or wrong. This gives us a sense of how our algorithm is doing.
- print_mislabeled_images(): This takes as i/p the dataset X,Y along with the predicted Y hat, and plots all images whose label isn't the same as what was predicted (i.e. wrongly classified)
- IMPORTANT: initialize_parameters_deep() function: This function is the same as what we wrote in assignment 1 above, with a subtle difference. Here we use a different number to initialize w. Instead of multiplying the random numbers by 0.01, we multiply them by 1 / np.sqrt(layer_dims[l-1]) for a given layer l. As you will see, this makes a lot of difference in getting the cost low. With 0.01, our cost starts at 0.693148, and remains at 0.643985 at iteration 2400. Accuracy for the training data remains low at 0.65. However, using the new sqrt multiplier, our cost starts at 0.771749, and goes down to 0.092878 at iteration 2400, giving us a training data accuracy of 0.98.
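The only change from the assignment 1 version is the scaling on W. A sketch of the scaled initialization (the random seed shown is just illustrative):

import numpy as np

def initialize_parameters_deep(layer_dims):
    # same structure as before, but W is scaled by 1/sqrt(size of previous layer) instead of 0.01
    np.random.seed(1)                        # seed value is an assumption, just for reproducibility
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l-1])
                                    / np.sqrt(layer_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters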
We'll import this file in our main pgm below
B. test_cr1_wk4_ex2.py => This pgm calls the functions in dnn_app_utils_v3.py. Here, we define our algorithm for a 2 layer NN and an L layer NN by calling the functions defined above. We find optimal weights by training our algorithm on the training data. We then apply those weights on test data to see how well our NN predicts cat vs non cat.
Below are the functions defined in our pgm:
- two_layer_model() => This function implements a 2 layer NN. It is mostly the same as the previous week's function for the 2 Layer NN, which was called nn_model(). The big difference is that we used the tanh() function for the hidden layer there, while here we'll use the relu function for the hidden layer. Input to the func is the size of the i/p layer, hidden layer and output layer. On top of that we provide the i/p dataset X, o/p dataset Y and a learning rate. The function returns optimal W1,b1,W2,b2. These are the steps in this function:
- calls func initialize_parameters() to init w,b
- It then iterates through gradient descent to find the optimal values of w,b that give the lowest cost. It runs a "for" loop for a predetermined number of iterations. Within each loop, it calls these functions:
- linear_activation_forward() => Given values of X,W1,b1, it calls func linear_activation_forward() with relu to get A1. It then calls linear_activation_forward() again with A1,W2,b2 and sigmoid to get A2 (i.e. Y hat). It returns A2 and cache.
- compute_cost() => Given A2,Y, it computes cost
- Then it calculates the initial back propagation gradient dA2 = - (np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
- linear_activation_backward => Given dA2 and cache, it calls linear_activation_backward to get dA1, dW2, db2. It then calls linear_activation_backward() again with dA1 and cache to get dA0, dW1,db1. It stores dW1,db1,dW2,db2 in dictionary grads.
- update_parameters() => This computes new values of parameters using old parameters and gradients from grads.
- In the beginning, w and b are initialized. We start the loop and in the first iteration, we run the 4 functions listed above to get new w,b based on dw, db, and the learning rate chosen. In the next iteration, we repeat the process with the newly computed values of w,b fed into the 4 functions to get even newer dw, db, and update w,b again. We keep repeating this process for "num_iterations", until we get optimal w,b which hopefully give a much lower cost than what we started with.
- It then returns dictionary "parameters" containing optimal W1,b1,W2,b2
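Roughly, two_layer_model() looks like this (a sketch built on the assignment 1 helpers; the default learning rate is my assumption):

import numpy as np
# assumes initialize_parameters(), linear_activation_forward(), compute_cost(),
# linear_activation_backward() and update_parameters() from assignment 1

def two_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=2500, print_cost=False):
    # 2 layer NN: LINEAR -> RELU -> LINEAR -> SIGMOID
    n_x, n_h, n_y = layers_dims
    parameters = initialize_parameters(n_x, n_h, n_y)
    for i in range(num_iterations):
        W1, b1 = parameters["W1"], parameters["b1"]
        W2, b2 = parameters["W2"], parameters["b2"]
        # forward pass
        A1, cache1 = linear_activation_forward(X, W1, b1, "relu")
        A2, cache2 = linear_activation_forward(A1, W2, b2, "sigmoid")
        cost = compute_cost(A2, Y)
        # backward pass, starting from dA2
        dA2 = -(np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
        dA1, dW2, db2 = linear_activation_backward(dA2, cache2, "sigmoid")
        dA0, dW1, db1 = linear_activation_backward(dA1, cache1, "relu")
        grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
        # gradient descent update
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost and i % 100 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
    return parameters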
- L_layer_model() => This function implements an L layer NN. It's just an extension of the 2 layer NN. Input to the func is the sizes of the i/p layer, hidden layers and output layer. On top of that we provide the i/p dataset X, o/p dataset Y and a learning rate. The function returns optimal W1,b1,...,WL,bL. These are the steps in this function:
- calls func initialize_parameters_deep() to init w,b
- It then iterates through gradient descent to find the optimal values of w,b that give the lowest cost. It runs a "for" loop for a predetermined number of iterations. Within each loop, it calls these functions:
- L_model_forward() => Given X and the parameters, it runs forward propagation through all L layers (relu for layers 1 to L-1, sigmoid for layer L), as described in assignment 1. It returns AL (i.e. Y hat) and caches.
- compute_cost() => Given AL,Y, it computes the cost
- L_model_backward() => Given AL, Y and caches, it computes the initial gradient dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) internally (as described in assignment 1), then works backward through the layers. It stores dW, db, dA for all layers in the dictionary grads.
- update_parameters() => This computes new values of parameters using old parameters and gradients from grads.
- As with the 2 layer model, we keep repeating this loop for "num_iterations", feeding the newly computed w,b back into the 4 functions each time, until we end up with optimal w,b which hopefully give a much lower cost than what we started with.
- It then returns dictionary "parameters" containing optimal W1,b1,...,WL,bL
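L_layer_model() follows the same 4-step loop, just using the L layer helpers (again a sketch; the default learning rate is my assumption):

# assumes initialize_parameters_deep(), L_model_forward(), compute_cost(),
# L_model_backward() and update_parameters() from assignment 1

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=2500, print_cost=False):
    # L layer NN: [LINEAR -> RELU] * (L-1) -> LINEAR -> SIGMOID
    parameters = initialize_parameters_deep(layers_dims)
    for i in range(num_iterations):
        AL, caches = L_model_forward(X, parameters)      # forward through all L layers
        cost = compute_cost(AL, Y)
        grads = L_model_backward(AL, Y, caches)          # dAL is computed inside
        parameters = update_parameters(parameters, grads, learning_rate)
        if print_cost and i % 100 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
    return parameters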
Below is the explanation of the main code (after we have defined our functions as above):
- We load our dataset X,Y by using the func load_data(). We then flatten X and normalize X (by dividing it by 255)
- We then run 2 NNs on our data: one is a 2 layer NN, while the other is an L layer NN. We can choose which one to run by setting the appropriate variable. The size of the i/p layer for both examples below is fixed at 12288 (64*64*3, which is the total number of data points associated with 1 picture). The size of the o/p layer is fixed at 1 (since our o/p contains just 1 entry: 0 or 1 for cat vs non cat). The size of the hidden layers is what we can play with, since it can be varied to any number we want.
- 2 layer NN:
- We call two_layer_model() on this X,Y training dataset. We give dim of i/p layer, hidden layer and output layer, and set num of iterations to 2500. Hidden layer size is set to 7.
- Then we call predict() to print the accuracy on both training data and test data; the test accuracy is pretty low, as expected.
- Then we print mislabeled images by calling func print_mislabeled_images.
- L layer NN:
- We call function L_layer_model() with i/p X,Y training dataset and number of hidden layers set to 3 (So, it's a 4 layer NN).
- Then we call predict() to print the accuracy of the L layer NN on both training data and test data, which is a lot higher than the 2 layer NN.
- Then we print mislabeled images by calling func print_mislabeled_images.
- Then we run the NN (2 Layer or L layer, depending on which one is chosen) on the 10 picture dataset that I downloaded from the internet (same as what we used in the course 1, week 2 example). These are all cat pictures. In predict(), we return "y hat" also, so we are able to see all the predicted values.
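Putting it together, the main flow looks roughly like this (a sketch; the predict()/print_mislabeled_images() argument order is as I recall it from dnn_app_utils_v3.py, so double-check against your copy):

# load the cat vs non-cat dataset from the h5 files
train_x_orig, train_y, test_x_orig, test_y, classes = load_data()

# flatten each 64x64x3 image into a 12288x1 column vector, then normalize to [0,1]
train_x = train_x_orig.reshape(train_x_orig.shape[0], -1).T / 255.
test_x = test_x_orig.reshape(test_x_orig.shape[0], -1).T / 255.

# option 1: 2 layer NN, 12288 -> 7 -> 1
parameters = two_layer_model(train_x, train_y, (12288, 7, 1), num_iterations=2500, print_cost=True)

# option 2: 4 layer NN, 12288 -> 20 -> 7 -> 5 -> 1
layers_dims = [12288, 20, 7, 5, 1]
parameters = L_layer_model(train_x, train_y, layers_dims, num_iterations=2500, print_cost=True)

# print accuracy on training and test data, then plot the wrongly classified test images
pred_train = predict(train_x, train_y, parameters)
pred_test = predict(test_x, test_y, parameters)
print_mislabeled_images(classes, test_x, test_y, pred_test)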
Results:
On running the above pgm, we see these results:
2 layer NN: It achieves 99.9% accuracy on training data, but only 72% on test data.
Cost after iteration 0: 0.693049735659989
...
Cost after iteration 2400: 0.048554785628770226
Accuracy: 0.9999999999999998
Accuracy: 0.72
When I run it thru my 10 random cat pictures downloaded from the internet, I get only 60% accuracy. Below are the A (y hat) values and the final predicted values. As can be seen, the accuracy is very low at 60%. Even for the ones that were predicted correctly, the y hat activation values are not close to 99% for all of them.
Accuracy: 0.6
prediction A [[0.2258492 0.88753723 0.04103057 0.97642935 0.87401607 0.85904489 0.49342905 0.99138362 0.96587573 0.3834667 ]]
prediction Y [[0. 1. 0. 1. 1. 1. 0. 1. 1. 0.]]
4 layer NN: It achieves 99% accuracy on training data and 80% accuracy on test data. For the 1st layer, size=20, 2nd layer size=7, 3rd layer size=5 and 4th layer size=1 (since it's the o/p layer). Size of the i/p layer is 12288.
Cost after iteration 0: 0.771749
.........
Cost after iteration 2400: 0.092878
Accuracy: 0.9856459330143539
Accuracy: 0.8
When we run the 4 layer NN thru the same 10 random cat pictures, I get 90% accuracy, which is a lot higher than the 2 layer NN. Below are the A (y hat) values and the final predicted values. As can be seen, even though the accuracy is 90%, the algorithm completely failed for picture 10, which is reported as 0.2, even though it's a perfect cat picture (maybe the background color made all the difference. Will need to check it with a different background color to see if it makes any difference). The other picture that is right on the borderline is the 6th picture. Here, maybe too much background noise (things around the cat) is causing the issue. Will need to check with a different background to see if that helps.
Accuracy: 0.9
prediction A [[0.99930858 0.97634997 0.96640157 0.9999905 0.95379876 0.5026841 0.92857836 0.99693246 0.99285584 0.21739979]]
prediction Y [[1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]]
Initialization of w,b: If we use the initialization multiplying factor of 0.01 instead of 1 / np.sqrt(layer_dims[l-1]), we get very bad accuracy: 65% on the training set and 34% on the test set. Even worse is the fact that on our 10 random cat images, we get 0% accuracy. All this from just using a different initialization number for the different layers. Perhaps this will be explored in the next lecture series.
This is what the initialization multiplying factor is for the different layers (instead of using a constant 0.01 for all layers, the factor gets larger as the size of the previous layer gets smaller):
l= 1 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(12288) = 0.009
l=2 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(20) = 0.22
l=3 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(7) = 0.38
l=4 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(5) = 0.45
NOTE: Do NOT forget to change this multiplying factor of 0.01 if you plan to use your own functions from assignment 1 above.
Summary:
Here we built a 2 layer NN (with 1 hidden layer) as well as an L layer NN (with L-1 hidden layers). We can play around with a lot of parameters here to see if our L layer NN (here we chose L=4) performs better with more layers, or more hidden units in each layer, or with different initialization values, or different learning rates, etc. It's hard to say which of these values will give us the optimal results without trying them out. This will be the topic of the Course 2 series.