Course 1 - week 2 - Neural Network Basics:

This is the first technical introduction to NN. Well, the material for this week doesn't really talk about NN; it talks about regression, and how to do linear and logistic regression. But in later weeks, you will see that these regressions are the simplest kind of NN. Logistic regression is a concept from statistics, but it forms the building block for AI.

For Linear and Logistic regression, see the AI section on "Statistics - Regression". That is all this week's lecture is about: doing binary classification on a picture with nx pixels, to find out if it's a cat or not. First, we give m such training pictures to our regression engine, let it find the weights that give the lowest cost, and then use those weights to predict on a test picture. If our weights are optimal, and the test picture is close to our training set pictures, then our regression algo should do a good job of classifying the picture correctly.

However, just from common sense it looks like this approach of simple regression will never work, as cats can come in any color, shape, position, background, etc. Regression is just matching pixels and trying to minimize distance; it has no spatial information (i.e. if 10 pixels next to each other form an eye, our logistic regression model doesn't care whether these 10 pixels are in 10 different corners of the picture or right next to each other).

As an example, consider an 8x8 pixel black and white picture. Each pixel can have 2 values: 0 for black and 1 for white. So, the total number of unique pictures possible is 2^(8*8) = 2^64. Our regression analysis is trying to go thru a limited set of such possible combinations and predict what each picture is going to be. It's impossible to do that even for an 8x8 pixel black and white picture. Just imagine how to do that for a 64x64 colored picture !! And then for even larger pictures. It's just not possible by a brute force "least error" regression technique. Something better has to be done. That's for later courses !!

 This week has a programming assignment that is an absolute must to complete if you want to learn AI. It walks you thru the simplest NN possible, which is actually logistic regression. All new concepts are developed. Take your time to finish this assignment.

Programming Assignment 1: This is a simple image recognition pgm. It reads a file of images to get trained (using whatever algorithm we choose; here we use logistic regression), and then we run the pgm on test images to see how well our algorithm works.

Here's the link to the pgm assignment:

Logistic_Regression_with_a_Neural_Network_mindset_v6a.html

This project has 2 python pgm that we need to understand.

A. lr_utils.py => this is a pgm that defines a function "load_dataset". We'll import this file in our main pgm. However, instead of writing it as a separate pgm, I copied the function defined in this file into the main python pgm.

The function load_dataset() reads 2 files: test data and training data. Below are the two h5 files that contain our training data and test data. Feel free to download the 2 files by right clicking and choosing "save link as" (If you directly click on the link below, it will open the h5 file in the browser itself, which will look like garbage as it's not a text file that the browser knows how to display):

train_catvnoncat.h5

test_catvnoncat.h5

1. training data: This data is used to train our algo. It has 209 training samples with label="train_set_x": 209 2D pictures, each 64x64 pixels, where each pixel has a triplet of R,G,B values.

2. testing data: This data is used to test our algo. It has 50 testing samples with label="test_set_x": 50 2D pictures, each 64x64 pixels, where each pixel has a triplet of R,G,B values.

 Below I'm writing the function "load_dataset" from lr_utils.py

import numpy as np
import h5py
    
def load_dataset():
    train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:]) # your train set features. We store this data into an array of 209X64X64X3
    train_set_y_orig = np.array(train_dataset["train_set_y"][:]) # your train set labels. This stores the type=0 for non cat and 1 for cat corresponding to 209 pictures.It's a 1D array with 209 elements, but since it's 1D, we convert it to 2D array as shown later

    test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:]) # your test set features. Similarly for test set, we have 50 pictures, array is 50X64X64X3
    test_set_y_orig = np.array(test_dataset["test_set_y"][:]) # your test set labels. This stores the type for these 50 pictures

    classes = np.array(test_dataset["list_classes"][:]) # the list of classes

    print("train = ",train_dataset, "test = ",test_dataset, "classes = ",classes,classes.shape)

    print("OLD", train_set_x_orig.shape, train_set_y_orig.shape, test_set_x_orig.shape, test_set_y_orig.shape)
    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))
    print("NEW", train_set_x_orig.shape, train_set_y_orig.shape, test_set_x_orig.shape, test_set_y_orig.shape)
    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes

 

result:

train =  <HDF5 file "train_catvnoncat.h5" (mode r)> test =  <HDF5 file "test_catvnoncat.h5" (mode r)> classes =  [b'non-cat' b'cat'] (2,) => train_dataset, test_dataset are just pointers. classes is a 1D array with just 2 string values [non-cat cat]

OLD (209, 64, 64, 3) (209,) (50, 64, 64, 3) (50,) => The y labels are 1D array here
NEW (209, 64, 64, 3) (1, 209) (50, 64, 64, 3) (1, 50) => The y labels have been converted into 2D arrays here (the X feature arrays are still 4D)

 

B. test_cr1_wk2.py => This pgm calls func load_dataset() defined in lr_utils, and we define our algorithm for logistic regression here to find optimal weights by training on the training data. We then apply those weights on the test data to predict whether a picture has a cat or not.

Below is the whole pgm, including the function defined in lr_utils.py

test_cr1_wk2.py

Below are the functions defined in our pgm:

  • sigmoid() => defines sigmoid func for any input z
  • initialize_with_zeros() => initializes w,b arrays with 0
  • propagate() => computes total cost. Given X, w, b, this func calculates activation A (which is the sigmoid function of linear eqn w1*x1+... wn*xn +b) and then computes cost (which is the log function of A,Y). Then it computes gradients dw, db. It stores dw, db in dictionary "grads". It returns scalar "cost" and dictionary "grads"
  • optimize() => This function iterates thru the cost function to find optimal values of w,b that give the lowest cost. It forms a "for" loop for a predetermined number of iterations. Within each loop, it calls function propagate() with the given values of X,w,b. In the beginning, w and b are 0. propagate() returns new dw,db. Then it updates w,b with new values based on dw, db, and the learning rate chosen. Then it starts with the next iteration. In the next iteration, it feeds the newly computed values of w,b into propagate() to get even newer dw, db, and updates w,b. It keeps on repeating this process for "num_iterations", until it gets to w,b which hopefully give a lot lower cost than what we started with (see the sketch after this list).
  • predict() => Given input picture array X, it predicts Y (i.e whether the pic is a cat or not). It uses the w,b calculated using the optimize function. We can provide a set of "n" pictures here in a single array X (we don't need to provide each pic individually as an array). This is done for efficiency purposes, as Prof Andrew explains multiple times in his courses.
  • model() => This is the main func that will be called in our pgm. We provide both training and test pictures as 2 big arrays as i/p to this func. It calls above functions as shown below:
    • calls func initialize_with_zeros() to init w,b,
    • then calls optimize() to optimize w,b to give lowest cost across the training set.
    • It then calls predict() to predict on any picture. predict is called twice for both training set and test set to predict cat vs non cat.
    • Accuracy is then reported for all pictures on what they actually were vs what our pgm predicted.
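
To make the flow of these helper functions concrete, below is a minimal numpy sketch of propagate(), optimize() and predict(). This is my own simplification under the descriptions above (variable names and the toy data are made up), not the graded assignment code.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def propagate(w, b, X, Y):
    # X is (nx, m), w is (nx, 1), Y is (1, m)
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)                      # activation, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = np.dot(X, (A - Y).T) / m                        # gradient wrt w
    db = np.sum(A - Y) / m                               # gradient wrt b
    return {"dw": dw, "db": db}, cost

def optimize(w, b, X, Y, num_iterations=1000, learning_rate=0.005):
    for i in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        w = w - learning_rate * grads["dw"]              # gradient descent update
        b = b - learning_rate * grads["db"]
    return w, b

def predict(w, b, X):
    A = sigmoid(np.dot(w.T, X) + b)
    return (A > 0.5).astype(float)                       # 1 = cat, 0 = non cat

# toy usage on random data (2 features, 5 examples)
X = np.random.rand(2, 5); Y = (np.random.rand(1, 5) > 0.5).astype(float)
w = np.zeros((2, 1)); b = 0.0
w, b = optimize(w, b, X, Y)
print(predict(w, b, X))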

Below is the explanation of main code (after we have defined our functions as above):

  1. We load the datset X,Y for m pictures stored in h5 files.
  2. Then we enter a loop, where we can repeat running this program as many times as we want, for whatever reason. NOT really needed.
  3. Inside the loop, we flatten and normalize the array X that we read from the dataset in the h5 file. We flatten the array of R,G,B pixels for each picture into shape (nx*nx*3,1). This flattening is done since our weight array is also flattened. We want one weight for each pixel value, so both the weight and pixel arrays have to be 1D, so that we can just multiply them directly as w1*x1+w2*x2+...+wn*xn. In our numpy implementation, we make them 2D arrays, but they still have only 1 row or col filled (i.e they behave like 1D). See the sketch after this list.
  4. Now we run function model() on array X (which already has m training pics in it), and find optimal w,b by running it on training set. Function model() then runs prediction() and reports prediction accuracy for both training and test set.
  5. Then we have a choice of trying various learning rates, and see the effect on minimal cost achieved by our pgm. Learning rates matter a lot, as we see by trying small/large rates.
  6. Then finally we have a choice of trying 10 diff random images (these images are in the all_data dir), which are predicted by calling predict(). The prediction value for each image is reported. We see that accuracy is bad (about 50%). Here we used the Image module from the PIL library; I couldn't get "imread" from matplotlib to work.
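
As a reminder of what the flattening/normalization step in point 3 actually does, here is a minimal sketch (the random data stands in for the real h5 arrays):

import numpy as np

# stand-in for the training set: 209 pictures of 64x64 pixels, 3 channels each
train_set_x_orig = np.random.randint(0, 256, size=(209, 64, 64, 3))

# flatten each picture into a single column of 64*64*3 = 12288 values
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
print(train_set_x_flatten.shape)   # (12288, 209): one column per picture

# normalize pixel values from 0..255 to 0..1
train_set_x = train_set_x_flatten / 255.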

Summary:

By finishing this exercise, we learnt how to do logistic regression to figure out optimal weight for each pixel of a picture so that it can predict a cat vs non cat picture.

 

Intro to Deep Learning: Course 1 - Week 1

This is very introductory material.

Neural Network (NN) is just taking a dataset and fitting it with an eqn, i.e. given input features X1, X2, ... Xn, and an output Y, we try to fit a complex eqn Y = F(X1,X2,...,Xn). Once we find this best fit eqn F, we use it to predict Y given X1,X2,...,Xn.

The process of getting this eqn is called network training. The term neural network came into being since this complex eqn that we get resembles a chain of neurons passing information from one to the next, until we get to the output stage. From statistics, we know how to find a best fit, but those eqns have a fixed, shallow form (i.e. Y = A*X + B*X^2 + C*X^3 + ...). They never worked well on fitting new data, while these neural network based fitting eqns work well on new data too. They are very good with unstructured data (i.e. identifying a cat from a picture), while conventional fitting algorithms were good with only structured data (i.e. predicting the price of a house based on age, size, location, etc).

Diff kind of NN:

1. Standard NN

2. Convolutional NN

3. Recurrent NN

Deep Learning (DL): NNs are called deep when they have a lot of layers. The reason DL is getting so popular is that it works amazingly well. The reason it works so well is that deep neural networks keep improving their prediction accuracy with more and more data, while earlier methodologies saturated and their prediction accuracy didn't improve even when they were loaded with more data.

 DL is very compute intensive since it needs to run thru large number of layers on lots of data.

 

Linear Functions:

Before we look into best fit functions, let's look at linear functions. Linear functions are functions that  satisfy these 2 requirements:

1. f(a*x) = a*f(x)

2. f(x+y) = f(x) + f(y)

These 2 requirements can be combined into one as f(a*x+b*y) = a*f(x)+b*f(y)

Linear functions are important as they state that any scaling and summation of linear functions is also linear and can be computed easily by separating the terms out. The first order polynomial f(x)=m*x+b is usually called linear (strictly, it satisfies the 2 requirements above only when b=0; with b≠0 it is affine, but we'll loosely call it linear here), while polynomials of higher order such as f(x) = a*x^2 + b*x + c aren't. But not all functions which look linear are linear. We'll see examples below.

Best Fit Functions:

AI is all about finding a best fit function for any set of data. We saw in an earlier article that for Logistic Regression, the sigmoid function is a good function for best fit. However, there is nothing special about a sigmoid function. From the Fourier theorem, we know that a sum of sine/cosine functions can represent any function f(x) (with some limitations, but we'll ignore those). In fact, any function f(x) can be represented as an infinite summation of polynomials of x (again with some limitations, but we'll ignore those). Sine/Cosine functions can also be represented as infinite summations of polynomials of x, so they are also able to represent any function f(x). Since any function can be represented as a polynomial of x (Taylor's theorem), that implies that any function f(x) can be represented as a summation of any other function g(x) that can itself be represented as an infinite summation of polynomials.

What about functions g(x) that are not infinite summations of powers of x? Let's say g(x)=4+2*x. Will g(x) be able to represent any function f(x)? Since any func f(x) is an infinite summation of polynomials, it can be approximated as a finite sum of polynomials too. Of course, the lower the number of polynomial terms we keep in the summation for f(x), the less accurate the representation of f(x) will be. Let's see this with an example:

ex: f(x) = 3 + 7*x + 4*x^2 + 9*x^3 + .....

If g(x) = 4+2*x, then we can write f(x) ≈ A*g(x). If we choose A=3, then 3*g(x)=12+6*x, which approximates f(x) though not exactly. Not only are the higher powers of x missing, but even the 1st 2 terms of f(x) don't match exactly with A*g(x). No matter how many linear combinations of g(x) we use, we can't match the 1st 2 terms of f(x).

i.e f(x) = A1*g(x) + A2*g(x) = (A1+A2)*g(x), which is the same as B*g(x). So, we don't achieve anything better by summing the same function g(x) with different coefficients.

However, if we define 2 linear functions, g1(x) and g2(x), where g1(x)=4+2*x, while g2(x)=1+3*x, then A1*g1(x) + A2*g2(x) can be made to represent 3+7*x, by choosing A1=1/5, A2=11/5. Thus we are able to match 1st 2 terms of f(x) exactly.

However, if we had flexibility in choosing g(x), then we would choose g(x)=3+7*x. Then the 1st 2 terms of f(x) would match exactly with g(x), by using just 1 func g(x)

Similarly, if g(x) is chosen to be 2nd degree polynomial, i.e g(x)=1+2*x+3*x^2, then we can choose g1(x), g2(x), g3(x) to be 3 different 2nd degree poly eqn, and approximate f(x)=A1*g1(x) + A2*g2(x) + A3*g3(x). Or if we had flexibility in choosing g(x), then we would choose g(x)=3+7*x+4*x^2. Then the 1st 3 terms of f(x) would match exactly with g(x).

Continuing the same way, the higher the order of g(x), the closer the approximation of f(x) by a linear summation of such functions g(x) will be.
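
The coefficients above come from simply matching terms, which is a small linear system. A quick numpy check (my own verification, not part of the course material):

import numpy as np

# match A1*g1(x) + A2*g2(x) to 3 + 7*x, where g1(x) = 4 + 2*x and g2(x) = 1 + 3*x
# constant terms: 4*A1 + 1*A2 = 3
# x terms:        2*A1 + 3*A2 = 7
M = np.array([[4.0, 1.0],
              [2.0, 3.0]])
rhs = np.array([3.0, 7.0])
A1, A2 = np.linalg.solve(M, rhs)
print(A1, A2)   # 0.2 2.2, i.e. A1=1/5 and A2=11/5 as stated above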

X as a multi dimensional vector:

Now let's consider eqns in n dimensions, where f is not an eqn in a single var "x", but in "n" vars x1,x2,...,xn, i.e. we define f(X) where X=(x1 x2 x3 .... xn).

Let's stick to 1st degree linear eqn g(x)=m*x+c. We define g1(x1)=m1*x1+c1, g2(x2)=m2*x2+c2, .... gn(xn)=mn*xn+cn

Then f(x1,x2,...,xn) = g1(x1)+g2(x2)+...+gn(xn) = m1*x1+c1 + m2*x2+c2 + .... mn*xn+cn = m1*x1 + m2*x2 + ... mn*xn + (c1 + c2 + ... + cn) = m1*x1 + m2*x2 + ... mn*xn + b (where b = c1+c2+...+cn)

So, for n dimensional space, if we choose g(x1,x2,...,xn) = m1*x1 + m2*x2 + ... + mn*xn + b, then we can get a best fit n dimensional plane for function f(x1,x2,...,xn). However, this approximation function is a 1st degree polynomial, so it doesn't have any curves or bends (just a flat plane). This is a linear function.
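
In numpy, such an n dimensional linear (flat plane) function is just a dot product plus a bias. A tiny sketch with made-up numbers:

import numpy as np

m = np.array([2.0, 5.0, -1.0])      # slopes m1..mn
b = 4.0                             # combined constant c1 + c2 + ... + cn
x = np.array([1.0, 0.5, 3.0])       # a point (x1, x2, x3)

g = np.dot(m, x) + b                # m1*x1 + m2*x2 + m3*x3 + b
print(g)                            # 2*1 + 5*0.5 - 1*3 + 4 = 5.5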

Linear function with bendings:

What if we are able to introduce a bend in the linear function g(x), so that it's not a straight line anymore? If we then add up these functions with bends, we can have any kind of bend desired at any point. Then we may be able to approximate any function f(x) by using a lot of these g(x) functions with bends.

Let's see this in 3D, since multidimensional is difficult to visualize. We write above f(x) in 2D as:

f(x,y)=m1*x + m2*y = 2*x+5*y

gnuplot> splot 2*x+5*y => As seen below, this plot is a plane

 

Now, we take a simple function called absolute function. It has a bend, and slopes of 2 lines for x<0 and x>0 are -ve of each other.

gnuplot> splot abs(x) => As seen below, this plot has a bend at x=0

 

 

Now, we plot the same function as first one, but this time with abs functions applied to x and y. As you can see, we have bends so that we can generate planes at different angles to fit complex curves.

gnuplot> splot 2*abs(x)+5*abs(y)

 

 Is the abs() function linear? It does look linear, but it has a bend (so it's 2 linear pieces over 2 ranges).

Let's pick 2 points: x1=1 and x2=-1. Then abs(x1+x2) = abs(1-1) = abs(0) = 0. However, if we compute abs(x1) and abs(x2), we get abs(1)=1 and abs(-1)=1. So, abs(x1)+abs(x2) = 2, which is not the same as abs(x1+x2) = 0. So, the abs() function is not linear. Similarly, any 1st order eqn with a bend is not linear.
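
The same 2 linearity requirements can be checked numerically. A quick sketch (the test functions are my own examples):

import numpy as np

def is_linear(f, trials=1000):
    # test f(a*x + b*y) == a*f(x) + b*f(y) for many random a, b, x, y
    for _ in range(trials):
        a, b, x, y = np.random.randn(4)
        if not np.isclose(f(a * x + b * y), a * f(x) + b * f(y)):
            return False
    return True

print(is_linear(lambda x: 3 * x))        # True: pure scaling is linear
print(is_linear(abs))                    # False: the bend at 0 breaks linearity
print(is_linear(lambda x: 2 * x + 4))    # False: the constant term also breaks strict linearity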

Taylor's theorem tells us that any function can be expanded into an infinite polynomial series. We should be able to find such a series for the abs(x) function.

Note: f(x) = abs(x) = √(x^2) = √(1+(x^2 - 1)) = √(1+t) where t = (x^2 - 1)

√(1+t) is a binomial series which can be expanded into Taylor series as explained here: https://en.wikipedia.org/wiki/Binomial_series#Convergence

(1+t)^1/2 = 1 + (1/2)t - (1/(2*4))t^2 + ((1*3)/(2*4*6))t^3 - ...

So, f(x) = abs(x) = 1 + (x^2-1)/2 - (1/(2*4))(x^2-1)^2 + ((1*3)/(2*4*6))(x^2-1)^3 - ... = [1-1/2-1/(2*4)-...] + x^2*[1/2+1/4+...] + x^4*[-1/(2*4)+....] + x^6*[...] + ...

Thus we see that we get a series expansion of abs(x) as a summation of even powers of x. So, it is indeed not a linear eqn. As it's an infinite summation, it can be used to represent any function as explained at the top of this article.
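
We can check numerically how well a truncated version of this series tracks abs(x). This is my own quick check; note that the series in t only converges for |t| = |x^2 - 1| <= 1, i.e. |x| <= sqrt(2), and convergence is slowest near x = 0:

import numpy as np

def abs_series(x, terms=200):
    # abs(x) = sqrt(1 + t) with t = x^2 - 1, expanded as the binomial series
    # (1+t)^(1/2) = 1 + (1/2)t - (1/8)t^2 + (1/16)t^3 - ...
    t = x * x - 1.0
    total, coeff = 1.0, 1.0
    for k in range(1, terms):
        coeff *= (0.5 - (k - 1)) / k     # next binomial coefficient C(1/2, k)
        total += coeff * t**k
    return total

for x in [0.0, 0.3, -0.7, 1.0]:
    print(x, abs(x), abs_series(x))      # close to |x|, except very near x = 0 where convergence is slow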

ReLU function:

Just as the absolute func has a bend and is not linear, many other linear-looking functions can be formed which have a bend but are not linear. One such function that is very popular in AI is ReLU (Rectified Linear Unit). Here, instead of having slope -1 for x<0, we make the slope 0 for x<0. This function is defined as below:

Relu(x) = x for x > 0, = 0 for x <= 0

gnuplot> f2(x)=(x>0) ? x : 0 #this is the eqn to get a ReLu func in gnuplot
gnuplot> splot f2(x)

The above plot looks similar to how abs(x) function looked like, except that it's 0 for all x <0.

Now, let's plot a function which is a difference of the 2 Relu plots.

gnuplot> splot f2(x+5)-f2(x-5)

The Relu plot above ( Relu(x+5) - Relu(x-5) ) now has 2 knees, at x=-5 and x=+5. It actually resembles a sigmoid function (explained below). However, it doesn't have smooth edges like the sigmoid func. Since the sigmoid function can fit any func, a linear sum of Relu funcs can also fit any func. The advantage with Relu is that it's close to linear (it's linear in 2 separate regions, although it's not linear overall), so derivatives are straightforward. A quick numpy check of this shape is sketched below.
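
Here is the sketch (same functions as the gnuplot commands above, just evaluated on a few sample points):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 5)          # [-10, -5, 0, 5, 10]
print(relu(x + 5) - relu(x - 5))     # [0, 0, 5, 10, 10]: flat, rising, then flat again
print(10 * sigmoid(x))               # a smooth curve with the same overall 0-to-10 shape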

There is a very good link here on why Relu functions work so well in curve fitting (and how they are non-linear in spite of giving an impression of a linear eqn):

https://towardsdatascience.com/if-rectified-linear-units-are-linear-how-do-they-add-nonlinearity-40247d3e4792

 

Sigmoid function:

The sigmoid function, being an exponential function, has higher powers of x in its expansion, instead of just having "x" (i.e. x, x^2, x^3, etc).

i.e σ(z) = 1 / (1 + e^(-z)) = A1 + A2*z + A3*z^2 + ... (Taylor expansion)

Sigmoid functions would fit better than the Relu functions above as they have higher orders of x (so they have smooth edges). However, they are also more compute intensive, and so are not used except when absolutely necessary.

Let's plot a 2D sigmoid function, where z=a*x+b*y. We use gnuplot to plot the functions below:

 f1(x,y,a,b)=1/(1+exp(-(a*x+b*y)))

Plot 0:

gnuplot> splot f1(x,y,1,4) => As seen below, this is a smooth function varying from 0 to 1. Looks kind of similar to difference of Relu function plotted above.

Plot 1:

gnuplot> splot f1(x,y,2,1) => As seen below, plot is same as that above, except that the slope direction is different

Plot 2:

gnuplot> splot (2*f1(x,y,1,4) + 4*f1(x,y,2,1)) => Here we multiply the above 2 plots by different weights and add them up. So, resulting plot is no longer b/w 0 and 1, but varies from 0 to 6.

 

We define another sigmoid function, which is in 1 dimension

 g(x) = 1/(1+exp(-(x)))

Plot 3:

gnuplot> splot g(2*f1(x,y,1,4) + 4*f1(x,y,2,1)) => Here we took sigmoid of above plot, so resulting plot is confined to be b/w 0 and 1. However, because of the weights we chose, resulting plot ranges from 0.5 to 1, instead of ranging from 0 to 1.

Plot 4:

gnuplot> splot g(-2*f1(x,y,1,4) + 4*f1(x,y,2,1)) => almost same plot as above, except that z range here is from 0.1 to 1 (by changing weight to -ve number)
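
The same range observations can be checked without gnuplot. A small numpy sketch of the same functions (only the min/max of each surface is printed):

import numpy as np

def f1(x, y, a, b):
    return 1 / (1 + np.exp(-(a * x + b * y)))      # 2D sigmoid of z = a*x + b*y

def g(x):
    return 1 / (1 + np.exp(-x))                    # 1D sigmoid

x, y = np.meshgrid(np.linspace(-10, 10, 201), np.linspace(-10, 10, 201))

combo = 2 * f1(x, y, 1, 4) + 4 * f1(x, y, 2, 1)    # weighted sum of the two surfaces (Plot 2)
print(combo.min(), combo.max())                    # roughly 0 to 6, as noted above

squashed = g(combo)                                # sigmoid of the sum (Plot 3)
print(squashed.min(), squashed.max())              # confined back to roughly 0.5 to 1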

 

Summary:

 Now we know that Relu and sigmoid functions are not linear, and in fact their expansions contain higher degree terms. As such, they can be used to represent any function, by using enough linear combinations of these functions. So, they can be used as fitting functions to fit any n dimensional function. These are used very frequently in AI to fit our training data. We will look at their implementation in the AI section.

Course 1 - week 4 - Deep Neural Network:

This is week 4 of Course 1. Here we generalize NN from 1 hidden layer to any number of hidden layers. The maths gets more complicated, but it's repeating the same thing as in the 2 Layer NN. A 2 layer NN has 1 hidden layer and 1 output layer. An L layer neural network has (L-1) hidden layers and 1 output layer. We don't count the input layer in the number of layers.

There are a few formulas here for forward and backward propagation. These form the backbone of DNN. These formulas are summarized here:

https://www.coursera.org/learn/neural-networks-deep-learning/supplement/E79Uh/clarification-about-what-does-this-have-to-do-with-the-brain-video

There is very good derivation of all these equations here:

https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60

There are 2 programming assignments in this week: first we build a 2 layer NN and then an L layer NN to predict cat vs non-cat in given pictures. The 2 Layer NN is just a repeat of last week's exercise, while the L layer NN is a generalization of the 2 layer NN.

Programming Assignment 1: Here we build helper functions to help build a deep NN.  We also build helper function for a 2 layer NN separately.

Here's the link to the pgm assignment:

Building_your_Deep_Neural_Network_Step_by_Step_v8a.html

This project has 3 python pgm that we need to understand.

A. testCases_v4a.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I have them turned on.

testCases_v4a.py

B. dnn_utils_v2.py => this is a pgm that defines a couple of functions.

dnn_utils_v2.py

These functions are:

  • sigmoid(): This calculates sigmoid for a given Z (Z can be scalar or an array). Output returned is both A (which is sigmoid of Z), and cache (which is same as i/p Z)
  • sigmoid_backward(): This calculates dZ given dA and Z. dZ = dA*σ(Z)*[1-σ(Z)]. We stored Z in cache (in sigmoid() above)
  • relu(): This calculates relu for a given Z (Z can be scalar or an array). Output returned is both A (which is relu of Z), and cache (which is same as i/p Z)
  • relu_backward(): This calculates dZ given dA and Z. dZ = dA for A>0 else dZ=0. We stored Z in cache (in relu() above)
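
For reference, here is a minimal numpy sketch of what these 4 helpers look like, mirroring the descriptions above (the actual assignment code may differ in small details):

import numpy as np

def sigmoid(Z):
    A = 1 / (1 + np.exp(-Z))
    return A, Z                      # cache is just the input Z

def sigmoid_backward(dA, cache):
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)          # dZ = dA * sigma(Z) * (1 - sigma(Z))

def relu(Z):
    A = np.maximum(0, Z)
    return A, Z                      # cache is just the input Z

def relu_backward(dA, cache):
    Z = cache
    dZ = np.array(dA, copy=True)
    dZ[Z <= 0] = 0                   # gradient passes thru only where Z > 0
    return dZ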

We'll import this file in our main pgm.

C. test_cr1_wk4_ex1.py => This pgm just defines the helper functions that we'll call in our 2 layer and L layer NN model that we define in assignment 2. Below is the whole pgm:

test_cr1_wk4_ex1.py

Below are the functions defined in our pgm:

  • initialize_parameters() => This function is exactly the same as the previous week's function for the 2 Layer NN. Input to the func is the size of the i/p layer, hidden layer and output layer. It initializes the W1,b1 and W2,b2 arrays. W1, W2 are init with random values (Very important to have random values instead of 0), while b1,b2 are init to 0. It puts these 4 arrays in dictionary "parameters" and returns that. NOTE: To be succinct, we will use w,b to mean W1,b1,W2,b2, going forward.
  • initialize_parameters_deep() => This initializes w,b for an L layer NN (same as for the 2 layer NN, but extended to L layers). i/p is an array containing the sizes of all the layers, while o/p is initialized W1,b1, W2,b2, .... WL,bL for the L layer NN. All weights and biases are stored in dictionary "parameters"
  • Forward functions: These are functions for forward computation:
    • linear_forward() => It computes output Z, given i/p A, W, b: Z = np.dot(W,A)+b. This is calculated for a single layer, using i/p A (which is the o/p from the previous layer) to compute Z. It returns Z and linear_cache, which is a tuple containing (Aprev,W,b), where Aprev is from the previous layer, while W, b are for the current layer.
    • linear_activation_forward() => This computes activation A for the Z that we calculated above for layer "l". The reason we separated out the 2 functions for computing Z and A is that A requires 2 diff functions, sigmoid or relu (depending on which one we want to use for the current layer: sigmoid is used for the output layer, while relu is used for all other layers). This keeps the code clean.
      • We call following functions:
        • linear_forward() => returns Z, linear_cache
        • sigmoid() => returns A, activation_cache
        • relu() => returns A, activation_cache
      • We store all relevant values in tuple cache:
        • linear_cache => stores tuple (Aprev,W,b), where Aprev is for previous layer, while W, b are for current layer.
        • activation_cache => stores computed Z for current layer
        • cache => stores tuple (linear_cache, activation_cache) = (Aprev, W, b, Z). In previous week example, we used cache to store A, Z for both layers (A1, Z1, A2, Z2), but here we store W, b too for each layer on top of A (for previous layer) and Z (for current layer) in tuple cache.
      • The function finally returns A for current layer and cache. So, we end up returning (Aprev, W, b, Z, A), where Aprev is for previous layer, while W, b, Z, A are for current layer.
    • L_model_forward() => This function does the forward computation starting from i/p X and generating o/p Y hat (i.e output AL for the last layer L). This is the same as the forward_propagation() function that we used in last week's example. It's just more complicated now, since it involves L layers instead of just 2 layers. We define tuple "caches", which is just all the cache entries appended.
      • From layer 1 to layer (L-1) (hidden layers), we call function linear_activation_forward()  with "Relu" function in a for loop (L-1) times
        • In each loop, cache and A are returned for that layer. A is used in the next iteration, while cache is appended to tuple "caches"
      • For last layer L (o/p layer), we again call function linear_activation_forward(), but this time with "sigmoid" function
        • cache and AL are returned for the last layer. AL is going to be used in the compute_cost() function (defined below), while cache is appended to tuple "caches"
  • compute_cost() => computes cost (which is the log function of AL,Y).
  • Backward functions: These are functions for backward computation. They are the counterparts of the forward functions above, just going backward from layer L to layer 1 (a compressed sketch of the forward/backward helpers appears after this list).
    • linear_backward() => This is the backward counterpart of linear_forward() func. Given i/p cache and dZ for a given layer, it computes gradients dW, db, dA. Input cache stores tuple (Aprev, W, b). NOTE: dW computation requires A from previous layer
      •     A_prev, W, b = cache 
      •     dW = 1/m * np.dot(dZ,A_prev.T)
      •     db = 1/m * np.sum(dZ,axis=1,keepdims=True)
      •     dA_prev = np.dot(W.T,dZ)
    • linear_activation_backward() => This is the backward counterpart of linear_activation_forward() func. Instead of computing A from Z, this computes dA for previous layer given dA (from which dZ is computed) for current layer.
      • We call following functions (same as what used in linear_activation_forward(), but now in backward dirn):
        • sigmoid_backward() => returns dZ given dA for sigmoid func
        • relu_backward() => returns dZ given dA for relu func
        • linear_backward()=> using dZ returned by sigmoid/relu backward func above, it computes dA_prev, which is dA for previous layer (since we are going in reverse dirn)
      • The function finally returns dA for previous layer and dW, db for current layer.
    • L_model_backward() => This is the backward counterpart of L_model_forward(). This function does the backward computation starting from o/p Y hat (i.e output AL for last layer L) and going all the way back to the input X. It returns dictionary "grads" containing dW, db, dA.
      • dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
      • dA{L-1}, dWL, dbL => computed using func linear_activation_backward() for layer L. Uses dAL from above as i/p to this func
      • Now, we run a loop from layer L-1 to layer 1 to compute dA, dW, db for each layer "l"
        • dA{l-1}, dWl, dbl => computed using func linear_activation_backward() for layer "l". Uses dAl from the prev iteration as i/p to this func to compute dA{l-1}. It uses dA{L-1} from above for l=L-1 to compute dA{L-2} and then keeps iterating backward.
      • Finally it returns dictionary grads containing dW, db, dA for each layer
  • update_parameters() => This function is the same as that in the previous week's exercise. It computes new w,b given old w,b and dw,db, using the learning rate provided. This is done for w,b of all layers 1 to L (i.e W1=W1-learning_rate*dW1, b1=b1-learning_rate*db1, .... , WL=WL-learning_rate*dWL, bL=bL-learning_rate*dbL)
  • 2_layer_model()/L_layer_model() => These are the main funcs, but they are not called here. They are part of assignment 2.
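
To tie the pieces together, here is a compressed numpy sketch of linear_forward(), linear_backward() and update_parameters(), following the formulas listed above. This is a simplification under my own naming; the assignment's versions also handle the caches and activation choices described above:

import numpy as np

def linear_forward(A_prev, W, b):
    Z = np.dot(W, A_prev) + b                     # Z for the current layer
    cache = (A_prev, W, b)                        # kept for the backward pass
    return Z, cache

def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m                 # gradient wrt current layer's W
    db = np.sum(dZ, axis=1, keepdims=True) / m    # gradient wrt current layer's b
    dA_prev = np.dot(W.T, dZ)                     # gradient passed to the previous layer
    return dA_prev, dW, db

def update_parameters(parameters, grads, learning_rate):
    L = len(parameters) // 2                      # number of layers with W,b
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters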

 

Programming Assignment 2: Here we use the helper functions defined above in assignment 1 to build a 2 Layer shallow NN and an L layer deep NN. We find optimal weights using training data and then apply those weights on test data to predict whether the picture has a cat or not.

Here's the link to the pgm assignment:

Deep+Neural+Network+-+Application+v8.html

This project has 2 python pgm that we need to understand.

A. dnn_app_utils_v3.py => this is a pgm that defines all the functions that we defined in assignment 1 above (both from dnn_utils_v2.py and test_cr1_wk4_ex1.py). So, we can either use our functions from assignment 1 or use the functions in here. If you wrote all the functions in assignment 1 correctly, then they should match all the functions in this pgm below (except for a few differences noted below).

dnn_app_utils_v3.py

The few differences to note in above pgm are:

  • load_data() function: This function is extra here. It is exactly the same as load_dataset() that we used in the week 2 assignment to load the cat vs non-cat dataset. Here too we load the same cat vs non-cat dataset that's in the h5 files.
  • predict(): This prints accuracy for any i/p set X (which can have multiple pictures in it). It uses w,b and generates output y hat for the given X. If y hat > 0.5, it predicts cat, else non-cat. It then compares the results to the actual y values, and prints accuracy. It calls only 1 function => L_model_forward. It returns probability array "p" for all pictures. I added an extra var "probas" (which is the output value y hat), so that we can see how close or far off the different predictions were, whether correct or wrong. This gives us a sense of how our algorithm is doing.
  • print_mislabeled_images(): This takes as i/p the dataset X,Y along with the predicted Y hat, and plots all images that aren't the same as what was predicted (i.e wrongly classified)
  • IMPORTANT: initialize_parameters_deep() function: This function is the same as what we wrote in assignment 1 above, with a subtle difference. Here we use a different number to initialize w. Instead of multiplying the random number by 0.01, we multiply it by 1 / np.sqrt(layer_dims[l-1]) for a given layer l. As you will see, this makes a lot of difference in getting the cost low. With 0.01, our cost starts at 0.693148 and remains at 0.643985 at iteration 2400. Accuracy for training data remains low at 0.65. However, using the new sqrt multiplier, our cost starts at 0.771749 and goes down to 0.092878 at iteration 2400, giving us a training data accuracy of 0.98. A sketch of this initialization is shown below.
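
Here is a minimal sketch of that initialization difference (the layer sizes are the ones used later in this assignment; the poorly performing 0.01 variant is shown commented out):

import numpy as np

def initialize_parameters_deep(layer_dims):
    np.random.seed(1)
    parameters = {}
    for l in range(1, len(layer_dims)):
        # scaled initialization: divide by sqrt of the previous layer's size
        parameters["W" + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    / np.sqrt(layer_dims[l - 1]))
        # the variant that converges poorly multiplies by a flat 0.01 instead:
        # parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

params = initialize_parameters_deep([12288, 20, 7, 5, 1])
print(params["W1"].shape, params["W2"].shape)   # (20, 12288) (7, 20)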

We'll import this file in our main pgm below

B. test_cr1_wk4_ex2.py => This pgm calls functions in dnn_app_utils_v3.py. Here, we define our algorithm for the 2 layer NN and L layer NN by calling the functions defined above. We find optimal weights by training on the training data. We then apply those weights on the test data to see how well our NN predicts cat vs non-cat. Below is the whole pgm:

test_cr1_wk4_ex2.py

Below are the functions defined in our pgm:

  • two_layer_model() => This function implements a 2 layer NN. It is mostly the same as the previous week's function for a 2 Layer NN, which was called nn_model(). The big difference is that there we used the tanh() function for the hidden layer, while here we'll use the relu function for the hidden layer. Input to the func is the size of the i/p layer, hidden layer and output layer. On top of that we provide i/p dataset X, o/p dataset Y and a learning rate. The function returns optimal W1,b1,W2,b2. These are the steps in this function:
    • calls func initialize_parameters() to init w,b
    • It then iterates thru cost function to find optimal values of w,b that gives the lowest cost. It forms a "for" loop for predetermined number of iterations. Within each loop, it calls these functions:
      • linear_activation_forward() => Given values of X,W1,b1, it calls func linear_activation_forward() with relu to get A1. It then calls linear_activation_forward() again with A1,W2,b2 and sigmoid to get A2 (i.e Y hat). It returns A2 and cache.
      • compute_cost() => Given A2,Y,  it computes cost
      • Then it calc initial back propagation for dA2 = - (np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))
      • linear_activation_backward => Given dA2 and cache, it calls linear_activation_backward to get dA1, dW2, db2. It then calls linear_activation_backward()  again with dA1 and cache to get dA0, dW1,db1. It stores dW1,db1,dW2,db2 in dictionary grads.
      • update_parameters() => This computes new values of parameters using old parameters and gradients from grads.
    • In the beginning, w and b are initialized. We start the loop, and in the first iteration we run the 4 functions listed above to get new w,b based on dw, db, and the learning rate chosen. Then we start with the next iteration. In the next iteration, we repeat the process with the newly computed values of w,b fed into the 4 functions to get even newer dw, db, and update w,b. We keep repeating this process for "num_iterations", until we get optimal w,b which hopefully give a lot lower cost than what we started with.
    • It then returns dictionary "parameters" containing optimal W1,b1,W2,b2
  • L_layer_model() => This function implements an L layer NN. It's just an extension of the 2 layer NN. Input to the func is the sizes of the i/p layer, hidden layers and output layer. On top of that we provide i/p dataset X, o/p dataset Y and a learning rate. The function returns optimal W1,b1,...,WL,bL. These are the steps in this function:
    • calls func initialize_parameters_deep() to init w,b
    • It then iterates thru cost function to find optimal values of w,b that gives the lowest cost. It forms a "for" loop for predetermined number of iterations. Within each loop, it calls these functions:
      • L_model_forward() => Given X and the current W,b for all layers, it calls linear_activation_forward() with relu in a loop for the (L-1) hidden layers, and then with sigmoid for the last layer to get AL (i.e Y hat). It returns AL and the list "caches".
      • compute_cost() => Given AL,Y, it computes cost
      • Then the initial back propagation term is dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
      • L_model_backward => Given AL, Y and caches, it works backward thru the layers, calling linear_activation_backward() with sigmoid for the output layer and relu for the hidden layers, to get dA{l-1}, dWl, dbl for every layer. It stores all dW, db in dictionary grads.
      • update_parameters() => This computes new values of parameters using old parameters and gradients from grads.
    • In the beginning, w and b are initialized. We start the loop, and in the first iteration we run the 4 functions listed above to get new w,b based on dw, db, and the learning rate chosen. Then we start with the next iteration. In the next iteration, we repeat the process with the newly computed values of w,b fed into the 4 functions to get even newer dw, db, and update w,b. We keep repeating this process for "num_iterations", until we get optimal w,b which hopefully give a lot lower cost than what we started with.
    • It then returns dictionary "parameters" containing optimal W1,b1,...,WL,bL
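
For completeness, below is a self-contained, compressed sketch of an L layer training loop in the same spirit as L_layer_model(). It inlines the helper functions instead of calling them, and runs on made-up toy data, so it only illustrates the call order described above, not the actual graded code:

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def L_layer_model(X, Y, layer_dims, learning_rate=0.0075, num_iterations=2500):
    np.random.seed(1)
    L = len(layer_dims) - 1
    params = {}
    for l in range(1, L + 1):                     # initialize_parameters_deep
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) / np.sqrt(layer_dims[l - 1])
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))

    m = X.shape[1]
    for i in range(num_iterations):
        # forward pass (L_model_forward): relu for hidden layers, sigmoid for the o/p layer
        A, caches = X, []
        for l in range(1, L + 1):
            A_prev = A
            Z = np.dot(params["W" + str(l)], A_prev) + params["b" + str(l)]
            A = sigmoid(Z) if l == L else relu(Z)
            caches.append((A_prev, Z))
        AL = A

        # compute_cost
        cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m

        # backward pass (L_model_backward): start from dAL and walk the layers in reverse
        dA = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
        grads = {}
        for l in range(L, 0, -1):
            A_prev, Z = caches[l - 1]
            if l == L:
                s = sigmoid(Z)
                dZ = dA * s * (1 - s)             # sigmoid_backward
            else:
                dZ = np.where(Z > 0, dA, 0)       # relu_backward
            grads["dW" + str(l)] = np.dot(dZ, A_prev.T) / m
            grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
            dA = np.dot(params["W" + str(l)].T, dZ)

        # update_parameters
        for l in range(1, L + 1):
            params["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
            params["b" + str(l)] -= learning_rate * grads["db" + str(l)]

        if i % 500 == 0:
            print("Cost after iteration", i, ":", cost)
    return params

# toy usage: 20 random "pictures" of 12288 values each, with random cat/non-cat labels
X = np.random.rand(12288, 20)
Y = (np.random.rand(1, 20) > 0.5).astype(float)
params = L_layer_model(X, Y, [12288, 20, 7, 5, 1], num_iterations=100)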

Below is the explanation of main code (after we have defined our functions as above):

  1. We load our dataset X,Y by using func load_data(). We then flatten X and normalize it (by dividing by 255).
  2. We then run 2 NNs on our data: one is the 2 layer NN, while the other is the L layer NN. We can choose which one to run by setting the appropriate variable. The size of the i/p layer for both examples below is fixed to 12288 (64*64*3, which is the total number of data points associated with 1 picture). The size of the o/p layer is fixed to 1 (since our o/p contains just 1 entry: 0 or 1 for cat vs non-cat). The size of the hidden layers is what we can play with, since it can be varied to any number we want.
    1. 2 layer NN:
      1. We call two_layer_model()  on this X,Y training dataset. We give dim of i/p layer, hidden layer and output layer, and set num of iterations to 2500. Hidden layer size is set to 7.
      2. Then we call  predict() to print accuracy on both training data and test data which is pretty low as expected.
      3. Then we print mislabeled images by calling func print_mislabeled_images.
    2. L layer NN:
      1. We call function L_layer_model() with i/p X,Y training dataset and number of hidden layers set to 3 (So, it's a 4 layer NN).
      2. Then we call  predict() to print accuracy of L NN on both training data and test data,  which is lot higher than 2 layer NN.
      3. Then we print mislabeled images by calling func print_mislabeled_images.
  3. Then we run the NN (2 Layer or L layer, depending on which one is chosen) on our 10 picture dataset that I downloaded from the internet (same as what we used in the week 2 example). These are all cat pictures. In predict(), we also return "y hat", so we are able to see all predicted values.

Results:

On running above pgm, we see these results:

2 layer NN: It achieves 99.9% accuracy on training data, but only 72% on test data.

 Cost after iteration 0: 0.693049735659989

...

Cost after iteration 2400: 0.048554785628770226
Accuracy: 0.9999999999999998
Accuracy: 0.72

When I run it thru my 10 random cat pictures downloaded from the internet, accuracy is very low at 60%. Below are the A (y hat) values and the final predicted values. Even for the ones that were predicted correctly, the y hat activation values are not close to 99%.

Accuracy: 0.6
prediction A [[0.2258492  0.88753723 0.04103057 0.97642935 0.87401607 0.85904489 0.49342905 0.99138362 0.96587573 0.3834667 ]]
prediction Y [[0. 1. 0. 1. 1. 1. 0. 1. 1. 0.]]

4 layer NN: It achieves 99% accuracy on training data and 80% accuracy on test data. The 1st hidden layer has size=20, the 2nd size=7, the 3rd size=5, and the 4th (o/p) layer has size=1. The size of the i/p layer is 12288.

Cost after iteration 0: 0.771749

.........

Cost after iteration 2400: 0.092878
Accuracy: 0.9856459330143539
Accuracy: 0.8

 As in the 2 Layer NN case, when we run the 4 layer NN thru the same 10 random cat pictures, I get 90% accuracy, which is a lot higher than the 2 layer NN. Below are the A (y hat) values and the final predicted values. As can be seen, even though accuracy is 90%, the algorithm completely failed for picture 10, which is reported as 0.22, even though it's a perfect cat picture (maybe the background color made all the difference; will need to check it with a different background color to see if it makes any difference). The other picture that is right on the borderline is the 6th picture. Here, maybe too much background noise (things around the cat) is causing the issue. Will need to check with a different background to see if that helps.

Accuracy: 0.9

prediction A [[0.99930858 0.97634997 0.96640157 0.9999905  0.95379876 0.5026841 0.92857836 0.99693246 0.99285584 0.21739979]]
prediction Y [[1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]]

Initialization of w,b: If we use the initialization multiplying factor of 0.01 instead of 1 / np.sqrt(layer_dims[l-1]), we get very bad accuracy: 65% on the training set and 34% on the test set. Even worse is the fact that on our 10 random cat images, we get 0% accuracy. This is all from just using a different initialization number for different layers. Perhaps this will be explored in the next lecture series.

This is what the initialization multiplying factor is for different layers (instead of using a constant 0.01 for all layers, the value increases as the size of the previous layer decreases):

l= 1 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(12288) = 0.009

l=2 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(20) = 0.22

l=3 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(7) = 0.38

l=4 => 1 / np.sqrt(layer_dims[l-1]) = 1 / np.sqrt(5) = 0.45

NOTE: Do NOT forget to change this multiplying factor of 0.01 if you plan to use your own functions from assignment 1 above.

Summary:

Here we built a 2 layer NN (with 1 hidden layer) as well as an L layer NN (with L-1 hidden layers). We can play around with a lot of parameters here to see if our L layer NN (here we chose L=4) performs better with more layers, more hidden units in each layer, different initialization values, different learning rates, etc. It's hard to say which of these values will give us the optimal results without trying them out. This will be the topic for the Course 2 series.

 

all_* cmds:

These all_* cmds return a collection of objects of that type. Various options control what attr we want for that object collection.

 



all_clocks => creates a collection of all clocks. No args for this cmd. ex: all_clocks => returns all clks in design

 



all_inputs => Creates a collection of all input ports.

ex: all_inputs -clock CLK1 => returns only those i/p ports that are clocked by CLK1

 



all_outputs => Creates a collection of all output ports.

ex: all_outputs -clock CLK1 => returns only those o/p ports that are clocked by CLK1

 



all_registers => Creates a collection of register (flip flop or latch) cells or pins. This is a very useful cmd to trace all flops/latches fired by a particular clk. Particularly helpful during clk tree debug, as it shows only the sink endpoints. The only endpoint missing from this collection is any output port connected to the clock. Lots of arguments possible.

syntax: all_registers <options>

options:

  • -clock <clk_name> => only returns reg clocked by given clk. We can provide only 1 clk name here as providing multiple clks will error out (SEL-006 Error). To see only flops or latches, use option <-edge_triggered | -level_sensitive>. For flops, we may also specify -rise_clock <clk_name> or -fall_clock <clk_name> to see only flops which are triggered by either rising or falling edge of given clk.
  • <-clock_pins | -data_pins | -output_pins | -async_pins> By default, the reg cell name is shown (or use option -cells), but we can report the corresponding pins of the cells by specifying these options. -clock_pins is the most useful option.
  • <-no_hierarchy> => Considers only the current instance; does not descend the hierarchy. This is useful to isolate regs in different modules

ex: all_registers -clock CLK1 => -clock returns only those regs that are clocked by CLK1. Without the -clock option, all regs are shown (irrespective of whether they are clocked in the design or not), but by adding -clock CLK1, only regs actively driven by CLK1 are shown (i.e not disabled or tied off). The resulting collection can be passed thru a foreach_in_collection loop.

 



all_clocks => Creates a collection of all clocks in the design. Fast and easy way to see all clks. No options supported. We generally use the get_clocks cmd to get clocks. See the "clk cmd" section for details.

 


 

all_fanin => Creates a collection of pins, ports, or cells in the timing fanin of specified objects (pins, ports or cells), specified via -to. The fanin stops at the timing startpoints (clock pins of registers or PI). Since most of the time we are not interested in the whole path but just the startpoints, we use option "-startpoints_only" to see only the startpoints and not the whole path. There are many other options as follows:

 

  • -from/-through can be used to restrict the fanin thru specified pins, ports or cells.
  • -only_cells includes cells only (and not pins/ports) in timing fanin.
  • -flat should be used to traverse fanin across hier, else by default fanin doesn't cross hier.
  • -trace_arcs may be used to control what kind of combinational arcs to trace. By default (or -trace_arcs timing), only valid timing arcs are traversed (disabled arcs + invalid case analysis arcs not traversed) , but by using "-trace_arcs enabled", invalid case analysis paths are also traced (disabled arcs are still not traced). By using "-trace_arcs all", both disabled arcs as well as invalid case analysis paths are traced.
  • -levels allows us to stop traversal on reaching a depth of a certain number of levels from the objects in the -to list. So, "-levels 1" will go only 1 level deep. This allows us to see paths one depth at a time.
  • -continue_trace generated_clock_source => This option is very useful for traversing clock network paths, as it allows tracing thru the source pin of generated clocks, instead of stopping at the seq pin of the gen clk source. In most cases, you will want to use this option.
    • IMP: For a clk gater cell, the generated clk is sometimes defined on the o/p pin or clk pin of the clk gater. This is done in cases where we want the generated clk to be defined as async to the parent clk (as an example, bist clks are defined on the clk gaters, and then the bist clks are declared async to the func clk). In such cases, all_fanin will stop at the generated clk pin, as that's a timing startpoint. If we define the gen clk on the Q pin and do all_fanin, then the fanin will stop at the Q pin. There will be no fanin from Q to the CP pin, unless we use this option (or we define the gen clk on the CP pin).

ex: all_fanin -flat -startpoints_only -to mod1/..reg_2/D => shows startpoints only (not whole path) of all fanin to the D pin of this reg. Startpoints may be PI or clk pin of other flops or Q pin of clkgaters.

 

report_transitive_fanin => This is a reporting cmd, but is included here since it's very similar to all_fanin. It produces a report showing the transitive fanin (not the timing fanin as in all_fanin) of specified objects (pins, ports or nets), specified via -to. We can provide -from/-through to constrain the fanin. A pin is considered to be in the transitive fanin of an object if there is a timing path through combinational logic from the pin to that object. So, not sure how it's different than the timing fanin of the all_fanin cmd. We can use the -trace_arcs option as in the all_fanin cmd. The fanin stops at the clock pins of registers or PI. Fanin is reported within the current instance, so if we want to see all fanin, the current instance should be set to the top module. NOTE: this is a reporting cmd, so it can't be used in scripts (as it doesn't o/p a collection)

ex: report_transitive_fanin -to FF1/D => Shows the driver of the i/p pin of the flop (FF1/D pin), then the drivers of the i/p pins of that driver, and so on until it gets to PI or clock pins of regs.

 


 

all_fanout => same as all_fanin except that it reports objects in the timing fanout. Here, -from specifies the objects whose fanout we want (for fanin, we used -to). The fanout stops at timing endpoints (D or other i/p pins of registers, or PO). Again, option "-endpoints_only" may be used to report only the endpoints instead of the whole path. There's a -clock_tree option to constrain the search to objects in the clock network only (-clock_tree and -from are exclusive; only one of them can be used). All other options are the same as for all_fanin.

ex: all_fanout -flat -endpoints_only -from mod1/..or2/Y => shows endpoints only for all fanout from the Y pin of this OR gate.

report_transitive_fanout => this is similar to report_transitive_fanin, except that it gives the fanout report. However, there is an additional option "-clock_tree" as in all_fanout.

ex: report_transitive_fanout -from FF1/Q => Shows load of o/p pin of flop, then the driver pin of that load, and the load connected to that pin and so on.

 


 

all_connected => Creates a collection of objects connected to a specified net, pin, or port object (or a collection containing exactly 1 net, pin or port object). The -leaf option, when used with a net, returns global or leaf pins. This is very useful to see all the objects connected to a given net, and then trace thru a given path.

ex: all_connected [get_nets CLOCK] => shows all objects connected to net "CLOCK"

Get all connected pins of a net: There are 2 ways: one is shown under the "PT - object access functions" section, which uses "get_pins -of_objects ... -leaf"; the other is "all_connected ... -leaf".

  • ex: all_connected  mod1/IO_port6 -leaf => Here it shows all leaf pins of gates connected to this port of module (ports of modules are actually pins, since ports are only for top level)
    {"mod2/GATE_and2_0/ZN", "mod1/mod3/I_OR3/A", "mod4/I_DFF/CP"}

 


 

NOTE: Above 3 cmds along with report_transitive_fanin/report_transitive_fanout are used to debug and trace timing paths, when we want to see the logic structure. We can see logic structure by bringing up gate level schematic of the netlist in any other tool (such as Verdi), but advantage here is that it has the ability to show only valid timing paths after accounting for inactive case_analysis and disabled timing arcs. This helps to find out where case_analysis may not have been set correctly, or why some timing path abruptly ends.