2.3 - Hyperparameter tuning, Batch Normalization and Programming Frameworks
Course 2 - week 3 - Hyperparameter tuning, Batch Normalization and Programming Frameworks
This week's course is divided into various sections. The first two sections are a continuation of the previous week's material. The last section is about using programming frameworks, which is totally new and will require some time to understand.
Hyperparameter tuning:
There are various hyperparameters, which we saw in previous sections, that need to be tuned for our NN. Ranked by their effect on NN performance, these are:
1. Learning rate (alpha): The most important hyperparameter to tune. Not choosing this value properly may cause large oscillations in the cost function instead of a smooth descent to the minimum.
2. Mini batch size, number of hidden units and momentum (beta): These are second in importance.
3. Number of layers (L), learning rate decay, Adam parameters (beta1, beta2, epsilon): These are last in importance. Adam hyperparameters (beta1=0.9, beta2=0.999, epsilon=10^-8) are usually not tuned, as these values work well in practice.
It's hard to know in advance which hyperparameter values will work, so we try random values of these hyperparameters from within a bounding box (maybe changing 2 at a time, or 3, or even more). Once we find a smaller bounding box where the hyperparameters seem to perform better, we use the "coarse to fine" technique and keep trying finer values within it, until we get close to the optimal hyperparameters.
We need to choose the scale on which to sweep each hyperparameter very carefully, so that we cover the whole range. For example, to sweep the learning rate alpha, we sample it on a log scale from 0.0001 (10^-4) to 1 (10^0), i.e. in steps of x10 (10^-4, then 10^-3, then 10^-2, then 10^-1 and finally 10^0), rather than uniformly on a linear scale. A sampling sketch follows below.
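Here is a minimal numpy sketch of sampling on a log scale (my own illustration, not the course code; the same trick from the lectures is shown for the momentum parameter beta via 1-beta):

    import numpy as np

    # Sample the learning rate alpha uniformly on a log scale between 10^-4 and 10^0.
    # Sampling the exponent uniformly gives equal coverage to each decade
    # (0.0001-0.001, 0.001-0.01, ...), which a plain uniform sample over [0.0001, 1] would not.
    r = np.random.uniform(low=-4, high=0)     # exponent drawn uniformly from [-4, 0]
    alpha = 10 ** r                           # learning rate sampled on the log scale

    # beta for momentum is sampled the same way, via (1 - beta):
    # beta in [0.9, 0.999]  <=>  1 - beta in [0.001, 0.1]
    r_beta = np.random.uniform(low=-3, high=-1)
    beta = 1 - 10 ** r_beta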
There are 2 approaches to hyper parameter tuning:
1. Caviar approach: We use this if we have a lot of computing resources available. We train many NNs in parallel with different hyperparameter settings, and see which ones work. The analogy is to fish that lay huge numbers of eggs (caviar): they produce many offspring and simply let the best ones survive.
2. Panda approach: Here, we run just one NN model with a set of hyperparameters. As time passes, we tune the hyperparameters, see whether they make the NN's performance better or worse, and keep adjusting them every day or so. So here we babysit just one model, similar to how pandas raise their babies: they don't produce many, but watch over their one baby with all their effort to make it stronger.
Batch Normalization:
Here, we normalize inputs to speed up our NN. We subtract the mean from the inputs and then divide by their standard deviation (the square root of the variance). That way the inputs are more uniformly distributed around a center, which makes our cost function contours more symmetric, resulting in faster convergence when finding the minimum.
For a deep NN, we can normalize the inputs to each layer as well. The input to the next layer is the output of the activation function, a[l]. However, instead of normalizing a[l], in practice we normalize Z[l]:
μ = (1/m) * Σ Z[i]
σ² = (1/m) * Σ (Z[i] - μ)²
Znorm[i] = (Z[i] - μ) / √(σ² + ε)
Now instead of using Z[i] in our previous NN eqn, we use Znorm[i] which is the normalized value.
If we want to be more flexible in how we use Z[i], we can define learnable parameters gamma and beta, which allow the model to choose either the raw Z[i], the normalized Znorm[i], or any intermediate value. This is achieved by defining a new variable Z˜[i] (Z tilde):
Ztilde[i] = γ*Znorm[i] + β => by changing the values of gamma and beta, we can get any Ztilde[i]. For example, if γ=1 and β=0, then Ztilde[i] = Znorm[i]. If γ=√(σ²+ε) and β=μ, then Ztilde[i] = Z[i].
Since gamma and beta are learnable parameters (just like the weights), we don't have to worry about picking their optimal values: gradient descent chooses the values that give the lowest cost. Note that each layer has its own gamma and beta, so they are treated just like the weights of that layer; gd now learns γ[l] and β[l] on top of W[l] and b[l]. However, since normalization subtracts the mean, b[l] gets cancelled out, so we can omit it. So we have 3 sets of parameters to optimize for each layer l: γ[l], β[l] and W[l]. We can extend this to the mini-batch technique too, with all the gd algorithms like momentum, Adam, etc. A sketch of the forward step is shown below.
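Here is a minimal numpy sketch of the batch-norm forward step for one layer's Z[l], following the equations above (my own illustration, not the course code; a real implementation would also keep running averages of μ and σ² for use at test time):

    import numpy as np

    def batch_norm_forward(Z, gamma, beta, eps=1e-8):
        # Z has shape (units, m): one column per example in the mini-batch.
        # gamma, beta have shape (units, 1) and are learnable, just like W.
        mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the batch
        var = np.var(Z, axis=1, keepdims=True)        # per-unit variance over the batch
        Z_norm = (Z - mu) / np.sqrt(var + eps)        # zero mean, unit variance
        Z_tilde = gamma * Z_norm + beta               # let the network pick its own scale/shift
        return Z_tilde

    # example: a layer with 4 units and a mini-batch of 32 examples
    Z = np.random.randn(4, 32) * 3 + 5
    print(batch_norm_forward(Z, np.ones((4, 1)), np.zeros((4, 1))).mean(axis=1))  # ~0 per unit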
Batch norm works because it makes the NN computation more immune to covariate shift. The input data and all intermediate inputs stay normalized: the mean and variance of each layer's inputs remain stable (at the values set by γ and β), no matter how the earlier layers' parameters move during training. This keeps these values stable even if the input distribution shifts.
Multi Class classification:
Binary classification is what we have used so far, which classifies any picture into just 2 outcomes: cat vs non-cat. However, we can have multi-class classification, where the o/p layer produces multiple outputs, e.g. whether the picture is a cat, dog, cow or horse (known as 4-class classification). It outputs the final probability of each of the classes, and the sum of these probabilities is 1.
Here the o/p layer L, instead of generating 1 output, generates multiple output values, one for each class. So the output Z[L], instead of being a 1x1 matrix as in binary classification, is now a Cx1 matrix, where C is the number of classes to classify. Previously the activation function for the o/p layer, a[L], was the sigmoid function, which worked well for binary classification. However, with multi-class classification we need a different activation function for the o/p layer: we choose each unit's exponent normalized by the sum of exponents.
For 2-class classification, we use the sigmoid func:
Sigmoid function σ(z) = 1/(1+e^-z) = e^z/(1+e^z)
prob for being in class 0 = yhat = σ(z) and
prob for being in class 1 (not in class 0, or class=others) = 1 - yhat = 1 - σ(z) = 1/(1+e^z)
We generalize the above eqns to C classes. We use the exponent func in the o/p layer (also called the softmax layer):
exponent func = e^zk / (e^z1 + e^z2 + ... + e^zC), where C is the number of classes and k is the kth class
prob for being in class 0 = yhat[0] = e^z1/(e^z1 + e^z2 + ... + e^zC)
prob for being in class 1 = yhat[1] = e^z2/(e^z1 + e^z2 + ... + e^zC)
...
prob for being in class C-1 = yhat[C-1] = e^zC/(e^z1 + e^z2 + ... + e^zC)
So the probabilities all add up to 1. The matrix a[L] (i.e. yhat) is a Cx1 matrix. A numpy sketch of softmax is shown below.
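Here is a small numpy sketch of the softmax computation (my own illustration; subtracting the max before exponentiating doesn't change the result but avoids overflow):

    import numpy as np

    def softmax(z):
        # z has shape (C, 1) for one example, or (C, m) for m examples
        e = np.exp(z - np.max(z, axis=0, keepdims=True))   # shift for numerical stability
        return e / np.sum(e, axis=0, keepdims=True)        # each column sums to 1

    z = np.array([[2.0], [1.0], [0.1], [-1.0]])            # C = 4 classes
    yhat = softmax(z)
    print(yhat, yhat.sum())                                # probabilities, sum = 1.0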
For C=2, multiclass reduces to binary classification. For implementation of multi class, the only difference in algo would be to compute o/p layer differently, and then do back prop.
For 2 classes, if we fix z2 = 0 (so that e^z2 = 1), then we get
prob for being in class 0 = yhat[0] = e^z1/(e^z1 + e^z2) = e^z1/(e^z1 + 1)
prob for being in class 1 = yhat[1] = e^z2/(e^z1 + e^z2) = 1/(e^z1 + 1)
This is exactly what we got by using our sigmoid function earlier. So the exponent func and the sigmoid func give the same result in the o/p layer, implying sigmoid is just a special case of the exponent (softmax) func.
NOTE: In binary classification, we had an extra step at the o/p which converted yhat to 0 or 1, depending on whether its value was greater than 0.5 or not; forcing a hard 0/1 output like that is called "hard max". Here in multi-class classification we don't have that extra step: we just stop once we get the probabilities of each class. This is called "softmax".
Logits: In multi-class classification, the computed vectors Z = [Z1, Z2 ... ZC] are called logits. The shape of the logits is (C, m), where C = number of classes and m = number of examples.
Labels: In multi-class classification, the given output vectors Y are called labels. Each label is one-hot, so it has C entries instead of just one. The shape of the labels is the same as that of the logits, i.e. (C, m).
Cost eqn: For multi-class classification, the loss function is a generalization of the binary one: for one example, L(yhat, y) = -Σ y_k * log(yhat_k), summed over the C classes. Since y is one-hot, only the term for the true class survives, i.e. L = -log(yhat of the true class). The cost is the average of this loss over the m examples. A numpy sketch is shown below.
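A small numpy sketch of this cost, assuming Y_hat is the softmax output and Y holds one-hot labels (my own illustration, not the assignment code):

    import numpy as np

    def softmax_cross_entropy_cost(Y_hat, Y):
        # Y_hat: softmax output, shape (C, m); Y: one-hot labels, shape (C, m).
        # Because Y is one-hot, the inner sum keeps only -log(prob of the true class).
        m = Y.shape[1]
        losses = -np.sum(Y * np.log(Y_hat + 1e-12), axis=0)   # loss per example
        return np.sum(losses) / m                             # average over the m examples

    # tiny example: 3 classes, 2 examples (first prediction good, second poor)
    Y = np.array([[1, 0], [0, 1], [0, 0]])
    Y_hat = np.array([[0.9, 0.2], [0.05, 0.3], [0.05, 0.5]])
    print(softmax_cross_entropy_cost(Y_hat, Y))               # ~0.65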
Programming Frameworks
Instead of writing all these NN functions ourselves (forward prop, back prop, Adam, gd, etc.), we can use NN frameworks, which provide all these functions for us. TensorFlow is one such framework. We'll use TensorFlow in Python for our exercises. You can get introductory material for TensorFlow, including installation, in the "python - tensorflow" section. Once you've completed that section, come back here.
Programming Assignment 1: here we have 2 parts. In the 1st part, we learn the basics of TensorFlow (TF), while in the 2nd part, we build a NN using TF.
Here's the link to the pgm assignment:
This project has 3 Python pgms that we need to understand.
A. tf_utils.py => this pgm defines the following functions, which are used in our NN model later:
- load_dataset() => It loads test and training data from h5 files, similar to the function used in section "1.2 - Neural Network basics - Assignment 1". The only difference is that Y is now a number from 0 to 5 (6 numbers), instead of being a binary number (either 0 or 1). This is because we are doing multi-class classification here. Each picture is a sign-language picture representing the number 0, 1, 2, 3, 4 or 5.
- Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number). X after flattening is 2D vector with shape = (12288, 1080), while Y after flattening is 2D vector with shape = (1, 1080)
- link to training set file: train_signs.h5
- Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number). X after flattening is 2D vector with shape = (12288, 120), while Y after flattening is 2D vector with shape = (1, 120)
- link to test set file: test_signs.h5
- random_mini_batches() => This creates a list of random mini-batches from the set X, Y. These mini-batches are shuffled, and each has the size specified in the argument.
- convert_to_one_hot(Y, C) => This returns a one-hot matrix for the given o/p vector Y and for "C" classes. A one-hot vector is needed for multi-class classification.
- predict(X, parameters) => Given i/p picture X, and optimized weights, it returns the prediction Yhat, i.e what number from 0 to 5 is the picture representing
- forward_propagation_for_predict(X, parameters) => Implements the forward propagation for the model. It returns Z3, which is the o/p of the last linear unit (before it feeds into the softmax function to yield a[3]).
B. improv_utils.py => this pgm is not used anywhere, so you can ignore it. It defines all the functions that are used in our NN model later: it has all the functions that are in tf_utils.py, as well as all the functions that we are going to define in test_cr2_wk3.py. So this pgm is basically the solution to the assignment, as all the functions that we are going to write later are already written here. You should not look at this pgm at all, nor should you use it (unless you want to check your work after writing your own functions).
C. test_cr2_wk3.py => Below is the whole pgm,
This pgm has 2 parts to it. In 1st part, we just explore TF library, while in 2nd part, we write the actual NN model using TF.
Part 1: This is where we explore the TF library. All i/p and o/p of these examples are Tensor data. NOTE: we don't use any of the functions below in the NN model that we build in part 2. This is just for practice.
- compute loss eqn: a simple loss eqn value is computed by creating a TF variable for the loss.
- multiply using constant: multiplying 2 constant numbers and printing result.
- multiply using placeholder: Here we feed value into placeholder at runtime, and compute 2*x.
- linear_function(): Here we compute Y=W*X+B, where W, X, B and Y are all Tensor vectors (i.e. matrices) of a predetermined shape.
- sigmoid(z): Given i/p z, compute the sigmoid of z.
- cost(logits, labels): This computes the cost using the TF func "tf.nn.sigmoid_cross_entropy_with_logits()", which calculates cost = -( y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) ). It returns a vector with 1 entry for each logits/label pair; with "m" examples, the mean is then taken over them. However, in the NN model that we build later, we'll be using "tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits())", which works for multi-class classification (see the sketch after this list). This func is explained in the Python-TF section.
- one_hot_matrix(labels, C): This returns a one-hot matrix for the given labels and for "C" classes (also shown in the sketch after this list).
- ones(shape) => creates a Tensor matrix of given shape, and initializes it with all 1.
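To make the two least obvious helpers concrete, here is a minimal TF 1.x sketch (same tf.Session style as the assignment; the variable names are my own) of building a one-hot matrix with tf.one_hot and computing the softmax cross-entropy cost:

    import numpy as np
    import tensorflow as tf   # TF 1.x API (tf.compat.v1 with eager disabled on TF 2)

    C = 6  # number of classes (digits 0-5)

    # one-hot: turn class indices into a (C, m) matrix with a single 1 per column
    labels = tf.constant(np.array([1, 2, 3, 0, 2, 1]))
    one_hot = tf.one_hot(labels, C, axis=0)

    # cost: softmax_cross_entropy_with_logits works on shape (m, C),
    # so the (C, m) logits/labels from our network are transposed first
    logits = tf.placeholder(tf.float32, shape=(C, None), name="Z3")
    y = tf.placeholder(tf.float32, shape=(C, None), name="Y")
    cost = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=tf.transpose(logits),
                                                labels=tf.transpose(y)))

    with tf.Session() as sess:
        print(sess.run(one_hot))                 # 6x6 one-hot matrix
        z3 = np.random.randn(C, 4)               # fake logits for 4 examples
        y_vals = np.eye(C)[:, [0, 3, 5, 1]]      # fake one-hot labels, shape (6, 4)
        print(sess.run(cost, feed_dict={logits: z3, y: y_vals}))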
Part 2: This is where we build a neural network using tensorflow. Our job here is to identify numbers 0 to 5 from sign language pictures. We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.
Below are the functions defined in our pgm for part 2:
- create_placeholders() => creates placeholders for i/p vector X and o/p vector Y
- initialize_parameters() => initializes w,b arrays. W is init with random numbers, while b is init with 0.
- forward_propagation(X, parameters) => Given X, W, b, this func calculates Z3 instead of A3 (Z3 is the output of the last NN layer, which feeds into the softmax (exponent) function).
- compute_cost(Z3, Y) => This computes cost (which is the log function of A3,Y). A3 is computed from Z3, and cost is calculated as per loss eqn for softmax func. We use following TF func for computing cost: tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...)). logits=Z3, while labels=Y.
- backward propagation and parameter update: This is done using the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)". This is explained in the TF section. The notes talk about the TF func "tf.train.GradientDescentOptimizer", but we use "tf.train.AdamOptimizer" for our exercise. This func is instantiated directly within the model, as it's a built-in func (not a user-defined func that we need to write ourselves).
- predict() => Given an input picture array X, it predicts Y (i.e. what number from 0 to 5 each picture represents). It uses the W, b calculated during optimization. We can provide a set of "n" pictures in a single array X (we don't need to provide each picture individually as an array). This is done for efficiency, as Prof Andrew explains multiple times in his courses.
- model() => This is the NN model that will be called in our pgm. We provide both the training and test pictures as 2 big arrays as i/p to this func. This model has 2 parts: first it defines the computation graph (by calling the functions above), and then it runs them inside a session (see the condensed sketch after this walkthrough). These are the 2 parts:
- Define the computation graph by calling the functions above:
- defines func create_placeholders() for X,Y.
- defines func initialize_parameters() to init w,b randomly
- Then it defines forward_propagation() to compute Z3
- Then it defines compute_cost() to compute the total cost given Z3 and the o/p labels Y.
- Then it defines the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" to do backward propagation and update the parameters for 1 iteration, optimizing W, b to give the lowest cost across the training set.
- defines an init func "tf.global_variables_initializer()". This is needed to init all variables. See the TF section for details.
- Now it creates a session, forms a loop, and calls the above functions
- Start the session. Inside the session, run these steps:
- Run the init func defined above => "tf.global_variables_initializer()".
- Make a loop and iterate the steps below "num_of_epoch" times. It's set to 1500. We will also change it to 10,000 and see the impact on accuracy.
- Form minibatches of X,Y and shuffle them
- iterate thru each minibatch
- call the two funcs "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" and "compute_cost" for each mini-batch. Strictly, we don't need to explicitly fetch "compute_cost", since running "minimize(cost)" evaluates the cost anyway; the reason we do it is to get the "cost" value returned by "compute_cost" for our plotting of cost vs iterations.
- We accumulate the cost from each mini-batch and divide by the number of mini-batches, to get the average cost for the epoch.
- Now plot "total_cost" vs "number of iterations". This shows how the cost goes down as we iterate more and more.
- Now it runs the "parameters" node again to get the values of the parameters. NOTE: running "parameters" again doesn't run the func "initialize_parameters()" again; it just returns the computed values for that node.
- It then calls TF functions to calculate the prediction and accuracy for all examples in the test and training sets. Accuracy is then reported by comparing what the pictures actually were vs what our pgm predicted.
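To tie the walkthrough above together, here is a condensed TF 1.x sketch of how I understand the pieces fit (my own reconstruction, not the assignment file itself; shapes follow the 12288 -> 25 -> 12 -> 6 architecture, and it assumes X_train, Y_train and random_mini_batches() are already available from tf_utils.py as described above):

    import numpy as np
    import tensorflow as tf   # TF 1.x session style, as used in the assignment

    def initialize_parameters():
        # Xavier init for weights, zeros for biases (layer sizes 12288 -> 25 -> 12 -> 6)
        xavier = tf.contrib.layers.xavier_initializer(seed=1)
        zeros = tf.zeros_initializer()
        return {"W1": tf.get_variable("W1", [25, 12288], initializer=xavier),
                "b1": tf.get_variable("b1", [25, 1], initializer=zeros),
                "W2": tf.get_variable("W2", [12, 25], initializer=xavier),
                "b2": tf.get_variable("b2", [12, 1], initializer=zeros),
                "W3": tf.get_variable("W3", [6, 12], initializer=xavier),
                "b3": tf.get_variable("b3", [6, 1], initializer=zeros)}

    def forward_propagation(X, p):
        # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR (softmax is folded into the cost)
        A1 = tf.nn.relu(tf.matmul(p["W1"], X) + p["b1"])
        A2 = tf.nn.relu(tf.matmul(p["W2"], A1) + p["b2"])
        return tf.matmul(p["W3"], A2) + p["b3"]           # Z3

    def compute_cost(Z3, Y):
        return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            logits=tf.transpose(Z3), labels=tf.transpose(Y)))

    # --- define the graph ---
    num_epochs, minibatch_size = 1500, 32
    X = tf.placeholder(tf.float32, shape=(12288, None), name="X")
    Y = tf.placeholder(tf.float32, shape=(6, None), name="Y")
    parameters = initialize_parameters()
    Z3 = forward_propagation(X, parameters)
    cost = compute_cost(Z3, Y)
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cost)
    init = tf.global_variables_initializer()

    # --- run it inside a session ---
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(num_epochs):
            epoch_cost = 0.
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size)
            for mb_X, mb_Y in minibatches:
                # one Adam step; fetch the cost too so it can be plotted per epoch
                _, mb_cost = sess.run([optimizer, cost], feed_dict={X: mb_X, Y: mb_Y})
                epoch_cost += mb_cost / len(minibatches)
        trained_parameters = sess.run(parameters)         # pull the learned W, b out of the graph
        correct = tf.equal(tf.argmax(Z3), tf.argmax(Y))   # argmax over the class axis
        accuracy = tf.reduce_mean(tf.cast(correct, "float"))
        print("Train accuracy:", accuracy.eval({X: X_train, Y: Y_train}))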
Below is the explanation of main code (after we have defined our functions as above):
- We get our dataset X, Y by calling load_dataset().
- Next we can enter index of any picture, and it will show the corresponding picture for our training and test set. This is for our own understanding. Once we have seen a few pictures, we can enter "N" and the pgm will continue.
- Now we flatten the returned arrays and normalize them. We also use the "one_hot" function to convert the labels from a single entry to a one-hot entry, since our labels need to be in one-hot format for our softmax func to work (see the sketch after this list).
- Now we call our function model() defined above. We provide the X, Y training and testing arrays (which are not Tensors, but numpy arrays). We see that these numpy arrays are used as Tensor i/p to many of the functions above; I guess it still works because the conversion from numpy to Tensors takes place automatically when needed.
- In the above exercise, we used a 3-layer NN, with a fixed number of hidden units for each layer. We ran it and got the results below.
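For reference, the flatten/normalize/one-hot step from the list above looks roughly like this (assuming the *_orig arrays come from load_dataset() and convert_to_one_hot() comes from tf_utils.py):

    # flatten each 64x64x3 picture into a column vector, scale pixels to [0, 1],
    # then one-hot encode the labels for the softmax cost
    X_train = X_train_orig.reshape(X_train_orig.shape[0], -1).T / 255.   # (12288, 1080)
    X_test = X_test_orig.reshape(X_test_orig.shape[0], -1).T / 255.      # (12288, 120)
    Y_train = convert_to_one_hot(Y_train_orig, 6)                        # (6, 1080)
    Y_test = convert_to_one_hot(Y_test_orig, 6)                          # (6, 120)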
Results:
On running above pgm, we see these results:
- On running the above model with 1500 iterations, we get a training accuracy of 70%.
- Cost after epoch 0: 1.913693
- Cost after epoch 100: 1.049044 .... => If you get "Cost after epoch 100: 1.698457", that means you are still using "tf.train.GradientDescentOptimizer". Switch to "tf.train.AdamOptimizer".
- Cost after epoch 1400: 0.053902
- When we increase the number of iterations to 10,000, our training accuracy goes to 89%. See how cost keeps on going down and then kind of flattens out.
- Cost after epoch 1400: 0.053902 ...
- Cost after epoch 2500: 0.002372
- Cost after epoch 5000: 0.000091
- Cost after epoch 9900: 0.000003
Programming Assignment 2: This is my own programming assignment; it's not part of the lecture series. Here, I took an example from one of the earlier programming assignments and rewrote it using TF, to see if I could do that. It does work, but I'm not sure if everything is working correctly (as the cost is different from the previous assignment, and there is no easy way to verify the accuracy).
test_cr2_wk2_tf.py => Below is the whole pgm. This pgm is copied from the course 2 week 2 pgm => course2/week2/test_cr2_wk2.py. We write the same pgm with TensorFlow functions now. We implement it for batch gd only (not the other variants).
We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.
Even though this is a binary classifier, we still use a softmax implementation, as binary is a special case of softmax with number of classes = 2. All the functions that we defined in Pgm assignment 1 are the same here. The only diff is in the initialize_parameters() func and the model() func. The differences are explained below:
- initialize_parameters() => here we allow args to be passed for the "number of hidden units" of each layer, so that we can keep it consistent with our "course 2 week 2 pgm". That also allows us to play around with a different number of hidden units per layer and observe the impact. The default layer sizes are [2, 25, 12, 2], where the 1st entry (2) is for the input layer.
- model() => Here, the definition of the functions is the same as in assignment 1. These are the few differences:
- Test set: Since we have only training set in this example, we don't have arg for "test_set" in model() func. model() func is copied from course2 week 2 and is modified wherever needed to work for TF.
- optimizer: We call "tf.train.GradientDescentOptimizer" instead of the Adam optimizer (we could try both). This is just to keep it consistent with the "course 2 week 2" pgm.
- cost_avg: One other diff is that we don't compute "cost_avg" by dividing by "m", as we already average when we divide by "mini_batch_size" within the loop.
- All other parts of model() are the same, except that we don't evaluate test accuracy (since there's no test set).
Now we run the main pgm code the same way as in assignment 1. These are the differences:
- We load the red/blue dataset (by using diff func load_dataset_rb_dots()). This is needed since the dataset here is different and is created by writing python cmds.
- We convert Y label to 1 hot. This is needed for softmax function as explained earlier. We convert Y = [ 0 1 0] into Y(one_hot) = [ [1 0] [0 1] [1 0] ] where 0=red, 1=blue
- We now call model() with desired number of hidden units, and it gives us the prediction accuracy.
Results:
This is the result we get (with the default settings we have in our pgm):
Cost after epoch 0: 0.051880
Cost after epoch 1000: 0.038114
Cost after epoch 2000: 0.030764
Cost after epoch 3000: 0.027093
Cost after epoch 4000: 0.025386
Cost after epoch 5000: 0.022814
Cost after epoch 6000: 0.021766
Cost after epoch 7000: 0.021067
Cost after epoch 8000: 0.019954
Cost after epoch 9000: 0.019063
Parameters have been trained!
Train Accuracy: 0.9166667
Summary:
Here we built a 3-layer NN using TensorFlow. TF is not easy or intuitive, so I'm lost too on why some things work with tensors, some with numpy, what running a session does, and so on. But eventually it did work. The main takeaway is that multi-class classification worked just as easily as binary classification, and got us ~90% accuracy when trained for long enough. Our optional second assignment helped us see how we can transform a regular NN pgm written using numpy into a TF NN pgm.