Various quotes here from all over the place:

Ancient Chinese Quotes on Youtube: https://www.youtube.com/watch?v=8vOojviWwRk

Three things never come back = Time, words and opportunity. So, never waste your time, choose your words, and never miss an opportunity - Confucius

There are 1000 lessons in a defeat, but only 1 in a victory - Confucius

Learn as if you are constantly lacking in knowledge, and as if you are constantly afraid of losing your knowledge - Confucius

Nature does not hurry, yet everything is accomplished - Lao Tzu

The one who asks a question is a fool for a moment but the one who never asks any is a fool for life - unknown

“If you only do what you can do, you will never be more than who you are” - Master Shifu

Strong minds discuss ideas, average minds discuss events, weak minds discuss people - Socrates

Smart people learn from everything and everyone, average people learn from their experiences, stupid people already have all their answers - Socrates

The only true wisdom is in knowing you know nothing - Socrates

The right question is usually more important than the right answer - Plato

The person who says he knows what he thinks but cannot express it usually does not know what he thinks. — Mortimer Adler

 Courage is the ability to go from one failure to another without losing enthusiasm - Churchill

 There are two ways to conquer and enslave a nation. One is by the sword. The other is by debt. – John Adams 1826

As our circle of knowledge expands, so does the circumference of darkness surrounding it. ― Albert Einstein

 As areas of knowledge grow, so too do the perimeters of ignorance - Neil deGrasse Tyson

There is no greater education than one that is self-driven. — Neil deGrasse Tyson

The empire, long divided, must unite; long united, must divide -- from the historical novel Romance of the Three Kingdoms. It simply states that unity follows division and division follows unity; one is bound to be replaced by the other after a long span of time. This is the way with things in the world.

Difference between understanding and knowing - understanding is more important and thus the goal of learning.
 
"work expands so as to fill the time available for its completion" => Parkinson's Law
 
“EVERYONE IS A GENIUS! But if you judge a fish by its ability to climb a tree it will live its whole life believing that it is stupid.” — Albert Einstein.
 
“Education is not the learning of facts, but the training of the mind to think.”— Albert Einstein.
 
"We are better understood as a collection of minds in a single body than as having one mind per body" - Unknown. The gist of this is that when we are able to accept other people's view by keeping our minds open, we are going to go a long way towards being accepted and having constructive arguments.
 
"If you wish to make an apple pie from scratch, you must first invent the universe" - Carl Sagan
 
 Puzzle: Rich people need it. Poor people have it. If you eat it, you die. And when you die, you take it with you. What is it? => NOTHING

 A Woman's Loyalty is judged when the man has nothing And The Man's Loyalty is judged when he has everything => Somebody

 “Knowledge is not power. Knowledge is potential power. Execution of your knowledge is power” =>   Tony Robbins

"The only true wisdom is in knowing that you know nothing" => unknown

"If I owe you $1K, I've a problem, but if I owe you $1M, yo have a problem" => old saying

So many people spend their health gaining wealth, and then have to spend their wealth to regain their health. => unknown

 Taste success once, come once more => movie "83"

“What I cannot create, I do not understand.” – Richard Feynman

The best things in life we don’t choose — they choose us => Unknown

“All ideas are second-hand, consciously and unconsciously drawn from a million outside sources.” => Mark Twain

"Great genial power, one would almost say, consists in not being original at all; in being altogether receptive; in letting the world do all, and suffering the spirit of the hour to pass unobstructed through the mind." => Ralph Waldo Emerson

Optimism is a Force Multiplier => Steve Ballmer's speech

“The stock market is a device for transferring money from the impatient to the patient” - Warren Buffett

"Innovation is taking two things that exist and putting them together in a new way" - Tom Freston

“The best time to plant a tree was 20 years ago. The second best time is now.” - Unknown

"The world is not driven by greed. It's driven by envy." - Charles T Munger (Warren Buffett's partner and a billionaire)

"The best thing a human being can do is to help another human being know more." - Charles T Munger 

If you don't find a way to make money while you sleep, you are going to work until you die - Warren Buffett

Three things that can make a smart person go broke => Liquor, Ladies and Leverage (the 3 L's LLL)  => Charlie Munger

Duniya mein itna gum hai, mera gum to kitna kum hai (There's so much sorrow in the world, that my sorrow is negligible) => Hindi Movie song

 "If you are not willing to learn, no one can help you. If you are determined to learn, no one can stop you." =>unknown

 There are only three ways a smart person can go broke: liquor, ladies and leverage => Charlie Munger (told by Warren Buffett in a shareholder meeting)

“My game in life was always to avoid all standard ways of failing, You teach me the wrong way to play poker and I will avoid it. You teach me the wrong way to do something else, I will avoid it. And, of course, I’ve avoided a lot, because I’m so cautious.” - Charlie Munger

The easiest person to fool is yourself => Richard Feynman (American Physicist)

Why is pizza made round, then packed in a square box but eaten as a triangle? => Unknown

 

 

Practical Aspects of Deep Learning: Course 2 - Week 1

This course goes over how to choose various parameters for your NN. Designing a NN is a very iterative process. We have to decide on the number of layers, the number of hidden units in each layer, the learning rate, what activation function to use for each layer, etc. Depending on the field or application where the NN is being applied, these choices may vary a lot. The only way to find out what works is to try a lot of possible combinations and see what works best.

We looked at how a dataset in ML is typically divided into a training set and a test set. We also have a dev set, which we use to try out our various NN implementations; once we narrow it down to a couple of NNs that work best, we try those on the test set to finally pick one. With large datasets, the training set is usually ~99% of all data, while the dev and test sets are each small, at 1% or less.
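As a rough illustration (not part of the course code), here is a minimal sketch of such a split, assuming inputs shaped (n_x, m) and labels shaped (1, m); the function name and fractions are made up for the example:

import numpy as np

def split_dataset(X, Y, dev_frac=0.01, test_frac=0.01, seed=0):
    """X: (n_x, m), Y: (1, m). Returns (train), (dev), (test) splits."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                       # shuffle the example indices
    n_dev, n_test = int(m * dev_frac), int(m * test_frac)
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return ((X[:, train_idx], Y[:, train_idx]),
            (X[:, dev_idx], Y[:, dev_idx]),
            (X[:, test_idx], Y[:, test_idx]))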

Bias and variance:

Underfitting: High Bias: Here the training data doesn't fit well with our ML implementation. The training set error is high, and the dev set error is equally high. To resolve underfitting, we need a bigger NN (e.g. more layers or more hidden units), so that we can fit the training data better.

Overfitting: High variance: Here the training data fits too well with our implementation. The training set error is low, but the dev set error is high. To resolve overfitting, we use more training data or use regularization schemes (discussed later).

Right fitting: Here data is neither under fitting nor over fitting.

Ideally we want low bias and low variance, implying the training set error is low and the dev set error is also low. The worst case is high bias and high variance, implying the training set error is high and the dev set error is even higher, so our ML implementation did badly everywhere. We address high bias and high variance by selecting our ML implementation carefully and then deploying additional tactics to reduce bias and variance.

In the small-data era, we used to trade off between bias and variance, as improving one worsened the other. But in the big-data era, we can reduce both: bias can be reduced by making the network bigger (e.g. adding more layers), while variance can be reduced by adding more training data.
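A tiny sketch of this diagnosis, purely illustrative (the target_err threshold and the function name are assumptions, not from the course):

def diagnose(train_err, dev_err, target_err=0.01):
    """Crude bias/variance diagnosis from train/dev error, given an assumed acceptable error level."""
    high_bias = train_err > target_err                    # underfitting: train error itself is high
    high_variance = (dev_err - train_err) > target_err    # overfitting: large train -> dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias (try a bigger network / train longer)"
    if high_variance:
        return "high variance (try more data / regularization)"
    return "low bias, low variance"

print(diagnose(train_err=0.15, dev_err=0.16))    # -> high bias
print(diagnose(train_err=0.005, dev_err=0.09))   # -> high variance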

Regularization: 

This is a technique used to reduce the problem of over fitting or high variance. The basic way we prevent over fitting is by spreading out weights, so that we don't allow over reliance on only a small set of weights. This makes our data fit less accurately, and by doing that it prevents over fitting. There are many techniques used to achieve this. Below are 2 of such techniques.

A. L1/L2 regularization:

This is done by pushing the overall weight values closer to 0, so that they have less of an impact. You can think of the new NN with lower weights as a reduced NN, where some of the weight terms have effectively vanished. Another way to see it: with weights close to 0, activation functions like sigmoid and tanh stay in the linear region of their curve, so the whole NN behaves more like a linear network where we are just adding up the linear portions of all activation functions. That is close to logistic regression, which is essentially a single-layer network with a linear pre-activation.

To achieve regularization, we add the sum of the weights to the cost term, and try to minimize the new cost (including the weight terms). The cost-minimization will then also try to keep the weights low, so that the overall sum of weights remains low. There are 2 types of regularization:

L1 Regularization: Here we add the modulus (absolute value) of the weights to the cost function:

For Logistic Regression: J(w,b) = 1/m * ∑ L(...) + λ/(2m) * ∑ |w_i| = 1/m * ∑ L(...) + λ/(2m) * ||w||_1, where the sum runs over all inputs (i=1 to i=nx).

For an L layer NN: Here w is a matrix for each layer. The regularization term added is λ/(2m) ∑ ||w[l]||_1, where we sum over all layers (layer 1 to layer L), adding all the weight terms in each layer's matrix, i.e.

||w[l]||_1 = ∑_i ∑_j |w[l]_i,j|, where i=1 to n[l-1], j=1 to n[l] => all terms of the matrix are added together (in an L layer NN, dim of w[l] is (n[l], n[l-1]))

L2 Regularization: Here we add the squares of the weights to the cost function:

For Logistic Regression: J(w,b) = 1/m * ∑ L(...) + λ/(2m) * ∑ w_i^2 = 1/m * ∑ L(...) + λ/(2m) * ||w||_2^2, where ||w||_2^2 = w^T.w, summed over all inputs (i=1 to i=nx).

For an L layer NN: This is the same as for L1 regularization, except that we square each weight term. The regularization term added is λ/(2m) ∑ ||w[l]||_F^2, where we sum over all layers (layer 1 to layer L), adding the squares of all weight terms in each layer's matrix, i.e.

||w[l]||_F^2 = ∑_i ∑_j (w[l]_i,j)^2, where i=1 to n[l-1], j=1 to n[l] => all terms of the matrix are squared and then added together. For historical reasons this is called the Frobenius norm rather than the L2 norm; the term L2 norm is used when dealing with a single vector, as in Logistic Regression.

When calculating dw[l] (i.e. dJ/dw[l]) for an L layer NN, we need to differentiate this extra term too, which adds an extra term λ/m * w[l]. Then when updating w[l] = w[l] - α.dw[l], we now have this extra term: w[l] = w[l] - α.(dw[l] + λ/m * w[l]), where dw[l] refers to the original dw[l] that was there before regularization.

So, the new w[l] = (1 - α.λ/m).w[l] - α.dw[l] => the update keeps the same form as before, except that w first gets multiplied by the factor (1 - α.λ/m). Since this factor is less than 1, the weights are shrunk from their original values. This is why L2 regularization is also called "weight decay": adding L2 regularization effectively decays the weights.

λ is called the regularization parameter. It's another hyperparameter that needs to be tuned to see what works best for a given NN. Since lambda is a reserved keyword in python, we use lambd as the variable name for it.

NOTE: in both cases above, we don't include "b" in the sum (i.e. we don't add λ/(2m).b or λ/(2m).b^2), as it has negligible impact on reducing overfitting.
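A minimal numpy sketch of adding the L2 term to the cost and the gradients, assuming dictionaries of weights W1..WL and gradients dW1..dWL (the function and dictionary names are made up for illustration):

import numpy as np

def l2_cost_and_grads(AL, Y, weights, grads, lambd):
    """
    AL: final activations (1, m); Y: labels (1, m)
    weights: {"W1": ..., "W2": ...}; grads: {"dW1": ..., "dW2": ...}
    Adds λ/(2m) * sum of squared weights to the cross-entropy cost,
    and λ/m * W[l] to each dW[l] (the weight-decay term).
    """
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights.values())
    cost = cross_entropy + l2_term
    reg_grads = {"d" + name: grads["d" + name] + (lambd / m) * W
                 for name, W in weights.items()}
    return cost, reg_grads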

B. Dropout Regularization:

Here, we achieve regularization by randomly dropping out hidden units (and hence their weights) on each iteration of cost optimization. This prevents the algorithm from relying too heavily on any single weight or set of weights, since those units may disappear at any time, during any iteration. The weights therefore end up more evenly spread out, reducing overfitting. It may seem like a hanky-panky kind of scheme, but it works well in practice.

Inverted Dropout: The standard, more effective implementation of dropout is inverted dropout, where we divide the surviving activation values by keep_prob, so that the expected activation values remain unchanged irrespective of how many hidden units we dropped.

NOTE: Dropout regularization is applied only on training data, NOT on test data. This makes sense: once the weights are finalized by training with dropout, we use all of those weights on the test data.
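A small sketch of an inverted-dropout forward step, under the assumption of an activation matrix A shaped (units, m); the helper name and keep_prob value are illustrative:

import numpy as np

def inverted_dropout_forward(A, keep_prob=0.8, training=True, seed=None):
    """Apply inverted dropout to an activation matrix A of shape (units, m)."""
    if not training:                           # at test time, use all units unchanged
        return A, None
    rng = np.random.default_rng(seed)
    D = rng.random(A.shape) < keep_prob        # mask: keep each unit with prob keep_prob
    A = A * D                                  # drop the masked-out units
    A = A / keep_prob                          # scale up so the expected activation is unchanged
    return A, D                                # D is reused in backprop to mask dA the same way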

C. Other Regularization:

1. Data augmentation: We generally achieve better regularization with more data. Instead of collecting more data, we can use the existing data to augment our training set, e.g. mirror images of pictures, zoomed-in pictures, rotated pictures, etc.

2. Early stopping: Here we stop the cost-optimization loop after a certain number of iterations, instead of letting it run for a very large number of iterations. This reduces overfitting. L2 regularization is usually preferred over early stopping, as you can mostly get the same or better variance with L2 regularization than with early stopping.

Normalize inputs:

We normalize the input vector x by subtracting the mean from it and dividing by the std deviation (the square root of the variance).

So, X_normal = (X_orig - µ) / σ, where the mean µ = 1/m * Σ X(i)_orig (summed over the m samples for each feature), and the std deviation σ = √( 1/m * Σ (X(i)_orig - µ)^2 ).

If there are 5 input features X1,...,X5, then we do this for each of the 5 features across all m examples. This helps because subtracting the mean centers each feature around the origin, and dividing by the std deviation scales each feature so that the data is scattered over a similar range in every dimension. This makes the input more symmetrical, so the cost optimization proceeds more smoothly and converges faster.
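A minimal sketch, assuming features along rows and examples along columns; note that the same µ and σ computed on the training set should also be applied to the test set:

import numpy as np

def normalize_inputs(X_train, X_test, eps=1e-8):
    """X: (n_x, m). Compute mu/sigma on the training set and reuse them on the test set."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    X_train_norm = (X_train - mu) / (sigma + eps)
    X_test_norm = (X_test - mu) / (sigma + eps)   # same mu/sigma, so both sets see the same transform
    return X_train_norm, X_test_norm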

Vanishing/Exploding gradients:

With very deep NNs, we have the problem of vanishing or exploding gradients, i.e. gradients become too small or too big. Prof Andrew shows with an example how the final output effectively involves the weight matrices raised to an exponent of roughly L. So values greater than 1 in the weight matrices start exploding, while values less than 1 start vanishing (they head towards 0). One way to partially address this is to initialize the weight matrices correctly.

Initializing Weight matrix:

For any layer l, the output Z = w1.x1 + ... + wn.xn. If the number of inputs n is large, we want the weights w1..wn to be small, so that Z doesn't become too large. So we scale each weight element by 1/√n (i.e. the variance of the weights is scaled by 1/n). This keeps the weight elements from getting too big. Initializing to "0" doesn't work, as it's not able to break symmetry.

For random initialization, we multiply as follows:

1. tanh activation function: For tanh, it's called Xavier init, and is done as follows: W[l] = np.random.randn(shape) * np.sqrt(1/n[l-1]). We use the size of layer l-1 instead of l, since we divide by the input-layer size, and the input size of layer l is n[l-1].

2. Relu activation function. For Relu, it's observed that np.sqrt(2/n[l-1]) works better.

3. Others: Many other variants can be used, and we'll have to just try and see what works best.
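The scaling from items 1 and 2 above can be wrapped into one helper, sketched below; the function name, layer_dims format and seed are assumptions for illustration:

import numpy as np

def initialize_weights(layer_dims, method="he", seed=3):
    """layer_dims: e.g. [n_x, n_h1, ..., n_y]. Returns dict of W[l], b[l]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        if method == "he":            # for ReLU layers: sqrt(2/n[l-1])
            scale = np.sqrt(2.0 / n_prev)
        elif method == "xavier":      # for tanh layers: sqrt(1/n[l-1])
            scale = np.sqrt(1.0 / n_prev)
        else:                         # plain small random values
            scale = 0.01
        params["W" + str(l)] = rng.standard_normal((n_cur, n_prev)) * scale
        params["b" + str(l)] = np.zeros((n_cur, 1))
    return params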

Gradient Checking:

Definition of the (two-sided) derivative of F at x: F'(x) = lim(ε→0) [F(x+ε) - F(x-ε)] / (2ε), where ε goes to 0 in the limiting case.

We use this definition to check the gradient, by comparing the value obtained with the equation above against the gradient calculated by our backprop formulas. If the difference is large (e.g. relative difference > 0.001), then we should doubt the dw and db gradients computed by the program.
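A small illustrative check on a toy function (f(x) = Σx^2, whose gradient is 2x); the helper names and the use of a relative-difference metric follow the idea above and are not taken from the assignment code:

import numpy as np

def numerical_grad(f, x, eps=1e-7):
    """Two-sided difference approximation of df/dx for a scalar function f of a vector x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus.flat[i] += eps
        x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

def grad_check(analytic, numeric):
    """Relative difference; a value much larger than ~1e-3 suggests a bug in the hand-coded gradient."""
    return np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))

f = lambda x: np.sum(x ** 2)
x = np.array([1.0, -2.0, 3.0])
print(grad_check(2 * x, numerical_grad(f, x)))   # tiny value => gradients agree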

 

Programming Assignment 1: here we learn how different initializations of the weight matrix result in totally different training accuracy. We apply 3 different init mechanisms to a 3 layer NN:

  • zero initialization: doesn't work, unable to break symmetry. Gives worst accuracy on training set
  • large random initialization: very large weights cause vanishing/exploding gradient problem, so gives poor accuracy on training set.
  • He initialization: This works well, as the weights are scaled down by √(2/n[l-1]) to keep the initial weights small, resulting in very high training accuracy

Here's the link to the pgm assignment:

Initialization(1).html

This project has 2 python pgm.

A. init_utils.py => this pgm defines various functions similar to what we used in previous assignments

init_utils.py

B. test_cr2_wk1_ex1.py => This pgm calls functions in init_utils. It does all 3 initialization as discussed above. We unknowingly did He initialization in previous week exercise.

test_cr2_wk1_ex1.py

 

Programming Assignment 2: here we use the same 3 layer NN as above. Now we apply different regularization techniques to see which works best. These are the 3 different regularization settings applied:

  • No regularization: here test accuracy is lower than training accuracy. This is due to overfitting. Gives high accuracy on training set, but low accuracy on test set
  • L2 regularization: here we apply L2 regularization, which results in lower accuracy on training set, but better accuracy on test set. The parameter lambda can be tuned to achieve higher smoothing or lower smoothing of fitting curve. Very high lambda can result in under fitting, resulting in high bias.
  • Dropout regularization: This works best as we get lower training accuracy, but highest test accuracy.

Here's the link to the pgm assignment:

Regularization_v2a.html

This project has 3 python pgm.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases.py

B. reg_utils.py => this pgm defines various functions similar to what we used in previous assignments.

reg_utils.py

C. test_cr2_wk1_ex2.py => This pgm calls functions in reg_utils. It does all 3 regularization settings discussed above (including no regularization)

test_cr2_wk1_ex2.py

 

Programming Assignment 3: here we employ the technique of gradient checking to find out if our back propagation is computing gradient correctly. This is an optional exercise that can be omitted, as it's not really needed in further AI courses.

Here's the link to the pgm assignment:

Gradient+Checking+v1.html

This project has 3 python pgm.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases.py

B. gc_utils.py => this pgm defines various functions similar to what we used in previous assignments.

gc_utils.py

C. test_cr2_wk1_ex3.py => This pgm calls functions in gc_utils. It does the gradient checking

test_cr2_wk1_ex3.py

 

Course 2 - Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

 

This course builds on deep NN. It covers various techniques to optimize our NN to predict better; in the absence of the right parameters, your NN may not even work. It is a course that can be finished at a good speed. It has multiple python exercises, which should be completed. It has 3 sections:

1. Practical Aspects of Deep Learning: This talks about how to adjust parameters like initialization values, and how to choose initial values that will make our NN work. Can be watched in fast mode. However, the 3 exercises should be finished; they don't take too much time.

2. Optimization algorithms: This goes over how to optimize the algorithm for finding the lowest cost. It talks about gradient descent (gd) techniques such as mini-batch gd, gd with momentum, gd with RMSprop, gd with Adam, and learning rate decay. There is a programming assignment to apply these various techniques on a NN and observe the impact.

3. Hyperparameter tuning, Batch Normalization and Programming frameworks: This introduces google's framework called TensorFlow where we write a program to classify sign digits from 0 to 5.

 

PT flow:

Any STA signoff tool needs to be run separately for all the timing corners after synthesis, on the final PnR netlist. This ensures that timing is met across all corners, including those that weren't checked in the synthesis or PnR flow. Here we detail the flow for running PrimeTime from Synopsys.

Ex Dir=> /project/Primetime/digtop

Setup:

Before running PT, we need to have .synopsys_pt.setup in this dir. This is similar to .synopsys_dc.setup that we use for DC. The only difference is that the link library now has *_CTS.db, since the gate level netlist has CTS cells. CTS cells are clock cells; they may be named differently in other libs. This file has the lib path, etc (look in "synthesis DC" for more details). We don't specify target_library and link_library in this file, but rather when running PT for each corner, as we have to use different .db files for different corners.
1. search_path: set to path where all .db files are located.
set search_path "$search_path /home/pdk/lib270/r3.0.0/synopsys/bin \
../../Memories/sram220x64/cad_models"
2. target_library: set to min corner library
set target_library MSL270_W_125_2.5_CORE.db => optional as it's specified again when running diff corners.
3. link_library (or link_path may also be used): set to mem and target library.
set link_library { * PML30_W_150_1.65_CORE.db PML30_W_150_1.65_CTS.db } => optional as it's specified again when running diff corners.

So, only step 1 is needed. Other steps not needed.

Invoking PT:

to run PT in cmd line, type

$ pt_shell -2011.12 => If we don't provide version number (-2011.06 or -2011.12 etc), it picks up default version, set by sysadmin. Once pt_shell comes up, from within PT shell, we can write PT cmds.
$ pt_shell> source file_name => executes scripts in file_name.

To execute a cmd script upon startup, start pt_shell as shown below. (In this case, the script runs automatically w/o any manual cmds from the user. This is the preferred way to run PT once you have debugged the script and it's working correctly.)
$ pt_shell -2011.12 -f file_name

GUI mode:

Apart from writing cmds or running script in pt_shell, you can also invoke gui from within PT. Once pt shell comes up, type below cmd

pt_shell> gui_start

Now we can select a cell, pin, wire etc on the schematic, and we'll see this cmd indicating our selection on the shell

pt_shell> change_selection [get_pins mod1/sync_2/out1] => This way it's a lot easier to get pin names, etc, instead of figuring them out by looking at the design

pt_shell> get_selection => This will show whatever was selected, i.e. for the above selection, it'll show "mod1/sync_2/out1"

pt_shell> report_timing -thr [get_selection] => This will report timing through the pins selected.

Running PT cmd script:

There are 2 PT scripts we run. One for running design across various PVT corners, and the other for generating SDF files. First let's look at run_pt_all script which runs STA across all corners. Next we'll run sdf generation script. sdf files are needed only for running gate level sims (GLS). So, if you don't plan to run GLS, then you can skip generation of SDF.

1. run_pt_all:  We run the "run_pt_all" script, which calls PT shell for all runs. PT is run across various PVT corners. For 130nm or higher tech nodes, running it across just the fastest and slowest PVT corners suffices. However, when you run timing for lower-node designs (below 130nm), we need to run PT across a lot more corners, as just the fastest and slowest PVT may not capture all timing paths in the design (due to large variations in transistors across the chip, some worst-case paths may show up in intermediate PVT corners). Generally PT is run for 2 cases based on functionality:

  1. Functional mode (No scan): This is the normal functional run of chip. Here "scan_mode" is set to 0.
  2. Scan mode (includes Scan Shift and Scan Capture):This is when the part is put in scan mode to test scan chains. Here "scan_mode" is set to 1.
  3. Scan_Vbox mode (optional): This runs scan mode on chip, the same as 2 above, but here we apply much larger voltage than the PVT max corner and much lower voltage than PVT min corner. These voltages are supposed to be bounding boxes for our PVT corner. We run these ultra high or ultra low voltages for scan mode only (not functional), as we want to see if the chip still functions in scan mode.

So PT is run for 6 cases: noscan (min/max), scan (min/max), scan_vbox (min/max). max refers to the max-delay lib being used for that run, while min refers to the min-delay lib. Below we show which reports are generated for each mode.


A. NO SCAN: scan_mode_in set to 0 (in case_analysis.tcl), so normal clks are used. If we don't set scan_mode to 0, there will be too many paths to analyze, since timing analysis would be run for both scan_mode=0 and scan_mode=1. So, we separate them out.


1. nonscan_max: max delay lib being used
rpts/digtop.max_timing_post_noscan.max.rpt => setup check with max PVT delay (W, 150C, 1.65V) and max interconnect delay (max.spef) slow corner (PCR=max)
rpts/digtop.min_timing_post_noscan.max.rpt => hold check with max PVT delay (W, 150C, 1.65V) and max interconnect delay (max.spef) slow corner (PCR=max)
rpts/digtop.post_noscan.max.rpt => comprehensive report combining both setup and hold checks for slow corner.


2. nonscan_min: min delay lib being used
rpts/digtop.max_timing_post_noscan.min.rpt => setup check with min PVT delay (S, -40C, 1.95V) and min interconnect delay (min.spef) fast corner (PCR=min)
rpts/digtop.min_timing_post_noscan.min.rpt => hold check with min PVT delay (S, -40C, 1.95V) and min interconnect delay (min.spef) fast corner (PCR=min)
rpts/digtop.post_noscan.min.rpt => comprehensive report combining both setup and hold checks for fast corner.

B. SCAN: scan_mode_in set to 1 (in scan.sdc). In scan, spi_clk used as clock for shift and capture. scan.sdc has clk_defn for spi_clk, case_analysis to set scan_mode to 1 and all IO delay set wrt scan_clk. It should not have any false paths as all of the digital logic is run by spi_clk. spi_clk is run at lower freq, and i/o delays are set wrt spi_clk.
IMP NOTE: the scan_enable pin IO delay should be matched to the real Tetramax delay. Otherwise, even if scan_en is not buffered appropriately and has large transition delays, it may still meet timing wrt the rising edge of clk; the violation would then not be captured here but may show up in Tetramax gate level sims. scan_en is the only IO pin that has a real timing path to CLK in the DUT. So, it should never be tied to 0 or 1 (by setting set_case_analysis), as that causes constant propagation, and we would not get a path with a rising/falling edge of the scan_en pin (it would be reported as an unconstrained path in PT). That may hide a real failure on this path.

1. scan_max: max delay lib being used

2. scan_min: min delay lib being used

C. VBOX: vbox tests are run during scan to see if the chip still functions (simple Iddq patterns are run to see if leakage is within limits). We choose just 2 corners: vbox_hi = strong transistor at high voltage (min delay), and vbox_lo = weak transistor at low voltage (max delay), to see if setup, hold, etc pass.
#NOTE: We mostly care about the vbox_hi rpts being clean, as that indicates there is enough hold slack (setup rpts will be clean at vbox_hi anyway, as it's run at a much faster corner). vbox_lo will mostly be clean for hold, but may have violations for setup as it's a very slow corner. In a nutshell, there should be no hold violations at either vbox_lo/hi, and only setup violations at vbox_lo (assuming the design is barely meeting setup timing).


1. scan_vbox_hi: digtop.post_scan.vbox_hi.rpt: high voltage vbox conditions with min PVT delay (S, 25C, 3.2V) and min interconnect delay (min.spef) fast corner at 25C (PCR=vbox_hi). Run at normal freq (12MHz)


2. scan_vbox_lo: digtop.post_scan.vbox_lo.rpt: low voltage vbox conditions with max PVT delay (W, 25C, 0.95V) and max interconnect delay (max.spef) slow corner at 25C (PCR=vbox_lo). Run at 1/2 the normal freq (here it's 6.25MHz, as high freq may not be supported at such low voltages)

details of run_pt_all: the run_pt_all script calls pt_shell 6 times, as shown above, for the 6 different corners. We only show the scripts for noscan_max (case A, bullet 1 above) and scan_max (case B, bullet 1 above). Similar scripts exist for noscan_min, scan_min, vbox_hi and vbox_lo.

All scripts below run these basic steps:

  1. Read Library: set target_lib and link_lib to appr PVT corner
  2. Read Netlist and SPEF: read verilog gate netlist (from PnR tool), and spef file (from QRC extraction for appr PVT corner).
  3. Read constraints: read constraints for clk,false_paths,case_analysis,etc (func has func.sdc while scan has scan.sdc)
  4. Report timing: report_timing for both max_delay(setup) and min_delay(hold)

no scan max script (check_timing_post_nonscan_max.tcl): Runs functional PT run for max delay corner (i.e worst PVT, that gives max delay). This is for case A, bullet 1 above.

pt_shell -2010.06 -f scripts/check_timing_post_nonscan_max.tcl | tee logs/run_pt_post_nonscan_max.log  => This script sources below 2 scripts:
source -echo scripts/import_post_max.tcl => This runs step 1 and 2 (Read lib, netlist and spef)
source -echo scripts/constrain_post_nonscan.tcl => This runs step 3 and 4 (Read constraints, report timing)

1. import_post_max.tcl => This script is called for 2 max corners above. We have similar import_post_min.tcl for 2 min corners above

#1. Read Library: Read max delay lib. For min corner, we read min delay lib.

set target_library { LIB_W_150_1.65_CORE.db LIB_W_150_1.65_CTS.db } => IMP: whenever the list continues onto a new line, the previous line should end with "\". If the opening/closing braces { } are not on the same line as the file names, each continued line must end with "\", else we get a linking error.
set link_library { * LIB_W_150_1.65_CORE.db LIB_W_150_1.65_CTS.db } => IMP: same as above. "\" should be used when starting new line, or we get linking error.

NOTE: target_library above refers to max delay library. This max delay lib is used for both setup/hold runs. If we want to use max delay lib for setup and min delay lib for hold, we should do this:
set_min_library LIB_W_150_1.65_CORE.db -min_version LIB_S_-40_1.95_CORE.db => specifies max lib, and corresponding min lib by using -min_version.
#For OCV runs, where we want to have min/max library for data/clk path for both setup/hold, in set_operating_condition, we should specify max and min libraries to use for ocv runs.

We can also use following 2 cmds instead of previous 2 cmds:
#set link_path "* LIB_W_150_1.65_CORE.db LIB_W_150_1.65_CTS.db"
#set default_lib "LIB_W_150_1.65_CORE.db LIB_W_150_1.65_CTS.db"

#To make sure, that when any link is unresolved, we get appr error, set the var below in .synopsys_pt.setup
set link_create_black_boxes false => this prevents PT from creating a blackbox for an unlinked reference and declaring linking as successful

#2A. Read netlist: read gate netlist generated from PnR tool
read_verilog /db/.../final_pnr_files/digtop_final_route.v

set TOP "digtop"
current_design $TOP => working design for PT.
link => link design to resolve all references in design. shows all lib that are being used to link design. No module/gate should be unresolved here, as that would mean missing module/gate defn

#additional cmds for debug
list_libraries => lists all libraries and their paths
report_design => lists attr of current design, incl min/max op cond used, WLM, design rules, etc.
report_cell => This reports all cells in current design, and .lib they refer to. If current_design is set to DIG_TOP, then it only shows cells for DIG_TOP and NOT for sub-modules within it. This is helpful to find out which .lib is used for timing run (especially if multiple .lib have been loaded in memory)
report_cell "dig_top/cell1" => This reports cell1, it's ref, it's area and other attr.

report_reference => This shows .lib references for cells in current design

#2B. Read spef: read max spef file (generated from rc extractor from within PnR tool)
read_parasitics -format spef /db/.../final_pnr_files/digtop_qrc_max_coupled.spef

2. constrain_post_nonscan.tcl: This file contains all the constraints for functional mode runs. This sources functional sdc file.

#these below settings can be done in .synopsys_pt.setup too.
set timing_self_loops_no_skew true
set timing_disable_recovery_removal_checks "false"

current_design $TOP

#3A. Read constraints: We have 2 options for importing constraints in PT

#OPTION 1: we source individually all the constraints files that we used in DC (instead of using the file autogenerated by DC). We don't source env_constraints.tcl (that was used in DC) in PT, as we don't want the "set_operating_conditions" and wire load model directives in that file to be used in PT.
source -echo /db/Synthesis/digtop/tcl/clocks.tcl => clks defined (no scan clk in this)
source -echo /db/Synthesis/digtop/tcl/constraints.tcl => all i/o delays + environment specified.Use same values as used in synthesis. See sdc section (Synopsys (standard) design constraints) for details of these cmds. The units are not specified in below cmds, but are instead based on "set_units" cmd or cap units from the lib that was the last one to be loaded.

  • set_driving_cell => to set i/p driver cell. If driving cell is not set, then we need to set i/p tran time via: set_input_transition 100 [get_ports IN_*] => sets tran time to 100 units. In this case, it's 100ps as lib has "time_unit : "1ps".
  • set_load => to set specified load on ports and nets. We set it on ports only (as we don't want to specify our own cap on internal nets, we let tool calc cap on nets). -max or -min options specify max or min cap to be used in max or min corner runs (as we may not want to use same cap for both max and min runs, applicable only when running timing in max/min mode)
    • ex: set_load 5 [get_ports OUT_*] => sets load of 5 units on all ports OUT_*. Units are based on lib units loaded. In this case, it's 5 ff as lib has "capacitive_load_unit (1.000000, ff);" defined
    • report_port => this cmd used to to report cap + other attr on all ports (or specified port if port specified, i.e report_port [get_ports OUT_*]
  • set_input_delay / set_output_delay => sets i/p, o/p delay on ports


source -echo /db/Synthesis/digtop/tcl/false_paths.tcl => false paths
source -echo /db/Synthesis/digtop/tcl/multicycle_paths.tcl
source -echo /db/Synthesis/digtop/tcl/case_analysis.tcl => scan mode is set to 0. other flops set for func mode. case_analysis.tcl has following:

  • set_case_analysis 0 scan_mode_in => turn OFF scan mode.

#OPTION 2: Instead of sourcing all the above constraints files, we can use constraints.sdc file that is autogenerated by DC to get all the constraints.
source -echo /db/Synthesis/digtop/tcl/constraints.sdc => It has env constraints (pin loads, driving cells), clks, i/o delays, false paths, case_analysis, dont_touch, etc (basically all the constraints in option 1 above, plus the env constraints that option 1 deliberately left out). Since the env constraints are also present in this file, we get a section that has "set_operating_conditions" set for the 1 PVT corner for which we ran DC (it comes from the env_constraints.tcl file that was used for DC). So when running PT for other corners, we'll get errors like "Error: Nothing matched for lib (SEL-005)". So, we'll need to comment out that line when running PT with the autogenerated file, or comment out the whole env_constraints section.

#after applying path exceptions, do report_exceptions to see list of timing exceptions applied
report_exceptions => -ignored will show those cmds too that are fully ignored. This is important to do as it will tell us if any of the FP/MCP are getting applied or dropped due to syntax errors, path not existing, etc.

#3B. Other constraints: We set clk to propagated, and analysis type to ocv.

#propagate clks: clk is propagated here (In DC, we didn't use propagated clk, so clk was treated as ideal, that means even gate delays in clk path weren't included anywhere in timing reports. We treat clk as ideal in DC, because buffers are going to be inserted later during PnR, so we don't want DC fixing clk paths, with it's own buffers). With this setting, all gate+buffer delays included in clk delay, when running timing.
set_propagated_clock [all_clocks] => Very imp to set this, else timing reports will be all incorrect.

#We set the analysis type to OCV even when running in single mode (specifying only the max library for the max run, and only the min library for the min run). So, in reality it's not running ocv here, as we have only one lib loaded for all paths. OCV is necessary for PBTA (path based timing analysis, discussed later) to work.

set_operating_conditions -analysis_type on_chip_variation => This cmd explained in detail in "PT - OCV" section.

#4: Reports: Timing reports generated below. report_timing is the main cmd that causes PT engine to run timing.

#rpt file for setup (-delay max)
set rptfilename [format "%s%s" $mspd_rpt_path $TOP.max_timing_post_noscan.$PCR.rpt]
redirect $rptfilename {echo "digtop constrain_post_noscan.tcl run : [date]"}
redirect -append $rptfilename {report_timing -delay max -path full_clock_expanded -max_paths 100} => provides timing, most powerful cmd in PT.

#rpt file for hold (-delay min)
set rptfilename [format "%s%s" $mspd_rpt_path $TOP.min_timing_post_noscan.$PCR.rpt]
redirect $rptfilename {echo "chip constrain_post_noscan.tcl run : [date]"}
redirect -append $rptfilename {report_timing -delay min -path full_clock_expanded -max_paths 100}

#rpt file for all violations
set rptfilename [format "%s%s" $mspd_rpt_path $TOP.post_noscan.$PCR.rpt]
redirect $rptfilename {echo "digtop constrain_post_noscan.tcl run : [date]"}
redirect -append $rptfilename {report_clocks }
redirect -append $rptfilename {check_timing -verbose} => check_timing checks for constrain problems.
redirect -append $rptfilename {report_disable_timing} => You can also eliminate paths from timing consideration by using the set_disable_timing command. report_disable_timing reports such paths. It shows disabled timing arcs for all the cells. Most of them are "u=user-defined" paths. Spare cell paths are reported as "p=propagated constant" since all i/p pins are tied, so no paths exist. Some paths get reported with "c=case-analysis" since case analysis ties some pins.
ex:

Cell or Port                From    To      Sense                 Flag  Reason
--------------------------------------------------------------------------------
Mod1/req/u4             A1      ZN      positive_unate     C     A2 = 0 => Arc from pin A1 to pin ZN of cell u4 is disabled.

redirect -append $rptfilename {report_constraint -all_violators} => reports the results of constraint checking done by PrimeTime. -all_violators reports all violations incl setup, hold, max cap, max FO, max transition time (slew rate or transition time is measured 10/90 or 20/80 or whatever based on the characterized lib and slew derate factor), min pulse width, clk gating checks, recovery checks and removal checks. {report_constraint -all_vio -verb} gives verbose info about violations

#report_global_timing -group [get_path_group CLK*] => generates a top-level summary of the timing for the design. Here generates a report of violations in the current design for path groups whose name starts with 'clk'. If we run this in a for loop with all path_groups, then we get separate reports for each group.
ex: foreach_in_collection path_group [get_path_groups] { report_global_timing -group $path_group >> viol_summary.rpt } => reports timing for groups = **async_default**, **clock_gating_default**, **default**, CLK10, SYSCLK20, and other clocks in design.

#report_analysis_coverage => Generates a report about the coverage of timing checks
#report_analysis_coverage -status_details {untested} -check_type {setup} => Once we see coverage missing, we can get detailed report about status of untested, violated or met checks. Check types can be "setup, hold, recovery, removal, clock_gating_setup, clock_gating_hold, min_pulse_width, min_period, nochange".

#report_clock_timing -type summary -clock [get_clocks *] => lists clock timing info summary, which lists max/min of skew, latency and transition time over given clk n/w.
# report_clock_timing -type skew -setup -verbose -clock [get_clocks *] => This gives more detailed info about given clk attr (over here for skew). By default, the report displays the values of these attributes only at sink pins (that is, the clock pins of sequential devices) of the clock network. Use the -verbose option to display source-to-sink path traces.

#PBTA: pbta is path based timing analysis. By default, pba_mode is set to "none" => pbta is not applied (gba is applied). It's useful as PBA doesn't have pessimism of GBA, so we always use PBA though at expense of runtime.
#report_timing -pba_mode path => pba is applied to paths after collecting them, but the worst case may not be reported. It takes the worst-slack path and just recalculates it, but it's possible that the next-worst path right behind it doesn't get as much improvement from pba and so becomes the worst-case path, yet we never analyzed that next-worst path.
#report_timing -pba_mode exhaustive => provides worst case path after recalc. It looks at all the paths to a particular endpoint, and applies pba on each path to that endpoint. Path with worst slack after applying recalc to all paths to each endpoint is shown. so, the optimism inherent with "-pba_mode path" is not present anymore.

#rpt file for pbta setup (-delay max)
set rptfilename [format "%s%s" $mspd_rpt_path $TOP.pbta_max_timing_post_nonscan.$PCR.rpt]
redirect -append $rptfilename {report_timing -pba_mode exhaustive -crosstalk_delta -transition_time -delay max -path full_clock_expanded -nworst 10 -max_paths 500 -slack_lesser 1.0} => -crosstalk_delta reports delta delay and delta transition time, which were calculated during crosstalk SI analysis (provided PTSI is enabled). -transition_time reports transition time which is helpful to figure out nets which have very slow transition due to crosstalk.

#rpt file for pbta hold (-delay min)
set rptfilename [format "%s%s" $mspd_rpt_path $TOP.pbta_min_timing_post_nonscan.$PCR.rpt]
redirect -append $rptfilename {report_timing -pba_mode exhaustive -crosstalk_delta -transition_time -delay min -path full_clock_expanded -nworst 10 -max_paths 500 -slack_lesser 1.0}

scan max script (check_timing_post_scan_max.tcl): Runs the scan PT run for the max delay corner (i.e. worst PVT, which gives max delay). This is for case B, bullet 1 above.

pt_shell -2010.06 -f scripts/check_timing_post_scan_max.tcl | tee logs/run_pt_post_scan_max.log  => This script sources below 2 scripts:
source -echo scripts/import_post_max.tcl => This runs step 1 and 2 (Read lib, netlist and spef)
source -echo scripts/constrain_post_scan.tcl => This runs step 3 and 4 (Read constraints, report timing)

1. import_post_max.tcl => This script is same as what we used in func mode above

2. constrain_post_scan.tcl: This file is used for scan runs (and is different from the func script above). It sources the scan sdc file. No other constraints files are sourced when running scan, because the constraints for scan are totally different from the ones in func mode.
source -echo /db/.../scan.sdc
#scan.sdc has following:
create_clock -name spi_clk -period 66 -waveform { 0 33 } [get_ports {spi_clk}] => create clk for PT and specify its characteristics. rising edge at 0ns and falling edge at 33ns.
set_propagated_clock [get_clocks {spi_clk}]
set_case_analysis 1 [get_ports {scan_mode_in}] => turn ON scan mode.
set_input_delay 10 [all_inputs ] => i/p timing conditions
set_output_delay 10 [all_outputs ] => o/p timing requirements
set_dont_touch scan_inp_iso
set_driving_cell -lib_cell IV110 [all_inputs]
set_load 4.2 [all_outputs]

 

2. SDF generation script: Once we have run PT across all 6 corners as shown above, we use PT to generate SDF files. These are delay files that will be used in gate level simulations. If we don't plan to run GLS, then we don't need to run this step.

sdf file generation requires:

  • .lib files => for all cells/macros as they have cell delays, setup/hold timing checks, c2q arcs etc.
  • gate level verilog netlist => the netlist is needed so that all nodes of the netlist are appended with the appr cell delay and net delay. Nets which are not connected to anything are reported as driverless nets
  • spef file => has R,C info for all nets (no dly info for nets and cells)

NOTE: we don't require verilog model files for cells, macros, etc as we are just generating delay file. All arc info comes from .lib files. Delays for sdf are calculated from R,C in spef file and cell delay (with appr load for that cell coming from spef file) from .lib file.

we've 2 scripts for generating max and min sdf. Each net and instance in verilog is matched with a net parasitic in spef file, and a cell timing in .lib file. target library and link library are set to *.db (liberty files for gates) for that particular corner, just as we do in DC.


1. max sdf: (for max delay, so worst timing corner)
> pt_shell -2010.06 -f scripts/MaxSDF.tcl | tee logs/run_GenMaxSDF.log

#MaxSDF.tcl has the following: read .lib files, the verilog file and the max.spef file (generated from EDI). write_sdf writes the final sdf file; the other write_sdf options are needed to produce an aligned sdf, otherwise when we do sdf annotation, the sdf file arcs may not align with the verilog model file arcs. sdf file arcs come from .lib arcs, while during annotation we check them against the arcs in the verilog model files.

set target_library { PML48_W_125_1.35_COREL.db \
PML48_W_125_1.35_CTSL.db \
felb2x01024064040_W_125_1.35.db }
set link_library { * \
PML48_W_125_1.35_COREL.db \
PML48_W_125_1.35_CTSL.db \
felb2x01024064040_W_125_1.35.db }
echo $target_library

read_verilog /db/NIGHTWALKER/design1p0/HDL/FinalFiles/digtop/digtop_final_route.v
current_design digtop
link

read_parasitics -format spef /db/NIGHTWALKER/design1p0/HDL/FinalFiles/digtop/digtop_qrc_max_coupled.spef

#timing checks
check_timing -verbose => report shows all endpoints as unconstrained as no sdc file provided. OK
report_timing => report shows no constrained paths, as no clk provided. OK
report_annotated_parasitics -max_nets 150 -list_not_annotated => Provides a report of nets annotated with parasitics in the current design for both internal and port nets. -list_not_annotated lists nets that are not back annotated.
write_sdf -version 3.0 \ => SDF version 1.0, 2.1 or 3.0
-exclude {default_cell_delay_arcs} \ => specifies which timing values are to be excluded from the sdf file. default_cell_delay_arcs indicates that all default cell delay arcs are to be omitted from the SDF file if conditional delay arcs are present. If there are no conditional delay arcs, the default cell delay arcs are written to the SDF file. NOTE: This may be an issue when running gatesims, as verilog models will have default_delay_arcs, while those will be missing from sdf files, so annotation will have "missing annotation" warnings.
-include {SETUPHOLD RECREM} \ => SETUPHOLD:combine SETUP and HOLD constructs into SETUPHOLD. RECREM:combine RECOVERY and REMOVAL constructs into RECREM.
-context verilog \ => context for writing bus names for verilog, vhdl or none, so that [], () are not escaped.
-no_edge \ => SDF should not include any edges (posedge or negedge) for both comb and seq IOPATHs. It takes the worst of posedge/negedge values and assigns it to the delay arc
-no_negative_values {cell_delays net_delays} \ => Specifies a list of timing value types whose negative values are to be zeroed out when writing to the SDF file. Allowed values for timing values are timing checks, cell delays and net delays.
sdf/digtop_max.pt.sdf
quit

#look in logs/run_GenMaxSDF.log for errors.
A. look "report_annotated_parasitics" section detailed info. Here all internal nets (nets connected only to cell pins) and boundary/port nets (net connected to any I/O port of top level design, this should match the number of I/O ports in design) are reported. These nets are classified into pin to pin nets, driverless nets and loadless nets. All nets should connect from pin to pin, unless they are either floating o/p (loadless nets) or floating i/p (driverless nets). Nets which have no driver and no load are counted as driverless nets. Nets are everything reported as wires in digtop_final_route.v file. Any wire in this netlist that it's not able to find associated with any cell, it reports those as "floating nets". NOTE: parasitics from spef file are only annotated on nets, and not to cells. These parasites cause INTERCONNECT delay on nets, and eventually affect the loading on cells. Eventually cell arc from .lib file is used to find out real delay based on cell o/p pin load.
B. look warnings in "write_sdf" section. Most common one is "The sum of the setup and hold values in the cell 'soc_top/i2cBlk/i2cLink/sdaOut_reg' for the arc between pins 'CLK' and 'SCAN' is negative, which is not allowed. To make it positive, the minimum hold value has been adjusted from -0.613501 to -0.594033. (SDF-036)". This is because setup+hold should be > 0

2. Min sdf: for min delay, so fastest timing corner chosen.
> pt_shell -2010.06 -f scripts/MinSDF.tcl | tee logs/run_GenMinSDF.log

#MinSDF.tcl has following: only diff is min delay lib, and min spef chosen

set target_library { PML48_S_-40_1.65_COREL.db \
PML48_S_-40_1.65_CTSL.db \
felb2x01024064040_S_-40_1.65.db }
set link_library {* \
PML48_S_-40_1.65_COREL.db \
PML48_S_-40_1.65_CTSL.db \
felb2x01024064040_S_-40_1.65.db }

echo $target_library

read_verilog /db/NIGHTWALKER/design1p0/HDL/FinalFiles/digtop/digtop_final_route.v
current_design digtop
link

read_parasitics -format spef /db/NIGHTWALKER/design1p0/HDL/FinalFiles/digtop/digtop_qrc_min_coupled.spef

check_timing -verbose
report_timing
report_annotated_parasitics -max_nets 150 -list_not_annotated
write_sdf -version 3.0 \
-exclude {default_cell_delay_arcs} \
-include {SETUPHOLD RECREM} \
-context verilog \
-no_edge \
-no_negative_values {cell_delays net_delays} \
sdf/digtop_min.pt.sdf
quit

 

PT reports:
----------
PT reports timing for the clk and data path. The 1st section, "data arrival time", covers the data path from the start point, while the 2nd section, "data required time", covers the clk path of the endpoint. The 1st section shows the path from clk to data_out of the launching seq element and then thru the combinational logic all the way to data_in of the next seq element, while the 2nd section primarily shows the clk path of the capturing seq element, ending at its clk pin. In the 2nd section, it also shows the final "data check setup time" inferred from the .lib file for that cell.

reports are shown per stage. A stage consists of a cell together with its fanout net, so the transition time reported is at the i/p of the next cell. The delay shown is the combined delay from the i/p of the cell to its o/p, plus the net delay to the i/p of the next cell. "&" in the report indicates annotated parasitic data.

Ex: a typical path from one flop to other flop
Point Incr Path
------------------------------------------------------------------------------
clock clk_800k (rise edge) 1.00 1.00 => start point of 1st section
clock network delay (propagated) 3.41 4.41
.....
Imtr_b/itrip_latch_00/SZ (LAB10) 0.00 7.37 r
data arrival time 7.37
-----
clock clk_800k (rise edge) 101.00 101.00 => start point of 2nd section (usually starts at 1 clk cycle delay, 100 ns is the cycle time here)
clock network delay (propagated) 3.85 104.85
.....
data check setup time -0.04 105.76 => setup time implies wrt clk, data has to setup. So, we subtract setup time from .lib file to get data required time (as +ve setup time means data should come earlier)
data required time 105.76
------------------------------------------------------------------------------
data required time 105.76
data arrival time -7.37
------------------------------------------------------------------------------
slack (MET/VIOLATED) 98.39

Course 1 - week 3 - Shallow Neural Network:

This course introduces 2 layer Neural networks. NN was introduced in previous lecture, but it was mostly logistic regression. In logistic regression, we took a linear function f(x), assigned weights to various pixels, and computed if the picture can be classified as cat or not. It was single layer, as input X passed thru only one function f(x) = σ(w1*x1+w2*x2+...+wn*xn + b).

In a multi-layer NN, we pass input X thru 2 functions f(x) and g(x), which may be the same or different. If we choose f(x) as the function above, then f(x) returns a single value, and passing it thru another function g(x) doesn't give anything new, i.e. g(x) and f(x) could be combined into one function h(x). So, in the above example, we could combine the sigmoid function with g(x) to give a new function h(x)=g(σ(x)). This arrangement just means that instead of choosing sigmoid as the activation function, we chose some other function h(x) as the activation function. So, we merely replaced one function with another, and the 2 layer result could have been achieved with one layer.

What if we allow a combination of f(x) functions to get more curves on the surface that's trying to fit our data set (in case of cat picture, it's fitting our pixels better)? Let's try to make various combinations of f(x) as f1(x), f2(x), etc. Then we can combine these f1(x), f2(x), ... with varying weights and feed that combination to g(x).So, this is what it would look like:

f1(x) = σ(w11*x1+w12*x2+...+w1n*xn + b1)

f2(x) = σ(w21*x1+w22*x2+...+w2n*xn + b2)

..

fk(x) = σ(wk1*x1+wk2*x2+...+wkn*xn + bk)

Now, we define g(x) the same way as f(x), but now the inputs are the outputs of above functions. Here we assign weights to functions f1(x), f2(x), ... and pass it thru sigmoid func to get g(x)

g(x) = σ(v1*f1(x)+v2*f2(x)+...+vk*fk(x) + c)

It turns out that this gives a better fit than the logistic regression fit that we attained in the week 2 example. The reason is that g(x) in logistic regression was of the form g(x) = σ(v1*x1+v2*x2+...+vn*xn + c), but now instead of having x1,x2,... as its inputs, it has functions of x1,x2,... as its inputs (i.e. f1(x1,x2,...), f2(x1,x2,...), ...). This allows it to take more complicated shapes and fit the given data better.
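A minimal numpy sketch of this composition, with f1..fk stacked into one weight matrix; all names and shapes here are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# x: (n, 1) input column; W1: (k, n) stacks the weights of f1..fk; b1: (k, 1)
# v: (1, k) weights on f1(x)..fk(x); c: scalar bias of the output unit
def two_layer_output(x, W1, b1, v, c):
    F = sigmoid(W1 @ x + b1)          # [f1(x), ..., fk(x)] computed in one shot
    return sigmoid(v @ F + c)         # g(x) = σ(v1*f1 + ... + vk*fk + c)

rng = np.random.default_rng(0)
n, k = 3, 4
print(two_layer_output(rng.standard_normal((n, 1)),
                       rng.standard_normal((k, n)), np.zeros((k, 1)),
                       rng.standard_normal((1, k)), 0.0))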

2 Layer NN:

The above scheme is a 2 layer NN. It's called a shallow NN, as it has very few layers (in our example, only 2 layers). We can extend this concept from 2 layers to any number of layers, and surprisingly (or maybe not so surprisingly after all), the fit keeps getting better. This is because we have more and more degrees of freedom to play with to get a better fit. Logistic regression, no matter how many weights we give it, can't fit such data well, because a single linear function followed by a sigmoid can't generate the curved boundaries needed to fit the data.

Let's revisit the section on "Best Fit Function". There we saw that sigmoid functions can be linearly added and fed into a sigmoid function to generate complex shapes. We saw plots for 2 dimensional i/p (i.e x,y), but it can be generalized to any number of inputs. By using appr weights and adding sigmoid functions, we were able to generate complex shapes.

ReLU or any other non linear functions can also be used instead of sigmoid functions.

NOTE: One very important thing to keep in mind is that the weights W need to be initialized to random values, instead of being initialized to 0. The lecture explains why.

Programming Assignment 1: This is a simple 2 layer NN. It tries to predict if a given dot is red or blue, given its location coordinates (x,y). Since the shape is in the form of a flower, a 1 layer NN with its linear equation can never form a boundary that separates the blue and red petals (a linear eqn can't form a complex surface). Only a 2 layer NN (or deeper) can form a complex surface that separates out the various regions. We'll run our pgm thru both the 1 layer NN and the 2 layer NN.

Here's the link to the pgm assignment:

Planar_data_classification_with_onehidden_layer_v6c.html

This project has 3 python pgm, that we need to understand.

A. testCases_v2.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases_v2.py

B. planar_utils.py => this is a pgm that defines couple of functions. 

planar_utils.py

These functions are:

  • load_planar_dataset(): This function builds coordinates x1,x2 and the corresponding color y (red=0, blue=1). The array X=(x1,x2) and Y for all the points is returned. So, no database is loaded here from any h5 file; it's built within the function.
  • load_extra_datasets(): This loads other optional datasets such as blobs, circles, etc. These are in the same style as the petals, where linear logistic regression can never achieve high enough accuracy.
  • plot_decision_boundary(): This plots the 2D contour of the boundary where the function changes value from 0 to 1 or vice versa. However, this boundary is better visualized in 3D, so I added options for a 3D contour, 3D surface and 3D wireframe (on top of the default 2D contour). I've set the 3D surface as the default, as that gives the best visual representation.
  • sigmoid(): This calculates sigmoid for a given x (x can be scalar or an array)

We'll import this file in our main pgm.
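For reference, sigmoid() in planar_utils is just the standard logistic function, and load_planar_dataset() builds the flower pattern procedurally. Below is a rough sketch of what such a generator can look like: points are placed at radius r = 4*sin(petals*t) along angle t with a little noise, and the two classes alternate over ranges of t. The exact constants and seed in the course file may differ.

import numpy as np

def sigmoid(x):
    # works for scalars or numpy arrays
    return 1.0 / (1.0 + np.exp(-x))

def load_planar_dataset(m=400, petals=4, seed=1):
    # rough sketch of a flower-shaped 2-class dataset (red=0, blue=1)
    rng = np.random.RandomState(seed)
    N = m // 2
    X = np.zeros((m, 2))
    Y = np.zeros((m, 1), dtype="uint8")
    for j in range(2):                       # one pass per class
        ix = range(N * j, N * (j + 1))
        t = np.linspace(j * np.pi, (j + 1) * np.pi, N) + rng.randn(N) * 0.2
        r = 4 * np.sin(petals * t) + rng.randn(N) * 0.2
        X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
        Y[ix] = j
    return X.T, Y.T                          # shapes (2, m) and (1, m)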

C. test_cr1_wk3.py => This pgm calls the functions in planar_utils. Here, we define our algorithm for the 2 layer NN to find optimal weights by trying out the algorithm on the training data. We then apply those weights on the training data itself to predict whether the dots were red or blue. There is no separate testing data; we just want to see how well our surface fits the training data. Below is the whole pgm:

test_cr1_wk3.py

Below are the functions defined in our pgm (a condensed code sketch of the key steps appears right after this list):

  • layer_sizes() => Given X,Y as i/p array, it returns size of input layer, hidden layer and output layer
  • initialize_parameters() => initializes W1,b1 and W2,b2 arrays. W1, W2 are init with random values (Very important to have random values instead of 0), while b1,b2 are init to 0. It puts these 4 arrays in dictionary "parameters" and returns that. NOTE: To be succinct, we will use w,b to mean W1,b1,W2,b2, going forward.
  • forward_propagation() => It computes the output Y hat (i.e. output A2). Given X and parameters (which has all the w,b), this func calculates Z1, A1, Z2, A2, which are stored in the dictionary "cache" and returned. NOTE: here we didn't use the sigmoid func for both layers. Instead we used the tanh function for the 1st layer (hidden layer), and sigmoid for the next layer (output layer). The lectures explain why.
  • compute_cost() => computes the cost (the cross-entropy log loss computed from A2 and Y).
  • backward_propagation() => This computes gradients dw1, db1, dw2, db2 by using the formulas in lecture. It stores dw1, db1, dw2, db2 in dictionary "grads". It returns dictionary "grads". NOTE: above 3 functions were combined into one as propagate() in the previous exercise from week2, but here they are separated out for clarity.
  • update_parameters() => This function computes new w,b given the old w,b and dw,db. It doesn't iterate here; rather, the iteration is done in nn_model() below.
  • nn_model() => This is the main func that will be called in our pgm. We provide the training data array (both X,Y) as i/p to this func. It then returns to us the optimal parameters (w,b). It calls above functions as shown below:
    • calls func initialize_parameters() to init w,b,
    • It then iterates to find the optimal values of w,b that give the lowest cost. It forms a "for" loop for a predetermined number of iterations. Within each iteration, it calls these functions:
      • forward_propagation() => Given values of X,w,b, it computes A2(i.e Y hat). It returns A2 and cache.
      • compute_cost() => Given A2,Y, parameters (w,b), it computes cost
      • backward_propagation => Given X,Y, parameters (w,b) and cache (which stores intermediate Z and A), it computes dw,db and stores it in grads.
      • update_parameters() => This computes new values of w,b using old w,b and gradients dw,db. New "parameters" dictionary is returned.
    • In the beginning, w and b are initialized. We start the loop and, in the first iteration, we run the 4 functions listed above to get new w,b based on dw, db and the learning rate chosen. In the next iteration, we repeat the process with the newly computed values of w,b fed into the 4 functions to get newer dw, db and update w,b again. We keep repeating this process for "num_iterations", until we get optimal w,b which hopefully give a much lower cost than what we started with.
    • It then returns dictionary "parameters" containing optimal W1,b1,W2,b2
  • predict() => Given the input coordinate array X and weights w,b, it predicts Y (i.e. whether the point is blue or not). It uses the w,b calculated using the nn_model() function. It calls the forward_propagation() func to get A2 (i.e. Y hat). If A2>0.5, it sets the prediction to 1, else it sets it to 0, and returns the array "predictions".
  • Accuracy is then reported over all coordinates: what color they actually were vs what our pgm predicted.
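As mentioned above, here is a condensed sketch of those key steps: the forward pass with tanh then sigmoid, the cross-entropy cost, the gradient formulas from the lecture, and the parameter update. It assumes the "parameters" dictionary from the initialization sketch earlier; the exact variable names in the course notebook may differ slightly.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, p):
    # hidden layer uses tanh, output layer uses sigmoid (as in the lecture)
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)                         # Y hat
    return A2, {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

def compute_cost(A2, Y):
    # cross-entropy: the "log function of A2, Y" mentioned above
    m = Y.shape[1]
    return -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

def backward_propagation(p, cache, X, Y):
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)  # derivative of tanh is 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

def update_parameters(p, grads, learning_rate=1.2):
    # gradient descent step on all four parameter arrays
    return {"W1": p["W1"] - learning_rate * grads["dW1"],
            "b1": p["b1"] - learning_rate * grads["db1"],
            "W2": p["W2"] - learning_rate * grads["dW2"],
            "b2": p["b2"] - learning_rate * grads["db2"]}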

Below is the explanation of the main code (after we have defined our functions as above); a short code outline of this flow follows the list:

  1. We get our dataset X,Y from any of the multiple sets available. We have our petal flower set (which is the default set). We can also choose the optional noisy_circles, noisy_moons, blobs or gaussian_quantiles sets. We use the func load_planar_dataset() to load the petal dataset, while we use load_extra_datasets() to load the other 4 datasets. We plot the data X,Y in a scatter plot.
  2. We then run 2 classifiers on our data: 1 is logistic regression, while other is 2 layer NN:
    1. Logistic regression:
      1. Here we run logistic regression classifier on this X,Y dataset. Instead of building our own logistic regression classifier (as we did in week 2 exercise), we use sklearn's inbuilt classifier on X,Y set.
      2. We then use the func plot_decision_boundary() to plot the 2D/3D decision boundary (i.e. the predicted Y values, or Y hat values) to check how the fitting surface looks with the logistic regression classifier. It's a single sigmoid function as expected (with a straight line seen in the 2D contour)
      3. Then we print accuracy of logistic regression which is pretty low as expected.
    2. Two layer NN:
      1. Here we run our 2 layer NN. We call the function nn_model() with i/p X,Y and the hidden layer size set to 4.
      2. Next, we use func plot_decision_boundary() to plot 2D/3D decision boundary (the same way as in regression classifier)
      3. Then we print the accuracy of the NN, which is a lot higher than logistic regression.
  3. In the above exercise, we used a fixed size of 4 for our hidden layer. We would like to explore what increasing the hidden layer size does to the prediction accuracy. So, we repeat the same exercise as we did for the 2 layer NN, but now we vary the hidden layer size from 1 to 50. As expected, the larger the hidden layer, the more surfaces we have to play with, and hence the better the fit we can achieve. So, prediction accuracy goes to 90%.
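Putting the pieces together, the main flow described above boils down to something like the sketch below. It reuses the functions sketched earlier; the hidden layer sizes in the sweep and the accuracy formula are illustrative, not the exact notebook code.

import numpy as np

# outline of the main flow, reusing load_planar_dataset(), initialize_parameters(),
# forward_propagation(), compute_cost(), backward_propagation(), update_parameters()
X, Y = load_planar_dataset()                  # flower data, shapes (2, m) and (1, m)

def nn_model(X, Y, n_h=4, num_iterations=10000):
    n_x, n_y = X.shape[0], Y.shape[0]
    p = initialize_parameters(n_x, n_h, n_y)  # random W, zero b
    for i in range(num_iterations):
        A2, cache = forward_propagation(X, p)
        cost = compute_cost(A2, Y)
        grads = backward_propagation(p, cache, X, Y)
        p = update_parameters(p, grads)
        if i % 1000 == 0:
            print("cost after iteration %d: %f" % (i, cost))
    return p

def predict(p, X):
    A2, _ = forward_propagation(X, p)
    return (A2 > 0.5).astype(int)             # 1 = blue, 0 = red

params = nn_model(X, Y, n_h=4)
accuracy = float(np.mean(predict(params, X) == Y) * 100)
print("2 layer NN accuracy: %.1f%%" % accuracy)

# step 3 above: sweep the hidden layer size and compare accuracies
for n_h in [1, 2, 3, 4, 5, 20, 50]:
    params = nn_model(X, Y, n_h=n_h)
    acc = float(np.mean(predict(params, X) == Y) * 100)
    print("n_h = %2d -> accuracy %.1f%%" % (n_h, acc))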

Below are the plots for different hidden layer sizes (ranging from 1 to 20). NOTE: the number of layers is still 2.

1. Petal data: First we show plots for Petal data set

A. Below is how the petal data looks. Here the o/p Y is the color, while the i/p X is the pair of coordinates (x1,x2)

[plot: scatter of the petal data set]

B. When we run logistic regression on the above data to get the best fit, this is how the final output Y plot for logistic regression looks:

[plot: logistic regression decision boundary for the petal data]

C. Now, we run the same dataset with the optimal w,b calculated in our pgm above, but with different hidden layer sizes ranging from 1 to 20. Here we plot A2 (not Y, but Y hat), so that we can see what values these sigmoid plots actually reach (i.e. did they go all the way to 0 or 1, or were they stuck at in-between values). If we plot the final Y (predicted values), we lose this info. As can be seen, we get more and more tanh plots to arrange, and hence a better fit, as we increase the hidden layer size. A hidden layer size of 1 means only 1 tanh function, size=2 means 2 tanh functions, size=3 means 3 tanh functions, and so on. So, for size=3, the output A2 = σ(c1*tanh(z1) + c2*tanh(z2) + c3*tanh(z3) + c) can generate a lot more surface shapes than a single sigmoid can.

[plots: 2 layer NN output A2 for hidden layer sizes 1 to 20, petal data]

2. noisy circles data: Next we show data for Noisy circles data set

A. Below is how the noisy circles data looks. Here the o/p Y is the color, while the i/p X is the pair of coordinates (x1,x2).

B. When we run logistic regression on the above data to get the best fit, this is how the final output Y plot for logistic regression looks:

[plot: logistic regression decision boundary for the noisy circles data]

C. Now, we run the same dataset with the optimal w,b calculated in our pgm above, but with different hidden layer sizes ranging from 1 to 20. As in the petals case, we plot A2 (not Y, but Y hat). The results show the same thing as the petals case: we get a better fit as we increase the hidden layer size. Here the blue and red dots are more randomly spread, so more tanh functions need to be added together to separate out the red and blue dots, which means a larger hidden layer size helps.

[plots: 2 layer NN output A2 for hidden layer sizes 1 to 20, noisy circles data]

Summary:

By finishing this exercise, we learnt how to build a 2 Layer NN and figure out the optimal weights for coordinates (x1,x2) so that it can predict blue vs red dots. We played around with different hidden layer sizes, and saw that the larger the hidden layer, the better the fit, though beyond a certain optimal size, increasing the hidden layer doesn't add any extra value. We compared the results to those predicted by logistic regression. Logistic regression (which is basically a single layer NN) could never match the accuracy provided by the 2 layer NN.