Course 3 - Structuring ML projects

This course is a slight departure from the technical discussion of NN. It covers several practical techniques for dealing with ML projects. It has no programming assignments and only 2 sections. Both sections are theoretical only, and can be finished in 3-4 hours. Even if you skip this course altogether, you won't miss much. I've summarized the lectures below:

1. ML strategy 1: This covers the following:

A. Orthogonalization: This refers to choosing orthogonal "knobs", so that each knob controls or improves one specific aspect of your NN without affecting the others.

B. Single number evaluation metric: We should use a single metric to evaluate and compare performance across different ML algos.
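
The F1 score is a common example of collapsing two metrics (precision and recall) into one number for comparison. A minimal sketch, with made-up confusion-matrix counts for a cat classifier:

```python
# Hypothetical dev-set counts for a cat classifier; the numbers are
# illustrative, not from the lectures.
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall: one number per model."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Precision ~0.947, recall 0.90 collapse into a single comparable score.
f1 = f1_score(true_pos=90, false_pos=5, false_neg=10)
```

With a single number like this, picking the better of two classifiers becomes a simple comparison instead of a judgment call between two separate metrics.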

C. Satisficing and Optimizing metric: Out of the different metrics we use to evaluate our algo, some may be classified as "satisficing" metrics, where you just need to satisfy them (i.e. the performance needs to meet a certain threshold for that metric). Other metrics may be classified as "optimizing" metrics, where we really want our algo to optimize for that metric.
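
In code, the selection rule is: filter by the satisficing metric, then maximize the optimizing metric. A small sketch with made-up candidate models (accuracy is the optimizing metric, runtime the satisficing one; the 100 ms threshold is an illustrative assumption):

```python
# Hypothetical candidates: (name, accuracy, runtime_ms).
candidates = [
    ("A", 0.92, 80),
    ("B", 0.95, 150),   # best accuracy, but fails the latency threshold
    ("C", 0.93, 95),
]

# Satisfice: keep only models meeting the runtime threshold.
viable = [c for c in candidates if c[2] <= 100]
# Optimize: among those, pick the highest accuracy.
best = max(viable, key=lambda c: c[1])
```

Here model B is discarded despite having the best accuracy, because it fails the satisficing constraint; C wins among the survivors.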

D. Distribution: The distribution of data in the train set, dev set and test set should be similar; otherwise our algo may perform badly on sets containing data that is vastly different from the training data.

E. Size of train/dev/test set: In the big data era, where we have millions of training examples, we usually divide the available data into 98% training data, 1% dev data and 1% test data. We can do this because even 1% of a million examples is 10K data points, which is large enough to serve as a dev/test set.
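
A minimal sketch of such a split, shuffling indices before carving them up (the 98/1/1 fractions are from the notes; the function name and seed are mine):

```python
import numpy as np

def split_indices(n, train_frac=0.98, dev_frac=0.01, seed=0):
    """Shuffle example indices, then carve out train/dev/test slices
    using the big-data-era 98/1/1 split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return idx[:n_train], idx[n_train:n_train + n_dev], idx[n_train + n_dev:]

# With a million examples, 1% is still 10K points for dev and test each.
train, dev, test = split_indices(1_000_000)
```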

F. Weights: Sometimes we may want to assign different weights to different loss terms. For example, we may want to assign a much larger weight to the loss term when an elephant pic is identified as a cat pic, but a much lower weight when a bobcat is identified as a cat. This is done by multiplying each loss term by its weight term and then summing the products. To normalize the sum, we then divide it by the sum of the weights (instead of dividing by the number of examples). These weights are different from the weights we optimize during training.
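
A minimal sketch of this weighted, weight-normalized loss, using binary cross-entropy (the example values and the weight of 10 for the "elephant called a cat" case are illustrative assumptions):

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, weights):
    """Per-example binary cross-entropy, scaled by per-example weights
    and normalized by the sum of the weights (not by the count)."""
    eps = 1e-12  # guard against log(0)
    losses = -(y_true * np.log(y_pred + eps)
               + (1 - y_true) * np.log(1 - y_pred + eps))
    return np.sum(weights * losses) / np.sum(weights)

# Hypothetical batch: the first example is an elephant confidently
# misclassified as a cat, so it gets weight 10; the rest get weight 1.
y_true = np.array([0.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
w = np.array([10.0, 1.0, 1.0])
loss = weighted_cross_entropy(y_true, y_pred, w)
```

Because of the large weight, the badly wrong first example dominates the normalized loss, which is exactly the pressure we want the optimizer to feel.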

G. Human level performance: ML algos strive to reach human level performance. Bayes error is the lowest error you can possibly get, and for computer vision, human error is pretty close to Bayes error. So, once your ML algo reaches human level performance, you are pretty close to the lowest error possible. It's very hard to get even incremental improvements to the error once you reach human level.

The difference b/w human error and training set error is called "avoidable bias", as that error gap can be brought close to 0. The gap b/w training error and dev/test set error is called variance. Both avoidable bias and variance may be a problem for our ML project, so we have to be careful about which one to target more to get the lowest error on our dev/test set. Avoidable bias can be reduced by choosing a larger training model (deeper NN), or using better optimization algos such as Momentum, RMSprop, Adam, etc. To reduce variance, we can use a larger training set, or use regularization techniques such as L2, dropout, etc.
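
The decision rule reduces to comparing the two gaps. A minimal sketch (the error values are illustrative):

```python
def diagnose(human_err, train_err, dev_err):
    """Compare avoidable bias (human vs train gap) against variance
    (train vs dev gap) to decide which to attack first."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    return "bias" if avoidable_bias > variance else "variance"

# 7% gap to human level vs 2% train/dev gap: attack bias first
# (bigger model, better optimizer).
diagnose(0.01, 0.08, 0.10)
# 1% gap to human level vs 8% train/dev gap: attack variance
# (more data, regularization).
diagnose(0.01, 0.02, 0.10)
```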

2. ML strategy 2: This covers the following:

A. Analyzing error: It's important to analyze your errors, i.e. examine all the cat pics that were misclassified. Once we start categorizing these errors into different buckets, we can see exactly where our ML system is not working as expected. Sometimes the o/p label itself is incorrect (i.e. a cat pic is incorrectly labeled as "non cat"). This may or may not be worth fixing, depending on how severe the issue is. We also have to make sure that our training data and dev/test data come from the same distribution, else there will be a lot of variance. One way to find out whether variance is due to mismatched data b/w the training and dev sets is to carve out a small percentage of the training data as a "train-dev" set: don't train on this portion of the data, but evaluate on it like a dev set. If the variance is small on this train-dev set but large on the dev set, then that indicates a mismatch b/w the train data and the dev/test data. To address data mismatch, one other solution is to include as much varied data as possible in the training set, so that the ML system is able to optimize across all such data.
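
The train-dev diagnosis above can be sketched as a small decision function (the 2% gap threshold is my illustrative assumption, not a number from the lectures):

```python
def locate_problem(train_err, train_dev_err, dev_err, threshold=0.02):
    """Use a held-out train-dev set to tell plain variance apart
    from a train/dev data-mismatch problem."""
    if train_dev_err - train_err > threshold:
        # Error jumps even on same-distribution held-out data.
        return "variance"
    if dev_err - train_dev_err > threshold:
        # Fine on train-dev, bad on dev: the distributions differ.
        return "data mismatch"
    return "ok"

locate_problem(0.01, 0.015, 0.10)  # small train-dev gap, big dev gap
locate_problem(0.01, 0.09, 0.10)   # big train-dev gap
```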

B. Build system quickly and then iterate: It's always better to build a barely working system quickly, and then iterate a lot to fine-tune the ML system to reduce errors.

C. Transfer learning: This is where we reuse a model developed for one ML project in some other project with minimal changes. This is usually employed when we have very little training data to train our ML algo: we take parameters trained for some other ML project, and just replace the o/p layer parameters, or the parameters of the last couple of layers. This allows us to get very good performance. For example, in radiology image diagnosis, a NN developed for general image recognition may be reused, since both applications are similar.
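
A toy numpy sketch of the idea: keep the earlier layers' pretrained weights frozen, and reinitialize only a new output layer for the small-data task (all shapes, names, and the pretrained weights themselves are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for weight matrices from a large pretrained model
# (e.g. general image recognition); purely illustrative.
pretrained = [rng.standard_normal((16, 8)), rng.standard_normal((8, 4))]

def transfer(pretrained_layers, n_new_outputs):
    """Freeze all but the last layer; replace the o/p layer with
    freshly initialized weights sized for the new task."""
    frozen = [W.copy() for W in pretrained_layers[:-1]]
    n_in = pretrained_layers[-1].shape[0]
    new_head = rng.standard_normal((n_in, n_new_outputs)) * 0.01
    return frozen, new_head

frozen, head = transfer(pretrained, n_new_outputs=2)
# During fine-tuning on the small dataset, only `head` (and perhaps
# the last frozen layer) would receive gradient updates.
```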

D. Multi task learning: This is where we use the same model to do multiple things instead of one thing. An example is an autonomous car, where the image recognition model needs to identify images of cars, pedestrians, stop signs, etc. all at the same time. Instead of building a separate NN for each of them, we can build a single NN with many different o/p values, where each o/p value is for a specific task such as other car, pedestrian, stop sign, etc.
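
A toy forward pass for such a multi-o/p network: one shared representation, one sigmoid output per task (not a softmax, since the labels are independent yes/no questions). The shapes, weights and the -1 "not annotated" convention shown here are illustrative:

```python
import numpy as np

def multitask_forward(x, W, b):
    """Shared features x feed one sigmoid o/p per task
    (car? pedestrian? stop sign? ...), each an independent yes/no."""
    logits = x @ W + b                     # shape: (n_tasks,)
    return 1.0 / (1.0 + np.exp(-logits))   # independent probabilities

def multitask_loss(y_true, y_prob):
    """Sum binary cross-entropy over tasks, skipping entries
    labeled -1 (task not annotated for this example)."""
    eps = 1e-12
    mask = y_true != -1
    yt, yp = y_true[mask], y_prob[mask]
    return -np.sum(yt * np.log(yp + eps) + (1 - yt) * np.log(1 - yp + eps))

x = np.array([0.5, -1.2, 0.3])             # toy shared features
W = np.zeros((3, 4))                        # 3 features -> 4 tasks
b = np.array([2.0, -2.0, 0.0, 0.0])         # toy biases
probs = multitask_forward(x, W, b)
loss = multitask_loss(np.array([1.0, 0.0, -1.0, 1.0]), probs)
```

Summing per-task losses lets one network learn all tasks at once, and masking un-annotated labels means every example contributes to whichever tasks it does have labels for.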

E. End to End Deep Learning: This is where a NN takes an i/p and produces an o/p directly, w/o requiring intermediate processing steps. As an example, transcribing an audio clip to text traditionally required a complex multi-step pipeline to work. But with large amounts of big data, a deep NN can learn to produce the transcription directly from the i/p data, w/o any intermediate pipeline. Sometimes we do divide the task into 2-3 intermediate steps before applying DL, as that performs better. There are real-life examples of both kinds: cases where end-to-end DL works better, as well as cases where breaking the task down into a couple of smaller steps works better.