4.1 - Foundations of CNN
Foundations of CNN - Course 4 week 1
This course goes over the basics of CNNs. Detecting edges is the basic motivation behind CNNs: in any picture, we want to detect horizontal and vertical edges so that we can identify the boundaries of the different objects in the picture.
We construct a filter (or a kernel) with some dimension, and then convolve it with a picture to get an output. The convolution operator is denoted by an asterisk (*), which is the same symbol used for multiplication. This causes confusion, but it is the notation used for the convolution operation in digital signal processing, so we use the same operator here. In Python, the function "conv_forward" does the convolution, while in TF, tf.nn.conv2d does the job.
Convolution just applies this operation for a given filter to all parts of the picture, one part at a time. When convolving, we multiply each entry of the filter element-wise with the corresponding entry of the picture, and sum them up to get a single number. See the example explained in the lecture.
ex: A 6x6 matrix convolved with a 3x3 matrix gives 4x4 matrix.
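As a minimal sketch of this (my own toy code, not the assignment's conv_forward; the names conv2d_valid, image, and kernel are just illustrative), the "multiply element-wise and sum" step can be written with plain numpy loops:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D conv (no padding, stride 1): slide the filter over the image,
    multiply element-wise, and sum to get one number per position.
    Like deep learning libraries, the filter is not flipped (cross-correlation)."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

out = conv2d_valid(np.random.randn(6, 6), np.random.randn(3, 3))
print(out.shape)   # (4, 4), matching n - f + 1 = 6 - 3 + 1 = 4
```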
Edge Detectors:
An example of a vertical edge detector would be a 3x3 filter with the 1st column all 1s, the 2nd column all 0s, and the 3rd column all -1s. This detects vertical edges if we associate positive numbers with whiteness, negative numbers with darkness, and 0 with being in between white and black (i.e. gray). We can also make a horizontal detector by switching rows with columns, i.e. the 1st row is all 1s, the 2nd row is all 0s, and the 3rd row is all -1s.
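To see the vertical detector fire, here is a toy check (my own example, not from the lecture assignment) on a 6x6 image whose left half is bright and right half is dark, using scipy's correlate2d since deep-learning "convolution" slides the filter without flipping it:

```python
import numpy as np
from scipy.signal import correlate2d

# Vertical edge detector: 1st column all 1s, 2nd all 0s, 3rd all -1s.
vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])
horizontal = vertical.T      # horizontal detector: rows and columns swapped

# Toy 6x6 image: bright (10) on the left half, dark (0) on the right half.
image = np.zeros((6, 6))
image[:, :3] = 10

edges = correlate2d(image, vertical, mode='valid')
print(edges)
# The two middle columns of the 4x4 o/p are 30 (the vertical edge),
# the outer columns are 0 (flat regions with no edge).
```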
Instead of hard coding these 9 values in an edge detector filter, we can define them as 9 parameters, w1 to w9, and let the NN pick the most optimal numbers. Back propagation is used to learn these 9 parameters, and this gives the best results.
Padding and Striding:
Valid conv: no padding is used, so the o/p matrix shape is not the same as the i/p matrix shape (the o/p is smaller).
An n x n picture convolved with an f x f filter gives a matrix with dimension (n-f+1) x (n-f+1). That's why a 6x6 matrix convolved with a 3x3 filter gave a 4x4 o/p (as n=6, f=3, so o/p = 6-3+1 = 4).
Same conv: To keep the dimension of the o/p the same as that of the i/p pic, we can use padding, where we pad the picture with extra pixels along its border. This involves adding rows or cols of 0s (padding with 0 is the usual choice, though other values are possible). We choose the padding number p such that the o/p matrix dim remains the same as that of the i/p pic.
With padding p, an n x n picture (padded with p pixels on each side of the border) convolved with an f x f filter gives a matrix with dimension (n+2p-f+1) x (n+2p-f+1). That's why a 6x6 matrix (with p=1) convolved with a 3x3 filter gives a 6x6 o/p (as n=6, p=1, f=3, so o/p = 6+2-3+1 = 6). So the o/p matrix retains the same shape as the i/p matrix.
For any general shape of the i/p matrix, we have to choose p such that the o/p matrix shape is the same as the i/p matrix shape. For that to happen, n+2p-f+1 = n => p = (f-1)/2. So, for a filter of size f=3, we have to choose p = (3-1)/2 = 1.
With padding, we increase the size of the o/p matrix. Striding does the opposite: it reduces the size of the o/p matrix. Striding means we jump by more than 1 position when moving the filter between adjoining boxes. So far, we used a stride of 1 for all our conv, but we could use any stride such as 2, 3, etc. We do this stride or skipping in both the horizontal and vertical directions.
With stride s, an n x n picture (padded with p pixels on each side of the border) convolved with an f x f filter gives a matrix with dimension floor((n+2p-f)/s + 1) x floor((n+2p-f)/s + 1). We use the floor function in case the numbers don't divide to give an integer.
By choosing the padding p = (f-1)/2 (with stride s=1) as above, we can do a "same conv" for any filter size; a small helper for the general o/p dimension formula is sketched below.
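A small helper encoding the general o/p dimension formula (hypothetical code, just to make the arithmetic concrete):

```python
from math import floor

def conv_output_dim(n, f, p=0, s=1):
    """o/p size of an n x n i/p convolved with an f x f filter,
    with padding p on each side and stride s: floor((n + 2p - f)/s + 1)."""
    return floor((n + 2 * p - f) / s + 1)

print(conv_output_dim(6, 3))            # 4  -> valid conv: 6x6 * 3x3
print(conv_output_dim(6, 3, p=1))       # 6  -> same conv with p = (f-1)/2 = 1
print(conv_output_dim(7, 3, p=0, s=2))  # 3  -> stride 2 shrinks the o/p
```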
Convolution over Volume:
So far we have been doing conv over a 2D matrix. We can extend this concept to do conv over a volume (i.e. a 3D matrix). In such a case, the i/p matrix is 3D (where the 3rd dimension is the channel, i.e. each 2D slice is for a separate color R, G, B). The filter is also 3D. The o/p matrix in such a case is still 2D with the same dim as before, (n-f+1) x (n-f+1) (assuming p=0 and s=1).
Conv over a volume works the same as conv over an area: the multiplication and addition are done over all elements including the 3rd dim. So, the o/p returned for each conv operation is still a single value for one given box.
However, if we have more than 1 filter in the conv operation (i.e. one filter for vertical edge detection, another filter for horizontal edge detection, and so on), then the o/p matrix becomes a 3D matrix, with one 2D slice per filter.
For N filters applied to an i/p pic with dim n x n x nc and filters with dim f x f x nc, the o/p matrix shape would be (n-f+1) x (n-f+1) x N.
Note that nc, the number of channels in the i/p, has to match the number of channels of the filter.
Ex: An i/p pic of 6x6x3 conv with 2 filters of shape 3x3x3 gives o/p matrix of shape 4x4x2 (since n=6, f=3, nc=3 and N=2)
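A toy shape check for this example (my own sketch; the variable names are illustrative): each filter reduces an element-wise product over all f x f x nc entries to a single number, and stacking the N=2 filters gives the 4x4x2 o/p:

```python
import numpy as np

x = np.random.randn(6, 6, 3)            # i/p volume: n=6, nc=3
filters = np.random.randn(2, 3, 3, 3)   # N=2 filters, each 3x3x3
n, f, N = 6, 3, 2

out = np.zeros((n - f + 1, n - f + 1, N))
for k in range(N):                       # one 2D o/p slice per filter
    for i in range(n - f + 1):
        for j in range(n - f + 1):
            # element-wise product over all 3x3x3 entries, summed to one number
            out[i, j, k] = np.sum(x[i:i+f, j:j+f, :] * filters[k])

print(out.shape)                         # (4, 4, 2)
```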
1 Layer of CNN:
For a CNN also, we have multiple layers as in a deep NN. In a deep NN, for each layer, we compute the activation a[l] = g(z[l]), where g is the activation function used for that layer and z[l] = w[l] * a[l-1] + b[l] (* here means matrix multiplication).
In a CNN, for each layer, we compute a convolution instead of a matrix multiplication. So, for the i/p layer a[0], z[1] = w[1] * a[0] + b[1], where w[1] is the filter matrix and b[1] is the offset added as before; here the asterisk * refers to the convolution operation. Then we use an activation function such as ReLU or sigmoid to compute the o/p matrix a[1] = g(z[1]). This is true even if we have more than 1 filter; the weight matrix just gets one extra dim for the number of filters.
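A one-layer sketch of this computation with tf.nn.conv2d (toy shapes of my own: a single 6x6x3 i/p and 2 filters, so nc[1] = 2):

```python
import tensorflow as tf

a0 = tf.random.normal([1, 6, 6, 3])   # a[0]: [batch, nh, nw, nc]
w1 = tf.random.normal([3, 3, 3, 2])   # w[1]: f x f x nc[0] x nc[1]
b1 = tf.zeros([1, 1, 1, 2])           # b[1]: one bias per filter

z1 = tf.nn.conv2d(a0, w1, strides=1, padding='VALID') + b1   # z[1] = w[1] * a[0] + b[1]
a1 = tf.nn.relu(z1)                                          # a[1] = g(z[1])
print(a1.shape)                                              # (1, 4, 4, 2)
```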
In general, for each layer "l", we have the following relations (a small helper encoding these formulas is sketched after this list):
f[l] = filter size
p[l] = padding size
s[l] = stride size
nc[l] = number of filters. Each filter is of dim f[l] x f[l] x nc[l-1]
dim for "l"th i/p layer a[l-1] = nh[l-1] x nw[l-1] x nc[l-1] where nh = number of pixels across height of pic, nw = number of pixels across width of pic, nc = number of color channels of pic (for RGB, we have 3 channels),
dim for "l"th o/p layer a[l-1] = nh[l] x nw[l] x nc[l] where nh[l] = floor( (nh[l-1] + 2p[l] - f[l])/s[l] + 1 ) , nw[l] = floor( (nw[l-1] + 2p[l] - f[l])/s[l] + 1 )
For m examples, A[l-1] = m x nh[l] x nw[l] x nc[l]
dim of weight matrix w[l] = f[l] x f[l] x nc[l-1] x nc[l], where nc[l], is the number of filters in layer "l"
dim of bias matrix b[l] = 1 x 1 x 1 x nc[l] => bias is a single number for each filter, so for nc[l] filters, we have nc[l] parameters.
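Putting these relations into a small helper (hypothetical code; the 39x39x3 i/p and 10 filters are just example numbers):

```python
from math import floor

def conv_layer_dims(nh_prev, nw_prev, nc_prev, f, p, s, nc):
    """o/p shape and parameter count of conv layer l, following the formulas above."""
    nh = floor((nh_prev + 2 * p - f) / s + 1)
    nw = floor((nw_prev + 2 * p - f) / s + 1)
    n_weights = f * f * nc_prev * nc        # entries of w[l]
    n_biases = nc                           # one bias per filter
    return (nh, nw, nc), n_weights + n_biases

shape, n_params = conv_layer_dims(39, 39, 3, f=3, p=0, s=1, nc=10)
print(shape, n_params)   # (37, 37, 10) and 3*3*3*10 + 10 = 280 parameters
```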
Example of Conv NN: provided in lecture
3 Types of layers in a conv NN: Just using convolution layers may suffice to give good results, but in practice, supplementing CONV layers with POOL layers and FC layers gives better results. (A minimal example combining all three is sketched after this list.)
- convolution layer (CONV): This is about using the convolution operator.
- Pooling layer (POOL): This takes the max or average over a window (subset) of the matrix, so as to reduce the size of the matrix.
- Fully connected layer (FC): This is similar to a conventional NN, where we connect each i/p entry to each o/p entry, which results in a lot of weights being used. But since we use the FC layers in the last few stages of the NN, the size of the matrix is greatly reduced by that time, resulting in fewer entries in the weight matrix.
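A minimal sketch combining all three layer types in Keras (a toy network of my own, assuming a hypothetical 64x64 RGB i/p and a binary label, not the exact architecture from the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, padding='same', activation='relu'),    # CONV
    tf.keras.layers.MaxPooling2D(2),                                     # POOL
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu'),   # CONV
    tf.keras.layers.MaxPooling2D(2),                                     # POOL
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation='sigmoid'),                      # FC o/p layer
])
model.summary()   # the FC layer only sees 16*16*16 = 4096 features, not 64*64*3 raw pixels
```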
Reasons for using a conv NN (see the last lecture; a rough parameter count comparison is sketched after this list):
- Parameter sharing: The same conv filter (and hence the same parameters) can be used at multiple places in the image.
- Sparsity of connections: Not every i/p needs to be connected to every o/p, since each o/p value depends only on a small subset of the i/p matrix.
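A quick back-of-the-envelope comparison (my own numbers, reusing the 6x6x3 -> 4x4x2 example from above) showing how much parameter sharing and sparse connections save:

```python
# Conv layer: 2 filters of 3x3x3 plus one bias each; parameters are shared across positions.
conv_params = 2 * (3 * 3 * 3 + 1)   # 56

# Fully connected layer mapping the same 6*6*3 = 108 inputs to 4*4*2 = 32 outputs.
fc_params = 108 * 32 + 32           # 3488

print(conv_params, fc_params)       # 56 vs 3488 parameters
```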
Finding optimal values of Weights:
We use the same technique of gradient descent to find the weights (filters and biases) that give the lowest value of the cost. The derivation is not shown in the programming assignment, but look in my hand written notes.
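A minimal sketch of learning a filter by gradient descent using TF's automatic differentiation (toy random data and an MSE loss, my own assumptions rather than the assignment's hand-derived backprop):

```python
import tensorflow as tf

x = tf.random.normal([8, 6, 6, 3])                     # 8 toy i/p images
y = tf.random.normal([8, 4, 4, 2])                     # toy targets with the matching o/p shape
w = tf.Variable(0.1 * tf.random.normal([3, 3, 3, 2]))  # filter weights to be learned
b = tf.Variable(tf.zeros([2]))                         # one bias per filter

opt = tf.keras.optimizers.SGD(learning_rate=0.01)
for step in range(100):
    with tf.GradientTape() as tape:
        z = tf.nn.conv2d(x, w, strides=1, padding='VALID') + b
        loss = tf.reduce_mean(tf.square(tf.nn.relu(z) - y))
    grads = tape.gradient(loss, [w, b])                # gradients via back propagation
    opt.apply_gradients(zip(grads, [w, b]))            # one gradient descent step
```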
Assignment 1:
Assignment 2: