Statistics - Regression

Regression analysis:

This is a term used extensively in AI, and is a common starting point in AI. In statistics, regression analysis is a set of statistical processes for estimating the relationship b/w a dependent variable (also commonly called the outcome variable) and one or more independent variables (often called 'predictors', 'covariates', or 'features'). For ex: heart attack vs weight. Here heart attack is the dependent var (on the Y axis), which depends on weight, an independent var (on the X axis). We are trying to find a relationship b/w the 2, and see if they are related, i.e. does higher weight cause more heart attacks, etc.

Correlation Coefficient (R):  R is a correlation coeff that measures how well X and Y in a given dataset are correlated, i.e. if X changes by a certain amount, does Y also change by a proportional amount. The correlation of 2 random variables X and Y is the strength of the linear relationship between them. It's a number b/w -1 and +1 (-1 meaning perfect -ve correlation, +1 meaning perfect +ve correlation, and 0 meaning no correlation).

There are many types of correlation coeff, but the most commonly used is Pearson's correlation coeff (represented by "r" or "R"). To compute R mathematically, we define it as follows:

Pearson's r = R = Correlation(X,Y) = Cov(X,Y) / (σ(X) * σ(Y)) => Correlation exhibits the same properties as covariance, as it is defined the same way. However, we divide it by the std deviation terms to normalize it, so that correlation remains b/w -1 and +1. See the statistics section for the definitions of variance and covariance.
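A quick way to check this numerically is below. This is a minimal numpy sketch; the data arrays are made up for illustration.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # made-up X data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])          # made-up Y data, roughly 2*X

# Pearson's r = Cov(X,Y) / (std(X) * std(Y))
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

# numpy's built-in version, for comparison
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)                         # both close to +1 for this data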

The most common form of regression analysis is Linear Regression. A closely related technique is logistic regression.

 


 

Linear Regression:

Linear Regression is a linear approach used in statistics to model the relationship b/w the o/p response (dependent var Y) and the i/p parameters (independent explanatory vars x0, x1, x2 ...). In simple terms, it's an X,Y plot, where numerous (X,Y) data points are given. Our goal is to find an eqn that very closely fits all the data points. Since this is a linear approach, data is fitted with a straight line (Y = mX + b), and the loss or error is calculated by summing the squares of the difference for each data point. Minimizing this loss gives us the best fit, and is called the "least squares" approach to fitting models to data. Linear fitting or linear regression is the simplest approach, and works well, so it's very widely used.
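A minimal sketch of such a straight-line least-squares fit, using numpy (the data values are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up X data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])      # made-up Y data

# best-fit line Y = m*X + b via least squares
m, b = np.polyfit(x, y, deg=1)

y_pred = m * x + b
squared_error = np.sum((y - y_pred) ** 2)    # the quantity least squares minimizes

print(m, b, squared_error)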

NOTE: We need both a Fitting func and an Error func. W/O defining the Error func, we have no definitive way to quantify how well our fitting func fitted the data. Generally, by getting insight into the fitting func, we are able to come up with an error func. Finding the Fitting func is the harder part.

There are 2 kinds of Linear regression.

1. Simple linear regression:  Here there is only one explanatory var on which the o/p response depends. Let's say the weight of a person depends on his height; then we can have Y (weight of person) plotted against X (height of person). We'll collect a lot of (X,Y) data, plot it, and then do a best linear fit by drawing a line Y=mX+b thru that data. This is simple linear regression.

2. Multiple linear regression:  Here there is more than one explanatory var on which the o/p response depends. Let's say in the above ex, the weight of a person depends on race along with height. Then we have 2 explanatory vars (height and race) on which the o/p Y (weight) depends. We'll collect a lot of (X0, X1, Y) data, plot it, and then do a best linear fit by fitting a plane Y=m0X0 + m1X1 + b (here X0 is height and X1 is race) thru it. This is a 3D plot (i.e. the equation of a plane in 3 vars) with 3 axes X, Y, Z, where X,Y are the two i/p axes, and Z is the o/p axis. Similarly, it's an n-dimensional plot (eqn of a plane in n dimensions) for "n" i/p vars. Since it's the eqn of a plane, it's flat and can't be zig-zag, so if the data is zig-zag, it may not fit very well. NOTE: here we don't have b0, b1 separately, since all of them can be clubbed into 1 var b (as b = b0 + b1 + ...).
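A minimal sketch of a multiple linear fit with two i/p vars, using numpy's least-squares solver (the data values are made up):

import numpy as np

# made-up data: two explanatory vars X0, X1 and one o/p Y
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 0.0],
              [4.0, 1.0],
              [5.0, 0.0]])
y = np.array([3.0, 6.5, 7.2, 10.4, 11.1])

# add a column of ones so the solver also finds the intercept b
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])

# solves for [m0, m1, b] that minimize the sum of squared errors
coeffs, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
m0, m1, b = coeffs

y_pred = X_with_bias @ coeffs
print(m0, m1, b)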

Error func:

We choose our error func for best fit to be something that sums up the differences b/w the actual and predicted values. We take squares, since we want both +ve and -ve differences to be treated as errors, and not to cancel each other out. So, our error func to determine best fit is the sum of ( Ygiven - Ypredicted )^2 over all samples, and we try to minimize it. We use calculus to come up with the values of m, b that minimize this error. Mean Square Error (MSE) is the mean of this squared error, obtained by dividing the sum by the number of samples. We divide by the number of samples to get the avg error, so that the error func doesn't keep going up as we increase the number of samples. Root Mean Square Error (RMSE) is the square root of MSE, so that RMSE has the same units as Y.
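A minimal sketch of these error metrics (the arrays are made up):

import numpy as np

y_given = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
y_pred  = np.array([2.2, 4.0, 6.1, 8.0, 10.1])

sse  = np.sum((y_given - y_pred) ** 2)   # sum of squared errors (what least squares minimizes)
mse  = sse / len(y_given)                # mean squared error
rmse = np.sqrt(mse)                      # same units as Y

print(sse, mse, rmse)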

Coefficient of determination (R2 or r2 or R-squared): Above we saw that R specifies the extent of the linear relationship b/w vars X,Y. R2 is another term used to specify the goodness of fit for a model. It has multiple definitions. The most widely used is that it is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). This link explains it nicely: http://danshiebler.com/2017-06-25-metrics/

Pearson's r2 (the square of Pearson's correlation coefficient r) is simply the square of Pearson's r value, and is a number b/w 0 and 1. It's not the same as the R2 we talked about above, but in special cases (e.g. simple linear regression fit by least squares) they become the same.

R2 is most commonly used in regression analysis, where we are more interested in how well our predicted data fits the actual data. This R-squared is different from the "squared error" we talked about above. R2 measures how much better the fit is compared to simply predicting the mean. It is typically a number b/w 0 and 1, and is calculated as follows:

R2 = 1 - (MSE / variance(Y)) =>  Variance(Y) is the MSE we would get if we just drew the mean line, a horizontal line at the mean of Y that doesn't change at all with X. This is the baseline "worst" fit we would settle for, where we return the mean value of Y for any given X. As long as our fit is at least as good as the mean line, MSE is smaller than variance(Y), so R2 is b/w 0 and 1. In the worst case, Y_predicted may be the same as the mean line, so R2 = 0 (very bad correlation of predicted data), while in the best case, Y_predicted is exactly the same as Y_given, implying R2 = 1 (very good correlation of predicted data). Hypothetically, R2 can range from -ve infinity to 1, since we can always do worse than the mean prediction by choosing a line for Y_predicted that goes in the opposite direction to the real trend. However, that would be intentional. We choose the mean as the worst case, since we can always choose the mean line as our starting point, and see if we can do better than the mean; else we stay at the mean line for our predicted values too. From the formula, we can see why we call it "squared" (since we don't take a square root, but instead keep squared terms in both the numerator and the denominator).
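A minimal sketch of this R2 formula (same made-up arrays as in the MSE sketch above):

import numpy as np

y_given = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
y_pred  = np.array([2.2, 4.0, 6.1, 8.0, 10.1])

mse   = np.mean((y_given - y_pred) ** 2)
var_y = np.var(y_given)            # MSE of the "just predict the mean" baseline

r2 = 1 - mse / var_y
print(r2)                          # close to 1 here, since y_pred tracks y_given closely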

R2 = (variance(Y) - variance(Y - Y_predicted)) / variance(Y) = 1 - (variance(Y - Y_predicted) / variance(Y)) => This is another formula for R2, written in terms of the variance of the residuals (Y - Y_predicted). For a least-squares fit with an intercept, the residuals average to zero, so their variance is just the MSE, and this boils down to the same formula as above. The above link explains it.

R2=0.25 means that only 25% of the original variation in the data is explained by the given relationship; the other 75% of the variation is coming from some other relationship that we don't know yet. So, here the correlation is weak (in other words, we are only 25% of the way between the mean and a perfect fit). On the other hand, R2=0.9 means that 90% of the original variation in the data is explained by the given relationship, so the correlation is very strong.

Higher order non linear eqn: Above we used linear regression, which used the eqn Y=f(X). For a better fit, we could go to higher order eqns (i.e. Y=f(X^2), Y=f(X^3), ...) and those will fit the plot better, but they get more complex. Turns out that higher order eqns that look non linear are actually still linear regression. Let's say we have 2 i/p vars, X0 and X1. In linear regression Y=f(X0, X1). However, if we go to a 2nd order eqn, then Y=f(X0, X1, (X0)^2, (X1)^2, X0*X1). If we choose X2=(X0)^2, X3=(X1)^2, X4=X0*X1, then the eqn can be expressed as Y=f(X0, X1, X2, X3, X4). This eqn is a linear eqn; it just happens to have 5 i/p vars, instead of the 2 that we had before. So, higher order non linear eqns can be treated as Multiple linear regression for all analysis. However, be careful as these higher order eqns may cause overfitting, and may not represent real world effects.
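A minimal sketch of this trick: the 2nd-order terms are built by hand from the original i/p vars and then fed to the same linear least-squares solver (all data values are made up):

import numpy as np

x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])          # made-up i/p var X0
x1 = np.array([0.5, 1.2, 0.8, 2.0, 1.5, 2.2, 1.1, 2.7])          # made-up i/p var X1
y  = np.array([2.0, 7.5, 10.2, 21.0, 27.5, 40.1, 44.3, 71.0])    # made-up o/p Y

# build the extra "inputs" X2, X3, X4 from X0 and X1, plus a column of ones for b
X = np.column_stack([x0, x1, x0**2, x1**2, x0*x1, np.ones_like(x0)])

# same linear least-squares machinery, now with 5 i/p vars plus the intercept
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)    # [m0, m1, m2, m3, m4, b]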

 


 

Logistic Regression:

Logistic regression (or Logit) is closely related to linear regression (formally, it's a generalized linear model built on top of a linear eqn). In many textbooks, it's not even referred to as logistic regression, but rather as logistic classification. Here the o/p Y can't have infinite values (i.e. Y is not a continuous function) but can only have a certain number of distinct possibilities, e.g. there may be only 4 outcomes: a shape is a square, circle, rectangle or triangle. However, there are key differences b/w linear and logistic regression. One is that logistic regression predicts the probability of particular outcomes rather than the outcomes themselves, so predictions are restricted to values 0 to 1. So, the o/p Y reps the probability of that event happening for a given X. Second is that the conditional distribution is a "Bernoulli distribution" rather than a "Gaussian distribution", because the dependent variable is binary: for a given X, the o/p can only come out 0 or 1, like a biased coin flip with success probability p(X), which is exactly what a Bernoulli distribution describes. In linear regression, by contrast, the o/p for a given X is modeled as Gaussian noise around the fitted line.

Logistic regression is very nicely explained on StatQuest: https://www.youtube.com/watch?v=yIYKR4sgzI8

There are multiple types of logistic regression:

1. Binomial or binary logistic regression: They deal with situations in which the observed outcome for a dependent variable can have only two possible types, "0" and "1" (which may represent, for example, "dead" vs. "alive" or "win" vs. "loss").

2. Multinomial logistic regression: They deal with situations where the outcome can have three or more possible types (e.g., "disease A" vs. "disease B" vs. "disease C") that are not ordered

Since in Logistic regression Y is not continuous but has distinct values, we don't try to fit Y directly, but instead try to fit the probability of Y, employing similar curve fitting methods:

Linear model: Y=m0X0 + m1X1 + b => Linear model with 2 predictors X0,  X1

Logistic Model: L = logb(p/(1-p)) = m0X0 + m1X1 + c, where p = P(Y=1) and L = log odds of the event that Y=1. => Logit model with 2 predictors X0, X1. The logistic model predicts the probability p that Y=1 for a given X. It assumes a linear relationship between the predictors X0, X1 and the log odds of the event that Y=1. The base b is usually taken as "e", but sometimes base 10 or 2 is used too. I changed the Y intercept to c here, so as not to confuse it with the base b.

NOTE: the model works with the "odds of the event that Y=1", and NOT the "probability of the event that Y=1". The odds of the event that Y=1 is "probability of the event that Y=1 / probability of the event that Y≠1". So, if P(Y=1)=0.5, then P(Y≠1)=0.5, so the odds of the event that Y=1 is 1 and NOT 0.5, implying that the odds of Y=1 are the same as the odds of Y≠1. If P(Y=1)=0.8, then P(Y≠1)=0.2, so the odds of the event that Y=1 are 0.8/0.2 = 4, implying that the odds of Y=1 are 4 times the odds of Y≠1.
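A minimal sketch of the probability -> odds -> log odds conversion (the probabilities are made up):

import numpy as np

p = np.array([0.5, 0.8, 0.2, 0.99])      # made-up values of P(Y=1)

odds     = p / (1 - p)                   # odds that Y=1
log_odds = np.log(odds)                  # the quantity the logit model is linear in

print(odds)                              # [1.0, 4.0, 0.25, 99.0]
print(log_odds)                          # 0 when p=0.5, +ve when p>0.5, -ve when p<0.5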

Now, the question is why we do logistic regression this way: why don't we just use the same fitting methodology as Linear Regression, i.e. why not do "Y=m0X0 + m1X1 + b", instead of "logb(p/(1-p)) = m0X0 + m1X1 + c"? The reason is that such a straight line will never be able to achieve a good curve fit, as it will have to run thru the middle to fit the data. A portion of the data is saturated on the lower end, while the remaining portion is saturated on the higher end. No matter what slope or Y-intercept we choose, the error will be enormous, as the line will always run thru the middle to minimize error, basically always predicting that the value is 0.5 or so (which makes no sense). For binary classification, we need something non-linear like an S shape, which will fit the values with the least error. Also, we need a probability here, as we'll never be able to get exactly 0 or 1 values when predicting Y; i.e. if we try to predict using Y=m0X0 + m1X1 + b, then for a given X we may get Y=0.7, but that value is meaningless as Y is either 0 or 1. But if we use probability here, then p(Y=1) for a given X makes more sense, as p(Y=1) = 0.7 means there is a 70% chance that Y=1 for the given X.

If we choose base b=e and solve for p=P(Y=1) using the eqn above, we get a sigmoid function, i.e. p = σ(z) = 1 / (1 + e^(-z)) where z = m0X0 + m1X1 + c. So, it's easier to understand if we assume that the sigmoid func came first. It constrains our o/p values to b/w 0 and 1, gives a probability func, and that fits our requirement well. So, by taking the sigmoid of our predicted linear value, we ended up getting that log func of p/(1-p).

This sigmoid is the standard logistic func used to fit the data. It gives us the probability of the o/p Y being 1 for a given i/p X, rather than giving the value of the o/p Y directly. So, our predicted Y data (which is a probability) is always between 0 and 1.
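A minimal sketch of going from the linear part z to the predicted probability (the weights, intercept and input values are made up):

import numpy as np

def sigmoid(z):
    # standard logistic function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up fitted weights and intercept for 2 predictors
m0, m1, c = 1.5, -0.8, 0.2

x0, x1 = 2.0, 1.0                 # one made-up input sample
z = m0 * x0 + m1 * x1 + c         # the linear part (the log odds)
p = sigmoid(z)                    # predicted probability that Y = 1

print(z, p)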

We could have chosen some other eqn too, i.e "logb(p) = m0X0 + m1X1 + b", however that may give a worse fit. It's just a conjecture, I don't know that. Assuming sigmoid is the best function  fitting our requirement, let's calculate the error.

Error func:

Now the question that comes to mind is how do we calculate the error for this function to get the best fit. Can we use the "residual square" method used in linear regression? Turns out that if we do residual squares here, we end up getting a non convex graph with many local minima for Logistic regression. For linear regression, we ended up getting a beautiful convex graph that had 1 local minimum, so it was easy to find the lowest cost. This link explains it very well.

https://towardsdatascience.com/optimization-loss-function-under-the-hood-part-ii-d20a239cde11

As explained in the link, a better function to minimize is the cross-entropy error: -[ Ygiven * Log( Ypredicted ) +  (1-Ygiven) * Log( 1-Ypredicted ) ]

If we plot this error function using the desmos.com graphing utility, with Ygiven set equal to Ypredicted (both equal to x), we'll see that the function is shaped like an upside-down parabola (an umbrella), with 0 at both ends (at x=0 and x=1) and a max value around x=0.5. So, when both are the same value and that value is 0 or 1, we predicted perfectly and the error func is 0 (as seen at the 2 ends). Anywhere in between, even if Ygiven and Ypredicted are both the same (i.e. 0.5, etc), the error func will throw out a non-zero value. This is OK, as Ygiven is never anything other than 0 or 1. The respective terms, either -Log( Ypredicted ) or -Log( 1-Ypredicted ), take over when Ygiven = 1 and Ygiven = 0 respectively. This takes the error value to very large numbers (to infinity if we predict the total opposite of what Y is supposed to be). So, the algorithm will try to stay away from predicting totally opposite values. This is exactly how we wanted our error func to behave.

The above eqn is what we use in all logistic regression as our error function that we try to minimize. We calculate the error for each sample using the above eqn, sum them up, and try to minimize that sum. For logistic regression, we call this cost approach "maximum likelihood" (minimizing this sum is the same as maximizing the likelihood of the observed data), instead of "residual squares".
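A minimal sketch of this cross-entropy error over a few made-up samples:

import numpy as np

y_given = np.array([1, 0, 1, 1, 0])                  # actual labels (always 0 or 1)
y_pred  = np.array([0.9, 0.2, 0.7, 0.6, 0.1])        # made-up predicted probabilities P(Y=1)

# per-sample cross-entropy error, then summed (this is what we minimize)
per_sample = -(y_given * np.log(y_pred) + (1 - y_given) * np.log(1 - y_pred))
total_error = np.sum(per_sample)

print(per_sample)
print(total_error)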

 


 

AI regression example

m pictures with 1 pixel value each. Let's say we have m pictures, each with 1 pixel (each pixel has a value from 0 to 255, representing 256 possible colors), and we try to predict each picture's popularity based on that pixel value. So, on the X axis we will have these pixel values, and on the Y axis, their popularity number. We can do simple linear regr, and plot a best fit line: Y=mX+b.

However, if we have 2 pixel values for each pic, then this becomes Multiple linear regr, and the best fit plane becomes Y=m0X0 + m1X1 + b. Similarly, if we had n pixel values, then we would have Y=m0X0 + m1X1 + ... + mn-1Xn-1 + b as the best fit plane. This is exactly what we do in AI in finding the best fit. We call these slopes (m) weights: w0, w1, ... and so on.
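A minimal sketch of this prediction with a weights vector (the pixel values, weights and bias are made up):

import numpy as np

pixels  = np.array([12.0, 200.0, 55.0, 90.0])     # made-up pixel values for one picture
weights = np.array([0.01, -0.02, 0.03, 0.005])    # made-up weights w0, w1, ...
b = 1.5                                           # made-up intercept (bias)

# predicted popularity: weighted sum of pixel values plus bias
y_pred = np.dot(weights, pixels) + b
print(y_pred)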

Gradient Descent (GD):

This method is used to find plane with best fit. A very good video about gradient descent is here: https://www.youtube.com/watch?v=sDv4f4s2SB8

Finding these weights or slopes to minimize the error when fitting the plane to the data is a difficult problem. However, calculus comes to our rescue here, and gives us the "gradient descent" method, that allows us to find such a plane (weights w0, w1, ...) so that the total error across all X is minimized. It works amazingly well (like magic) !!

Andrew Ng's course on coursera.org called "Supervised Machine Learning: Regression and Classification" talks about gradient descent, and has labs on it. Try it if you want to know the basics of Gradient Descent. Instead of doing gradient descent, we could also just take the derivative of the cost function and set it to 0 to find the minima. GD lets you see what's going on. Also, computers generally can't symbolically solve the derivative of the cost function and set it to 0; instead, they can always do GD to get to a point where the derivative is close to 0. That's why we learn and implement GD in computer programs.
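A minimal sketch of gradient descent fitting Y = m*X + b by minimizing MSE (the data, learning rate and step count are all made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # made-up X data
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])      # made-up Y data (roughly 2*X)

m, b = 0.0, 0.0          # start with an arbitrary guess
lr = 0.01                # learning rate (step size)

for _ in range(5000):
    y_pred = m * x + b
    error = y_pred - y
    # gradients of MSE with respect to m and b
    dm = 2 * np.mean(error * x)
    db = 2 * np.mean(error)
    # step downhill along the gradient
    m -= lr * dm
    b -= lr * db

print(m, b)              # should end up close to the least-squares fit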