python - scikit-learn

scikit-learn:

scikit-learn is an open-source machine learning library for Python. It is built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and many volunteers have contributed since then. scikit-learn is also known as sklearn (the name used when importing it). It provides simple and efficient tools for data mining and data analysis, and supports both supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.

Official website:

https://scikit-learn.org/stable/

Install on CentOS 7:

scikit-learn requires:

  • Python (>= 3.6)
  • NumPy (>= 1.13.3)
  • SciPy (>= 0.19.1)
  • joblib (>= 0.11)
  • threadpoolctl (>= 2.0.0)

Scikit-learn's plotting capabilities (i.e., functions whose names start with plot_ and classes whose names end with “Display”) require Matplotlib (>= 2.1.1). So, before you install scikit-learn, you should have NumPy and SciPy installed, plus Matplotlib if you want plotting. pip will install NumPy, SciPy and the other required modules for you if it finds them missing.
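Before installing, it can help to see which of these dependencies are already present and whether they meet the minimum versions listed above. The following is only a minimal sketch and not part of scikit-learn: the helper names parse_version and check_requirements are hypothetical, and the version parsing is deliberately simple (it only looks at the leading numeric parts of each version string).

```python
import importlib
import re

# Minimum versions taken from the requirement list above.
MINIMUMS = {
    "numpy": "1.13.3",
    "scipy": "0.19.1",
    "joblib": "0.11",
    "threadpoolctl": "2.0.0",
    "matplotlib": "2.1.1",  # only needed for plotting
}

def parse_version(text):
    """Turn '1.19.2' or '1.5.0rc1' into a tuple of ints such as (1, 19, 2)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad short versions so that '0.11' compares like '0.11.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)

def check_requirements(minimums):
    """Return {module: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = (None, False)
            continue
        installed = getattr(module, "__version__", "0")
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = (installed, ok)
    return report

for name, (installed, ok) in check_requirements(MINIMUMS).items():
    status = "OK" if ok else "missing or too old"
    print(name, installed, status)
```

Run it with the same interpreter you plan to install into (python3.6 here), since each Python version has its own site-packages dir.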

Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 0.23 and later require Python 3.6 or newer. Since we will work with python3.6, we'll install scikit-learn 0.23.2, which is the latest version at the time of writing.

Cmd: run the cmd below in a Linux terminal:

sudo python3.6 -m pip install -U scikit-learn

Screen messages:

We see the following on screen: pip first downloads scikit-learn-0.23.2, then it looks for numpy version >= 1.13.3, scipy version >= 0.19.1, and a few other python modules. It downloads the ones that are needed. It uninstalls the ones that are older and replaces them with newer versions. As an example, below I had numpy-1.19.1 installed, but the latest available version was numpy-1.19.2, so pip uninstalled the older version and replaced it with the newer one.

  Downloading https://files.pythonhosted.org/packages/5c/a1/273def87037a7fb010512bbc5901c31cfddfca8080bc63b42b26e3cc55b3/scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)

Collecting numpy>=1.13.3 (from scikit-learn)
  Downloading https://files.pythonhosted.org/packages/b8/e5/a64ef44a85397ba3c377f6be9c02f3cb3e18023f8c89850dd319e7945521/numpy-1.19.2-cp36-cp36m-manylinux1_x86_64.whl (13.4MB)

Collecting scipy>=0.13.3 (from scikit-learn)
  Using cached https://files.pythonhosted.org/packages/14/92/56dbfe01a2fc795ec92b623cb39654a10b1e9053db594f4ceed6fd6d4930/scipy-1.2.3-cp34-cp34m-manylinux1_x86_64.

Requirement already up-to-date: scipy>=0.19.1 in /usr/local/lib64/python3.6/site-packages (from scikit-learn)
Installing collected packages: joblib, numpy, threadpoolctl, scikit-learn
  Found existing installation: numpy 1.19.1
    Uninstalling numpy-1.19.1:
      Successfully uninstalled numpy-1.19.1
Successfully installed joblib-0.16.0 numpy-1.19.2 scikit-learn-0.23.2 threadpoolctl-2.1.0

Once we see the above success message, scikit-learn is installed on your system. As explained in the "modules" section, if the module was installed correctly, we will see it in the dir below for python3.6:

/usr/local/lib64/python3.6/site-packages/sklearn => This is the scikit-learn dir. We also see a scikit-learn.libs dir, which has *.so files (shared object libraries), and a scikit_learn-0.23.2.dist-info dir, which has all the distribution info.

In order to check your installation and to see which version and where scikit-learn is installed, use below cmd:

> python3.6 -m pip show scikit-learn => It gives below o/p showing scikit-learn version 0.23.2 is installed


Name: scikit-learn
Version: 0.23.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location: /usr/local/lib64/python3.6/site-packages
Requires: joblib, threadpoolctl, numpy, scipy
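The same check can also be done from inside Python, which confirms that the interpreter you are running can actually import the module. This is a minimal sketch; describe_module is a hypothetical helper name, and it relies only on the __version__ and __file__ attributes that most modules expose.

```python
import importlib

def describe_module(name):
    """Return (version, file path) for an importable module, or None if it is missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", "unknown")
    location = getattr(module, "__file__", "unknown")
    return version, location

info = describe_module("sklearn")  # the import name is sklearn, not scikit-learn
if info is None:
    print("scikit-learn is not importable from this interpreter")
else:
    print("scikit-learn version:", info[0])
    print("loaded from:", info[1])
```

If the reported location is not under the python3.6 site-packages dir shown above, the script is probably being run by a different Python version.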

Error Messages:

As explained in the "modules" section, if we just type "pip install -U scikit-learn", we may get multiple errors (files not found, etc.) because we are not running the pip that belongs to python 3.6. You may get any of the errors shown below. (Note that even though python3 is soft linked to python3.6, the cmds below keep using python3.4. So, it's best to run pip via python3.6 as explained above, and you will get a smooth installation.)

numpy errors running with python3.4

Building wheels for collected packages: numpy
  Running setup.py bdist_wheel for numpy ... error
  Complete output from command /usr/bin/python3.4 ....

multiple gcc compile errors

  gcc -pthread _configtest.o -o _configtest
  _configtest.o: In function `main':
  /tmp/pip-install-r7v7kemj/numpy/_configtest.c:6: undefined reference to `exp'
  collect2: error: ld returned 1 exit status

  gcc: _configtest.c
  _configtest.c:1:20: fatal error: Python.h: No such file or directory
   #include <Python.h>
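One way to avoid picking up the wrong pip is to build the install cmd from the interpreter you actually intend to use. The snippet below is a small sketch of that idea; pip_install_command is a hypothetical helper name, and it only prints the cmd rather than running it.

```python
import sys

def pip_install_command(package):
    """Build a pip command tied to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", "-U", package]

print("Interpreter:", sys.executable)
print("Python version: %d.%d" % sys.version_info[:2])
print("Command:", " ".join(pip_install_command("scikit-learn")))
```

Running this script with python3.6 prints a cmd that starts with the python3.6 executable path, so pip installs into that version's site-packages rather than python3.4's.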

 

Usage:

import sklearn: We need to first import sklearn and other modules in any python pgm. These are the imported modules:

import numpy as np
import matplotlib.pyplot as plt
import sklearn

linear model: sklearn has built in regression models to find best fit for given data. More details here:

https://scikit-learn.org/stable/modules/linear_model.html

Linear Regression:

Here (X,y) data is fitted using weight coefficients. y may be a single target, or multiple targets (i.e. y0, y1, etc. that we are trying to fit simultaneously). Usually y is a single target for our purposes. Linear regression fits a linear model that minimizes the sum of squared errors. LinearRegression takes arrays X, y in its fit method, and stores the coefficients of the linear model in its coef_ member and the bias (or intercept) in its intercept_ member. When y is a single target, coef_ is a 1D ndarray of shape (num_of_features,), while intercept_ is just a float. When y has multiple targets, coef_ is a 2D ndarray of shape (num_of_targets, num_of_features), while intercept_ is a 1D array of shape (num_of_targets,). The fit(X,y) method takes in 2 arrays: X is a 2D array of shape (num_of_samples, num_of_features), with one column per attribute X0, X1, and so on, and y is a 1D array of output values (or a 2D array of shape (num_of_samples, num_of_targets) for multiple targets).

Ex: This tries to fit data (X,y) using Linear regression. y=m0X0 + m1X1 + b

from sklearn import linear_model
reg = linear_model.LinearRegression() #reg is an instance of LinearRegression class (See Object Oriented pgm)
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]) #Here X has 2 attr X0, X1, and for each [X0, X1] we have y. So, for X=[0,0], y=0. Similarly for X=[1,1], y=1 and so on.
print(reg.coef_, reg.intercept_) #prints [0.5 0.5] and an intercept very close to 0 (on the order of 1e-16). These are the 2 coeff m0 and m1, and the intercept b, that best fit the data. So, y=0.5*X0 + 0.5*X1 + b for best fit. b is close to 0 (ideally it would be exactly 0, but floating-point rounding leaves a tiny residue). Here coef_ is a 1D array while intercept_ is a float, as expected
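The shapes described above for single and multiple targets can be checked directly. The following is a small sketch: the multi-target array Y is made up for illustration (its second column is just twice the first), and the comments state the shapes described in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0, 0], [1, 1], [2, 2]])

# Single target: y is 1D, one value per sample.
y = np.array([0, 1, 2])
reg = LinearRegression().fit(X, y)
print(reg.coef_.shape)          # (2,)  -> one coefficient per feature
print(np.ndim(reg.intercept_))  # 0     -> a scalar
print(reg.predict([[3, 3]]))    # prediction for a new sample

# Multiple targets: Y is 2D, one column per target (made-up data).
Y = np.array([[0, 0], [1, 2], [2, 4]])
multi = LinearRegression().fit(X, Y)
print(multi.coef_.shape)        # (2, 2) -> (num_of_targets, num_of_features)
print(multi.intercept_.shape)   # (2,)   -> one intercept per target
```

Note that predict also expects a 2D array, even for a single sample, which is why the new point is written as [[3, 3]].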

 

Logistic Regression:

This is implemented in the LogisticRegression() class. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L1, L2, or Elastic-Net regularization. The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”.

LogisticRegressionCV implements Logistic Regression with built-in cross-validation support, to find the optimal C and l1_ratio parameters according to the scoring attribute.
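As a rough illustration of the regularization and solver options, the sketch below fits the same data with three penalties and then lets LogisticRegressionCV choose C. The data is synthetic and made up for this example, and the solver paired with each penalty is one choice that accepts that penalty, not the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Synthetic data (an assumption for illustration): 200 samples, 2 features,
# label 1 when a noisy sum of the features is positive.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
Y = (X[:, 0] + X[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

# Not every solver supports every penalty, so each penalty is paired
# with a solver that accepts it.
models = {
    "l2": LogisticRegression(penalty="l2", solver="lbfgs"),
    "l1": LogisticRegression(penalty="l1", solver="liblinear"),
    "elasticnet": LogisticRegression(penalty="elasticnet", solver="saga",
                                     l1_ratio=0.5, max_iter=5000),
}
for name, model in models.items():
    model.fit(X, Y)
    print(name, model.coef_, model.intercept_)

# LogisticRegressionCV searches over C by cross-validation;
# the selected value is stored in the C_ attribute.
cv_clf = LogisticRegressionCV(cv=5, max_iter=1000).fit(X, Y)
print("chosen C:", cv_clf.C_)
```

Smaller C means stronger regularization, so the value in C_ gives a feel for how much shrinkage the cross-validation preferred on this data.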

Ex: It fits data (X,Y) using Logistic Regression, where Y=0 or 1 for any given X. X is a 2D array, while Y is a 1D array, same as in the previous linear regression example. The difference is that coef_ and intercept_ now have different shapes: coef_ is a 2D ndarray of shape (1, num_of_features), while intercept_ is a 1D array of shape (1,). The extra dimension is there because the same class also handles more than 2 classes, in which case coef_ has shape (num_of_classes, num_of_features) and intercept_ has shape (num_of_classes,); the binary case is stored as a single row. The m and b values they contain are still in the same style as in linear regression.

import sklearn.linear_model #import sklearn on its own may not load the linear_model submodule
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X, Y)

print(clf.coef_, clf.intercept_) => prints the coefficient matrix + bias (intercept) for the model that fits this data closest. Prints something like: coef_=[[ 0.02783873 -0.20163637]] intercept_=[0.01543046]. NOTE: coef_ is a 2D array, while intercept_ is a 1D array (different from Linear Regression)

LR_predictions = clf.predict(X) => We can use predict method and apply it on original X dataset to see what predicted Y array it gives out. coefficients stored in clf.coef_ are used for predict method.

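LR_predict_probability = clf.predict_proba(X) => This shows the class probabilities for each example in the X dataset. Each row is a pair whose columns follow the order of clf.classes_: with labels 0 and 1, the 1st num is the probability that Y=0, while the 2nd num is the probability that Y=1. The two numbers in a row add up to 1.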
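Putting the pieces together, here is a self-contained sketch of fit, predict and predict_proba. The dataset is synthetic and made up for illustration, since the X and Y used above are not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data (an assumption for illustration), labels are 0 or 1.
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
Y = (X[:, 0] - X[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

clf = LogisticRegressionCV(cv=5, max_iter=1000).fit(X, Y)

print(clf.coef_.shape, clf.intercept_.shape)  # (1, 2) (1,)
print(clf.classes_)                           # column order used by predict_proba

proba = clf.predict_proba(X[:5])  # one row per sample, one column per class
labels = clf.predict(X[:5])       # the class with the larger probability
for row, label in zip(proba, labels):
    print("P(class 0)=%.3f  P(class 1)=%.3f  predicted=%d" % (row[0], row[1], label))
```

Each predicted label is the class whose column holds the larger probability, which is an easy way to cross-check the two methods against each other.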