python - scikit-learn
- Last Updated: Sunday, 04 October 2020 15:53
- Published: Thursday, 17 September 2020 19:26
scikit-learn:
It's an open source machine learning library for Python. It's built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. scikit-learn is also known as sklearn (the name used when importing it) and provides simple and efficient tools for data mining and data analysis. It supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities.
Official website is:
https://scikit-learn.org/stable/
Install on CentOS 7:
scikit-learn requires:
- Python (>= 3.6)
- NumPy (>= 1.13.3)
- SciPy (>= 0.19.1)
- joblib (>= 0.11)
- threadpoolctl (>= 2.0.0)
Scikit-learn plotting capabilities (i.e., functions that start with plot_ and classes that end with “Display”) require Matplotlib (>= 2.1.1). When you install scikit-learn, pip installs any of the required modules listed above (NumPy, SciPy, joblib, threadpoolctl) that it finds missing or too old. Matplotlib is optional, so it is not pulled in automatically and has to be installed separately if you want plotting.
Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4. scikit-learn 0.23 and later require Python 3.6 or newer. As we will work with python3.6, we'll install scikit-learn 0.23.2, which is the latest version at the time of writing.
Cmd: run below cmd on Linux Terminal:
sudo python3.6 -m pip install -U scikit-learn
Screen messages:
We see following on screen: It first downloads scikit-learn-0.23.2, then it checks for the required numpy and scipy versions (numpy >= 1.13.3, scipy >= 0.19.1) and a few other python modules. It downloads the ones that are needed. It uninstalls the ones that are older and replaces them with newer versions. As an ex, below I had numpy-1.19.1 installed, but the latest version available was numpy-1.19.2, so it uninstalled the older version and replaced it with the newer one.
Downloading https://files.pythonhosted.org/packages/5c/a1/273def87037a7fb010512bbc5901c31cfddfca8080bc63b42b26e3cc55b3/scikit_learn-0.23.2-cp36-cp36m-manylinux1_x86_64.whl (6.8MB)
Collecting numpy>=1.13.3 (from scikit-learn)
Downloading https://files.pythonhosted.org/packages/b8/e5/a64ef44a85397ba3c377f6be9c02f3cb3e18023f8c89850dd319e7945521/numpy-1.19.2-cp36-cp36m-manylinux1_x86_64.whl (13.4MB)
Collecting scipy>=0.13.3 (from scikit-learn)
Using cached https://files.pythonhosted.org/packages/14/92/56dbfe01a2fc795ec92b623cb39654a10b1e9053db594f4ceed6fd6d4930/scipy-1.2.3-cp34-cp34m-manylinux1_x86_64.
Requirement already up-to-date: scipy>=0.19.1 in /usr/local/lib64/python3.6/site-packages (from scikit-learn)
Installing collected packages: joblib, numpy, threadpoolctl, scikit-learn
Found existing installation: numpy 1.19.1
Uninstalling numpy-1.19.1:
Successfully uninstalled numpy-1.19.1
Successfully installed joblib-0.16.0 numpy-1.19.2 scikit-learn-0.23.2 threadpoolctl-2.1.0
Once we see the above success message, scikit-learn is installed on your system. As explained in "modules" section, if the module gets installed correctly, we will see the module in below dir for python3.6:
/usr/local/lib64/python3.6/site-packages/sklearn => This is the scikit-learn dir. We also see a scikit-learn.libs dir which has *.so file (shared object library) and a scikit_learn-0.23.2.dist-info dir, which has all distribution info.
In order to check your installation and to see which version and where scikit-learn is installed, use below cmd:
> python3.6 -m pip show scikit-learn => It gives below o/p showing scikit-learn version 0.23.2 is installed
Name: scikit-learn
Version: 0.23.2
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-learn.org
Author: None
Author-email: None
License: new BSD
Location: /usr/local/lib64/python3.6/site-packages
Requires: joblib, threadpoolctl, numpy, scipy
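The same check can also be done from inside Python. Below is a minimal sketch, assuming scikit-learn is importable by the interpreter you run it with; the variable names (version, install_dir) are just for illustration.

```python
import os
import sklearn

# Version string of the package that actually got imported.
version = sklearn.__version__

# Directory the package was loaded from (the "sklearn" dir under site-packages).
install_dir = os.path.dirname(sklearn.__file__)

print("scikit-learn version:", version)
print("installed in:", install_dir)
```

If the version or location printed here differs from what pip show reports, the script is most likely being run by a different python than the one pip installed into.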
Error Messages:
As explained in "modules" section, if we just type "pip install -U scikit-learn", we'll get multiple errors (files not found, etc) as we are not running the right version of pip for python 3.6. You may get any of the errors shown below: (note that even though python3 is soft linked to python3.6, below cmds keep using python3.4. So, it's best to run pip with python3.6 as explained above, and you will get a smooth installation)
numpy errors running with python3.4
Building wheels for collected packages: numpy
Running setup.py bdist_wheel for numpy ... error
Complete output from command /usr/bin/python3.4 ....
multiple gcc compile errors
gcc -pthread _configtest.o -o _configtest
_configtest.o: In function `main':
/tmp/pip-install-r7v7kemj/numpy/_configtest.c:6: undefined reference to `exp'
collect2: error: ld returned 1 exit status
gcc: _configtest.c
_configtest.c:1:20: fatal error: Python.h: No such file or directory
#include <Python.h>
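Since the errors above come from pip running under python3.4 instead of python3.6, it can help to confirm which interpreter is actually being used. A minimal sketch using only the standard library (the names interpreter_path and interpreter_version are illustrative):

```python
import sys

# Full path of the interpreter running this script, e.g. a python3.6 binary.
interpreter_path = sys.executable

# (major, minor, micro) of that interpreter.
interpreter_version = tuple(sys.version_info[:3])

print("running under:", interpreter_path)
print("python version:", ".".join(str(part) for part in interpreter_version))
```

Running this with the same python command you pass to "-m pip" shows which Python version that pip installs for.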
Usage:
import sklearn: We need to first import sklearn and other modules in any python pgm. These are the imported modules:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
linear model: sklearn has built in regression models to find best fit for given data. More details here:
https://scikit-learn.org/stable/modules/linear_model.html
Linear Regression:
Here (X,y) data is fitted using weight coefficients. Here y may be a single target, or y may be multiple targets (i.e y0, y1, etc that we are trying to fit simultaneously). Usually y is a single target for our purposes. Linear regression fits a linear model that minimizes the sum of squares of the error. LinearRegression takes arrays X, y in its fit method and stores the coefficients of the linear model in its coef_ member and the bias (or intercept) in its intercept_ member. When y is a single target, coef_ is 1D ndarray of shape(num_of_features,), while intercept_ is just a float number. When y is multiple targets, coef_ is 2D ndarray of shape(num_of_targets, num_of_features), while intercept_ is 1D array of shape(num_of_targets,). fit(X,y) method takes in 2 arrays, where X is 2D array of shape(num_of_samples, num_of_features), i.e each row holds the X attr X0, X1, and so on for one sample. For a single target, y is a 1D array of output values, one per sample.
Ex: This tries to fit data (X,y) using Linear regression. y=m0X0 + m1X1 + b
from sklearn import linear_model
reg = linear_model.LinearRegression() #reg is an instance of LinearRegression class (See Object Oriented pgm)
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]) #Here X has 2 attr X0, X1, and for each [X0, X1] we have y. So, for X=[0,0], y=0. Similarly for X=[1,1], y=1 and so on.
print(reg.coef_, reg.intercept_) => prints [0.5 0.5] and an intercept that is essentially 0 (a rounding-level value on the order of 1e-16). These are the 2 coeff m0 and m1, and intercept b that try to fit the data. So, y=0.5*X0 + 0.5*X1 + b for best fit. b is close to 0 (ideally it should be 0, but floating point arithmetic doesn't always give exact 0). Here coef_ is 1D array while intercept_ is a float as expected
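The shapes described above can be checked directly. Below is a minimal sketch: the single target part reuses the toy data from the example, while the multiple target data (X_multi, Y_multi) is random made-up data, used only to show the shapes and not a meaningful fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Single target: same toy data as in the example above.
X = np.array([[0, 0], [1, 1], [2, 2]])
y = np.array([0, 1, 2])
reg = LinearRegression().fit(X, y)
print(reg.coef_.shape, np.ndim(reg.intercept_))  # 1D coef_, scalar intercept_

# Multiple targets: made-up data with 3 features and 2 targets.
rng = np.random.default_rng(0)
X_multi = rng.normal(size=(20, 3))
Y_multi = rng.normal(size=(20, 2))
reg_multi = LinearRegression().fit(X_multi, Y_multi)
print(reg_multi.coef_.shape, reg_multi.intercept_.shape)  # 2D coef_, 1D intercept_

# predict() applies the stored coefficients and intercept to new rows of X.
y_new = reg.predict(np.array([[3, 3]]))
print(y_new)
```

The predict call at the end is not part of the example above; it is included only to show how the fitted m0, m1 and b get used on a new sample.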
Logistic Regression:
This is implemented in LogisticRegression() class. This implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L1, L2 or Elastic-Net regularization. The solvers implemented in the class LogisticRegression are “liblinear”, “newton-cg”, “lbfgs”, “sag” and “saga”.
LogisticRegressionCV implements Logistic Regression with built-in cross-validation support, to find the optimal C and l1_ratio parameters according to the scoring attribute.
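As a rough illustration of that search over C and l1_ratio, here is a minimal sketch on made-up 2 feature data. It assumes a scikit-learn version close to the one installed above (0.23.x); the data, the candidate values and the names best_C and best_l1_ratio are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Made-up data: two overlapping clusters, labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(30, 2)),
    rng.normal(1.0, 1.0, size=(30, 2)),
])
y = np.array([0] * 30 + [1] * 30)

# Elastic-Net needs the "saga" solver; Cs=5 tries 5 values of C on a log scale,
# and each one is combined with every l1_ratio candidate using 3-fold CV.
clf = LogisticRegressionCV(
    Cs=5,
    cv=3,
    penalty="elasticnet",
    solver="saga",
    l1_ratios=[0.1, 0.5, 0.9],
    scoring="accuracy",
    max_iter=5000,
)
clf.fit(X, y)

# Selected values; np.ravel() is used so this works for array-valued attributes.
best_C = float(np.ravel(clf.C_)[0])
best_l1_ratio = float(np.ravel(clf.l1_ratio_)[0])
print("best C:", best_C, "best l1_ratio:", best_l1_ratio)
```

With the default L2 penalty only C is searched, and the l1_ratios argument can be left out.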
ex: It fits data (X,Y) using Logistic Regression where Y=0 or 1 for any given X. X is 2D array, while Y is 1D array, same as in previous linear regression example. The difference is that coef_ and intercept_ now have different shapes: coef_ is 2D ndarray of shape(1, num_of_features), while intercept_ is 1D array of shape(1,). The extra dimension is there because the same class also handles multiclass problems, where coef_ has shape(num_of_classes, num_of_features) and intercept_ has shape(num_of_classes,); the binary case is simply stored as a single row. The data for m, b that they contain is still the same style as in linear regression.
from sklearn import linear_model
clf = linear_model.LogisticRegressionCV()
clf.fit(X, Y)
print(clf.coef_, clf.intercept_) => prints coefficient matrix + bias (intercept) for the model that fits this data closest. Prints something like: coeff=[[ 0.02783873 -0.20163637]] intercept=[0.01543046]. NOTE: coef_ is 2D array, while intercept_ is 1D array (different than Linear Regression)
LR_predictions = clf.predict(X) => We can use predict method and apply it on original X dataset to see what predicted Y array it gives out. The coefficients stored in clf.coef_ and the bias stored in clf.intercept_ are used by the predict method.
LR_predict_probability = clf.predict_proba(X) => This shows the class probabilities for each example in X dataset. Each row is a pair ordered as in clf.classes_, so with Y=0 or 1 the 1st num is the probability of class 0, while the 2nd num is the probability of class 1. The two nums in a row add up to 1.
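The snippet above assumes X and Y already exist. Below is a self-contained sketch of the same steps on made-up data (two overlapping clusters labelled 0 and 1); the data and the cv=3, max_iter=1000 settings are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Made-up data: 20 samples around (-1, -1) with Y=0, 20 around (1, 1) with Y=1.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(20, 2)),
    rng.normal(1.0, 1.0, size=(20, 2)),
])
Y = np.array([0] * 20 + [1] * 20)

clf = LogisticRegressionCV(cv=3, max_iter=1000).fit(X, Y)

# Binary case: coef_ has one row, intercept_ has one entry.
print(clf.coef_.shape, clf.intercept_.shape)

# Predicted labels and per-class probabilities for the training samples.
predictions = clf.predict(X)
probabilities = clf.predict_proba(X)

# Columns of predict_proba follow the order of clf.classes_.
print(clf.classes_)
print(probabilities[:3])
```

Each predicted label is the class whose column in predict_proba has the larger probability for that row.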