23. Scikit Learn#

23.1. Installation#

https://scikit-learn.org/stable/install.html#installation-instructions

23.2. Overview#

Scikit Learn has modules for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. We have already seen preprocessing and dimensionality reduction examples when we looked at PCA.
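As a reminder of how the preprocessing and dimensionality reduction modules looked when we covered PCA, here is a minimal sketch (the iris data set is used here purely for illustration; it was not necessarily the data used in that earlier example):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
X, _ = load_iris(return_X_y=True)
# rescale each feature to zero mean and unit variance (preprocessing)
X_scaled = StandardScaler().fit_transform(X)
# project onto the first two principal components (dimensionality reduction)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print(X_2d.shape)  # (150, 2)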

Practice (following https://scikit-learn.org/stable/getting_started.html)

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)
RandomForestClassifier(random_state=0)
clf.predict(X)  # predict classes of the training data

array([0, 1])
clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data

array([0, 1])

Note that [4, 5, 6] is more similar to [1, 2, 3] than to [11, 12, 13], and therefore gets labeled 0.
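If you want to see how confident the forest is about those labels, RandomForestClassifier also exposes predict_proba; this short follow-up is our own addition rather than part of the Getting Started example:

print(clf.predict_proba([[4, 5, 6], [14, 15, 16]]))
# each row holds the estimated probabilities for classes 0 and 1;
# [4, 5, 6] should place most of its weight on class 0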

23.3. Train Test Split#

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Above we needed a training (or fitting) data set along with a testing (or predicting) data set. Scikit Learn has a method to help split a single data set into these two groups.

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)
print(X)
list(y)
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

[0, 1, 2, 3, 4]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print(X_train, y_train,
      X_test, y_test)
[[4 5]
 [0 1]
 [6 7]] [2, 0, 3] [[2 3]
 [8 9]] [1, 4]
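By default train_test_split shuffles the rows before splitting. For classification problems you can also pass stratify so that each class appears in both subsets in roughly its original proportion; a small sketch on the iris data (our own example, assuming the imports above):

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
# stratify=y preserves the class proportions of y in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
print(len(X_train), len(X_test))  # roughly a 2:1 split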

24. Pipelines#

In machine learning, you will need to

  1. Load data

  2. Reject outliers/clean data

  3. Create training and testing sub-sets

  4. Train

  5. Test

  6. Iterate 4 & 5 to find the optimal model hyper-parameters.

The training step can have many substeps. Even in a simple example with no outliers, it can include data cleaning and rescaling. It would be beneficial to create an object that performs all of these steps together, first fitting the model parameters (4) and then applying the fitted model to make predictions (5). Scikit Learn has a pipeline module for this.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# create a pipeline object
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression()
)
# load the iris dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# fit the whole pipeline
pipe.fit(X_train, y_train)
# we can now use it like any other estimator
accuracy_score(pipe.predict(X_test), y_test)

0.9736842105263158

See how the scaling and the logistic regression are applied in the same way when fitting and when predicting. This becomes more important as your process grows in complexity.
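If you need to inspect an individual step after fitting, make_pipeline stores each step under a lowercased version of its class name; a brief sketch, continuing from the pipe fitted above:

# each step is available under a name derived from its class
scaler = pipe.named_steps['standardscaler']
print(scaler.mean_)  # per-feature means learned from X_train only
# the fitted classifier can be pulled out the same way
print(pipe.named_steps['logisticregression'].coef_.shape)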

24.1. Model Hyper-parameter Optimization#

Scikit Learn also has a module for step 6, optimizing the hyper-parameters. You can see in this example that RandomizedSearchCV will sample from the distributions given in param_distributions to look for an optimal set of parameters. Remember that randint is a distribution object and not simply a list.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from scipy.stats import randint
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# define the parameter space that will be searched over
param_distributions = {'n_estimators': randint(1, 5),
                       'max_depth': randint(5, 10)}
# now create a searchCV object and fit it to the data
search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
                            n_iter=5,
                            param_distributions=param_distributions,
                            random_state=0)
search.fit(X_train, y_train)
print(search.best_params_)

# the search object now acts like a normal random forest estimator
# with max_depth=9 and n_estimators=4
print(search.score(X_test, y_test))
{'max_depth': 9, 'n_estimators': 4}
0.7353489874098169
randint(1, 5).rvs()

3
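The fitted search object also keeps the refit model and the cross-validation results, which is handy when you want more than just best_params_; a short sketch continuing from the search above:

# best_estimator_ is the RandomForestRegressor refit on all of X_train
# using the winning hyper-parameters
best_rf = search.best_estimator_
print(best_rf.n_estimators, best_rf.max_depth)
# cv_results_ is a dict of per-candidate parameter settings and scores
print(search.cv_results_['mean_test_score'])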

24.2. Further Reading#

%load_ext watermark
%watermark -untzvm -iv -w
Last updated: Fri May 02 2025 14:39:53CDT

Python implementation: CPython
Python version       : 3.12.10
IPython version      : 9.2.0

Compiler    : Clang 16.0.0 (clang-1600.0.26.6)
OS          : Darwin
Release     : 24.4.0
Machine     : arm64
Processor   : arm
CPU cores   : 12
Architecture: 64bit

rich      : 14.0.0
sklearn   : 1.6.1
numpy     : 2.1.3
matplotlib: 3.10.1
pandas    : 2.2.3

Watermark: 2.5.0