{ "cells": [ { "cell_type": "markdown", "id": "de6912c1-1f92-4b34-85a2-fcbf4cfa4ec8", "metadata": {}, "source": [ "# Scikit Learn\n", "\n", "## Installation\n", "\n", "https://scikit-learn.org/stable/install.html#installation-instructions\n", "\n", "## Overview \n", "\n", "Scikit Learn has modules for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. We have already seen preprocessing and dimensionality reduction examples when we looked at [PCA](../unit2/pca).\n", "\n", "Practice (follwoing, https://scikit-learn.org/stable/getting_started.html)" ] }, { "cell_type": "code", "execution_count": 1, "id": "300e29ed-a200-4967-963b-a4122219ba8d", "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 2, "id": "53584565-fbd6-4d6f-bae2-d204ca89afe2", "metadata": {}, "outputs": [], "source": [ "clf = RandomForestClassifier(random_state=0)\n", "X = [[ 1, 2, 3], # 2 samples, 3 features\n", " [11, 12, 13]]\n", "y = [0, 1] # classes of each sample" ] }, { "cell_type": "code", "execution_count": 3, "id": "049fe409-3f15-4adc-8b3e-6da4ebdc9aae", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomForestClassifier(random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "RandomForestClassifier(random_state=0)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 4, "id": "449d209b-91a1-4e26-8bf4-a85a8af79e06", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "\u001b[1;35marray\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.predict(X)  # predict classes of the training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "eb153439-0730-41ce-9120-69d14e0ff1d7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[1;35marray\u001b[0m\u001b[1m(\u001b[0m\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.predict([[4, 5, 6], [14, 15, 16]])  # predict classes of new data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39dc0205-9e72-4b1e-86b9-a8d047f0a0d3",
   "metadata": {},
   "source": [
    "Note that we are able to see that `[4, 5, 6]` is more similar to `[1,2,3]` than `[11, 12, 13]` and therefore gets labeled `0`.\n",
    "\n",
    "## Train Test Split\n",
    "\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html\n",
    "\n",
    "Above we needed a training (or fitting) data set along with a testing (or predicting) data set. Scikit Learn has a method to help split a single data set into these two groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a629410b-ce87-4039-9842-f5bd1379da29",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1]\n",
      " [2 3]\n",
      " [4 5]\n",
      " [6 7]\n",
      " [8 9]]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\u001b[1m[\u001b[0m\u001b[1;36m0\u001b[0m, \u001b[1;36m1\u001b[0m, \u001b[1;36m2\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m4\u001b[0m\u001b[1m]\u001b[0m"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "X, y = np.arange(10).reshape((5, 2)), range(5)\n",
    "print(X)\n",
    "list(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "b168e235-cb06-46a6-a357-f53e970e2652",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[4 5]\n",
      " [0 1]\n",
      " [6 7]] [2, 0, 3] [[2 3]\n",
      " [8 9]] [1, 4]\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.33, random_state=42)\n",
    "print(X_train, y_train,\n",
    "      X_test, y_test)"
   ]
  },
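  {
   "cell_type": "markdown",
   "id": "3f0c1a2b-5d6e-4f70-8a91-b2c3d4e5f601",
   "metadata": {},
   "source": [
    "As a side note, `train_test_split` also takes a `stratify` argument that keeps the class proportions roughly equal in both splits, which is often what you want for classification. A minimal sketch (not executed here) using the iris data; it would not work on the tiny example above, where every label appears only once:\n",
    "\n",
    "```python\n",
    "# Minimal sketch: stratified split so train and test keep the class balance.\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X_iris, y_iris = load_iris(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X_iris, y_iris, test_size=0.25, stratify=y_iris, random_state=0)\n",
    "```"
   ]
  },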
  {
   "cell_type": "markdown",
   "id": "039eb820-fd22-4fa6-b4a9-7aaff392644c",
   "metadata": {},
   "source": [
    "# Pipelines\n",
    "\n",
    "In machine learning, you will need to \n",
    "1. Load data\n",
    "2. Reject outliers/clean data\n",
    "3. Create training and testing sub-sets\n",
    "4. Train\n",
    "5. Test\n",
    "6. Itterate 4 & 5 to find the optimal model hyper-parameters.\n",
    "\n",
    "The `Training` step can have many substeps. In a simple example of no-outliers, it can include any data cleaning/rescaling. It would be benifitial to create an object that does all these steps first to fit the model parameters (4) then apply the model prameters as predictions (5). Scikit Learn has a `pipeline` module for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "7d630a10-7fbe-4326-a00e-bbd35c7010b5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[1;36m0.9736842105263158\u001b[0m"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.datasets import load_iris\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "# create a pipeline object\n",
    "pipe = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    LogisticRegression()\n",
    ")\n",
    "# load the iris dataset and split it into train and test sets\n",
    "X, y = load_iris(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
    "# fit the whole pipeline\n",
    "pipe.fit(X_train, y_train)\n",
    "# we can now use it like any other estimator\n",
    "accuracy_score(pipe.predict(X_test), y_test)"
   ]
  },
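  {
   "cell_type": "markdown",
   "id": "b7e4d2c9-8f10-4a3b-9c5d-6e7f80912345",
   "metadata": {},
   "source": [
    "The same pipeline can also be written with explicitly named steps via the `Pipeline` class, which makes it easier to refer to individual steps later (for example when searching over their parameters). A minimal sketch of an equivalent construction (not executed here):\n",
    "\n",
    "```python\n",
    "# Minimal sketch: the same two steps, but with explicit names.\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "pipe = Pipeline([\n",
    "    (\"scale\", StandardScaler()),\n",
    "    (\"logreg\", LogisticRegression()),\n",
    "])\n",
    "\n",
    "# pipe.fit(X_train, y_train) works as above, and for a classifier\n",
    "# pipe.score(X_test, y_test) returns the accuracy, matching accuracy_score.\n",
    "```"
   ]
  },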
  {
   "cell_type": "markdown",
   "id": "ade9641c-a35f-4c82-92eb-b4549301f58f",
   "metadata": {},
   "source": [
    "See how the scaling and logistic regression are both done the same way when fitting and predicting. This is more important when your process increases in complexity.\n",
    "\n",
    "## Model Hyper-parameter Optimization \n",
    "\n",
    "Scikit Learn also has a module for step 6, optimizing the hyper parameters. You can see in this example that `RandomizedSearchCV` will iterate through the distributions given in `param_distributions` to look for an optimal set of parameters. Rembember that `randint` is an object and not simply a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "6d711fa6-e914-4a15-a4c2-aedae5522049",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'max_depth': 9, 'n_estimators': 4}\n",
      "0.7353489874098169\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "from sklearn.model_selection import train_test_split\n",
    "from scipy.stats import randint\n",
    "X, y = fetch_california_housing(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
    "# define the parameter space that will be searched over\n",
    "param_distributions = {'n_estimators': randint(1, 5),\n",
    "                       'max_depth': randint(5, 10)}\n",
    "# now create a searchCV object and fit it to the data\n",
    "search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),\n",
    "                            n_iter=5,\n",
    "                            param_distributions=param_distributions,\n",
    "                            random_state=0)\n",
    "search.fit(X_train, y_train)\n",
    "print(search.best_params_)\n",
    "\n",
    "# the search object now acts like a normal random forest estimator\n",
    "# with max_depth=9 and n_estimators=4\n",
    "print(search.score(X_test, y_test))"
   ]
  },
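  {
   "cell_type": "markdown",
   "id": "c1d2e3f4-0a1b-4c2d-8e3f-405162738495",
   "metadata": {},
   "source": [
    "When the parameter space is small enough to enumerate, `GridSearchCV` tries every combination instead of sampling. A minimal sketch of the same search written as an explicit grid (not executed here):\n",
    "\n",
    "```python\n",
    "# Minimal sketch: exhaustive search over an explicit grid of values.\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "param_grid = {'n_estimators': [1, 2, 3, 4],\n",
    "              'max_depth': [5, 6, 7, 8, 9]}\n",
    "search = GridSearchCV(estimator=RandomForestRegressor(random_state=0),\n",
    "                      param_grid=param_grid)\n",
    "# search.fit(X_train, y_train), search.best_params_ and\n",
    "# search.score(X_test, y_test) are then used exactly as above.\n",
    "```"
   ]
  },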
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "25ad0a89-0ce2-40a7-acc3-3e2d89a56f04",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\u001b[1;36m3\u001b[0m"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "randint(1,5).rvs()"
   ]
  },
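  {
   "cell_type": "markdown",
   "id": "d4e5f607-1829-4a3b-9c0d-5e6f70819234",
   "metadata": {},
   "source": [
    "`randint(1, 5)` is a frozen scipy distribution over the integers 1 through 4 (the upper bound is exclusive), and `RandomizedSearchCV` draws its candidate values by sampling from it. A minimal sketch of drawing several samples at once:\n",
    "\n",
    "```python\n",
    "# Minimal sketch: draw ten samples from the same distribution object.\n",
    "from scipy.stats import randint\n",
    "\n",
    "randint(1, 5).rvs(size=10, random_state=0)  # ten integers, each in 1..4\n",
    "```"
   ]
  },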
  {
   "cell_type": "markdown",
   "id": "7ab694ce-9256-4258-8c04-eb50de299972",
   "metadata": {},
   "source": [
    "## Further Reading \n",
    "\n",
    "* Extra example - https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#sphx-glr-auto-examples-classification-plot-digits-classification-py"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "510d1c3a-4b74-4d0f-87a3-199e8b980e02",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last updated: Fri May 02 2025 14:39:53CDT\n",
      "\n",
      "Python implementation: CPython\n",
      "Python version       : 3.12.10\n",
      "IPython version      : 9.2.0\n",
      "\n",
      "Compiler    : Clang 16.0.0 (clang-1600.0.26.6)\n",
      "OS          : Darwin\n",
      "Release     : 24.4.0\n",
      "Machine     : arm64\n",
      "Processor   : arm\n",
      "CPU cores   : 12\n",
      "Architecture: 64bit\n",
      "\n",
      "rich      : 14.0.0\n",
      "sklearn   : 1.6.1\n",
      "numpy     : 2.1.3\n",
      "matplotlib: 3.10.1\n",
      "pandas    : 2.2.3\n",
      "\n",
      "Watermark: 2.5.0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -untzvm -iv -w"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}