diff --git a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-tmp-v2.ipynb b/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-tmp-v2.ipynb deleted file mode 100644 index d7cd941d8..000000000 --- a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-tmp-v2.ipynb +++ /dev/null @@ -1,12801 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "TK - add Google Colab link as well as reference notebook etc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning \n", - "\n", - "In this notebook, we're going to work through an example machine learning project that uses the characteristics of bulldozers and their past sale prices to predict the sale price of future bulldozers.\n", - "\n", - "* **Inputs:** Bulldozer characteristics such as make year, base model, model series, state of sale (e.g. which US state it was sold in), drive system and more.\n", - "* **Outputs:** Bulldozer sale price (in USD).\n", - "\n", - "Since we're trying to predict a number, this kind of problem is known as a **regression problem**.\n", - "\n", - "And since we're going to be predicting results with a time component (predicting future sales based on past sales), this is also known as a **time series** or **forecasting** problem.\n", - "\n", - "The data and evaluation metric we'll be using (root mean squared log error or RMSLE) is from the [Kaggle Bluebook for Bulldozers competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview).\n", - "\n", - "The techniques used here are inspired by and adapted from [the fast.ai machine learning course](https://course18.fast.ai/ml)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.\n", - "\n", - "| | \n", - "|:--:| \n", - "| 6 Step Machine Learning Modelling Framework ([read more](https://whimsical.com/9g65jgoRYTxMXxDosndYTB)) |\n", - "\n", - "To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.\n", - "\n", - "| | \n", - "|:--:| \n", - "| Tools that can be used for each step of the machine learning modelling process. |\n", - "\n", - "We'll work through each step and by the end of the notebook, we'll have a trained machine learning model which predicts the sale price of a bulldozer given different characteristics about it." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 6 Step Machine Learning Framework\n", - "\n", - "#### 1. Problem Definition\n", - "\n", - "For this dataset, the problem we're trying to solve, or better, the question we're trying to answer is,\n", - "\n", - "> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 2. Data\n", - "\n", - "Looking at the [dataset from Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data) we see that it contains historical sales data of bulldozers, including things like model type, size, sale date and more.\n", - "\n", - "There are 3 datasets:\n", - "\n", - "1. **Train.csv** - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including `SalePrice` which is the **target variable**).\n", - "2. 
**Valid.csv** - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as **Train.csv**).\n", - "3. **Test.csv** - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the `SalePrice` attribute, as this is what we'll be trying to predict).\n", - "\n", - "> **Note:** You can download the `bluebook-for-bulldozers` dataset directly from Kaggle. Alternatively, you can also [download it directly from the course GitHub](https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3. Evaluation\n", - "\n", - "For this problem, [Kaggle has set the evaluation metric to be root mean squared log error (RMSLE)](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation). As with many regression evaluations, the goal will be to get this value as low as possible (a low error value means our model's predictions are close to what the real values are).\n", - "\n", - "To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the [Kaggle leaderboard](https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 4. Features\n", - "\n", - "Features are different parts and attributes of the data. \n", - "\n", - "During this step, you'll want to start finding out what you can about the data.\n", - "\n", - "One of the most common ways to do this is to create a **data dictionary**.\n", - "\n", - "For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means. 
\n", - "\n", - "For example: \n", - "\n", - "| Variable Name | Description | Variable Type |\n", - "|------|-----|-----|\n", - "| SalesID | unique identifier of a particular sale of a machine at auction | Independent variable |\n", - "| MachineID | identifier for a particular machine; machines may have multiple sales | Independent variable |\n", - "| ModelID | identifier for a unique machine model (i.e. fiModelDesc) | Independent variable |\n", - "| datasource | source of the sale record; some sources are more diligent about reporting attributes of the machine than others. Note that a particular datasource may report on multiple auctioneerIDs. | Independent variable |\n", - "| auctioneerID | identifier of a particular auctioneer, i.e. company that sold the machine at auction. Not the same as datasource. | Independent variable |\n", - "| YearMade | year of manufacturer of the Machine | Independent variable |\n", - "| MachineHoursCurrentMeter | current usage of the machine in hours at time of sale (saledate); null or 0 means no hours have been reported for that sale | Independent variable |\n", - "| UsageBand | value (low, medium, high) calculated comparing this particular Machine-Sale hours to average usage for the fiBaseModel; e.g. 'Low' means this machine has fewer hours given its lifespan relative to the average of fiBaseModel. 
| Independent variable |\n", - "| Saledate | time of sale | Independent variable |\n", - "| fiModelDesc | Description of a unique machine model (see ModelID); concatenation of fiBaseModel & fiSecondaryDesc & fiModelSeries & fiModelDescriptor | Independent variable |\n", - "| State | US State in which sale occurred | Independent variable |\n", - "| Drive_System | machine configuration; typically describes whether 2 or 4 wheel drive | Independent variable |\n", - "| Enclosure | machine configuration - does the machine have an enclosed cab or not | Independent variable |\n", - "| Forks | machine configuration - attachment used for lifting | Independent variable |\n", - "| Pad_Type | machine configuration - type of treads a crawler machine uses | Independent variable |\n", - "| Ride_Control | machine configuration - optional feature on loaders to make the ride smoother | Independent variable |\n", - "| Transmission | machine configuration - describes type of transmission; typically automatic or manual | Independent variable |\n", - "| ... | ... | ... |\n", - "| SalePrice | cost of sale in USD | Target/dependent variable | \n", - "\n", - "You can download the full version of this file directly from the [Kaggle competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/download/Bnl6RAHA0enbg0UfAvGA%2Fversions%2FwBG4f35Q8mAbfkzwCeZn%2Ffiles%2FData%20Dictionary.xlsx) (Kaggle account required) or view it [on Google Sheets](https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing).\n", - "\n", - "With all of this in mind, let's get started! \n", - "\n", - "First, we'll import the dataset and start exploring. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Notebook last run (end-to-end): 2024-10-30 11:54:38.504966\n" - ] - } - ], - "source": [ - "# Timestamp\n", - "import datetime\n", - "\n", - "print(f\"Notebook last run (end-to-end): {datetime.datetime.now()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Importing the data and preparing it for modelling\n", - "\n", - "First things first, let's get the libraries we need imported and the data we'll need for the project.\n", - "\n", - "We'll start by importing pandas, NumPy and matplotlib." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "pandas version: 2.2.2\n", - "NumPy version: 2.1.1\n", - "matplotlib version: 3.9.2\n" - ] - } - ], - "source": [ - "# Import data analysis tools \n", - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Print the versions we're using (as long as your versions are equal to or higher than these, the code should work)\n", - "print(f\"pandas version: {pd.__version__}\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"matplotlib version: {matplotlib.__version__}\") " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we've got our tools for data analysis ready, we can import the data and start to explore it.\n", - "\n", - "For this project, I've [downloaded the data from Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data) and stored it on the [course GitHub](https://github.com/mrdbourke/zero-to-mastery-ml/) under the file path [`../data/bluebook-for-bulldozers`](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/bluebook-for-bulldozers.zip).\n", - "\n", - "We can 
write some code to check if the files are available locally (on our computer) and if not, we can download them.\n", - "\n", - "> **Note:** If you're running this notebook on Google Colab, the code below will enable you to download the dataset programmatically. Just beware that each time Google Colab shuts down, the data will have to be redownloaded. There's also an [example Google Colab notebook](https://colab.research.google.com/drive/1hf1rTcCAQP1EN8pZ0ZIqjjEy47dwzbiv?usp=sharing) showing how to download the data programmatically." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!\n", - "[INFO] Current dataset dir: ../data/bluebook-for-bulldozers\n" - ] - } - ], - "source": [ - "from pathlib import Path\n", - "\n", - "# Check if 'bluebook-for-bulldozers' exists in the current or parent directory\n", - "# Link to data (see the file \"bluebook-for-bulldozers\"): https://github.com/mrdbourke/zero-to-mastery-ml/tree/master/data\n", - "dataset_dir = Path(\"../data/bluebook-for-bulldozers\")\n", - "if not (dataset_dir.is_dir()):\n", - " print(f\"[INFO] Can't find existing 'bluebook-for-bulldozers' dataset in current directory or parent directory, downloading...\")\n", - "\n", - " # Download and unzip the bluebook for bulldozers dataset\n", - " !wget https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip\n", - " !unzip bluebook-for-bulldozers.zip\n", - "\n", - " # Ensure a data directory exists and move the downloaded dataset there\n", - " !mkdir ../data/\n", - " !mv bluebook-for-bulldozers ../data/\n", - " print(f\"[INFO] Current dataset dir: {dataset_dir}\")\n", - "\n", - " # Remove .zip file from notebook directory\n", - " !rm -rf bluebook-for-bulldozers.zip\n", - "else:\n", - " # If the target dataset directory exists, we don't need to 
download it\n", - " print(f\"[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!\")\n", - " print(f\"[INFO] Current dataset dir: {dataset_dir}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Dataset downloaded!\n", - "\n", - "Let's check what files are available." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Files/folders available in ../data/bluebook-for-bulldozers:\n" - ] - }, - { - "data": { - "text/plain": [ - "['random_forest_benchmark_test.csv',\n", - " 'Valid.csv',\n", - " 'median_benchmark.csv',\n", - " 'Valid.zip',\n", - " 'TrainAndValid.7z',\n", - " 'Test.csv',\n", - " 'predictions.csv',\n", - " 'Train.7z',\n", - " 'TrainAndValid_object_values_as_categories.parquet',\n", - " 'test_predictions.csv',\n", - " 'ValidSolution.csv',\n", - " 'train_tmp.csv',\n", - " 'Machine_Appendix.csv',\n", - " 'Train.csv',\n", - " 'Valid.7z',\n", - " 'TrainAndValid_object_values_as_categories.csv',\n", - " 'TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet',\n", - " 'Data Dictionary.xlsx',\n", - " 'TrainAndValid.csv',\n", - " 'Train.zip',\n", - " 'TrainAndValid.zip']" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "\n", - "print(f\"[INFO] Files/folders available in {dataset_dir}:\")\n", - "os.listdir(dataset_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can explore each of these files individually or read about them on the [Kaggle Competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/data).\n", - "\n", - "For now, the main file we're interested in is `TrainAndValid.csv`, a combination of the training (`Train.csv`) and validation (`Valid.csv`) datasets.\n", - "\n", - "* The training data (`Train.csv`) contains 
sale data from 1989 up to the end of 2011.\n", - "* The validation data (`Valid.csv`) contains sale data from January 1, 2012 - April 30, 2012.\n", - "* The test data (`Test.csv`) contains sale data from May 1, 2012 - November 2012.\n", - "\n", - "We'll use the training data to train our model to predict the sale price of bulldozers. We'll then validate its performance on the validation data to see if our model can be improved in any way. Finally, we'll evaluate our best model on the test dataset.\n", - "\n", - "But more on this later.\n", - "\n", - "Let's import the `TrainAndValid.csv` file and turn it into a pandas DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/1127193594.py:2: DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False.\n", - " df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")\n" - ] - } - ], - "source": [ - "# Import the training and validation set\n", - "df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful! We've got our DataFrame ready to explore.\n", - "\n", - "You might see a warning appear in the form:\n", - "\n", - "`DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv(\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")`\n", - "\n", - "This is just saying that some of our columns have multiple/mixed data types. For example, a column may contain strings but also contain integers. 
This is okay for now and can be addressed later on if necessary.\n", - "\n", - "How about we get some information about our DataFrame?\n" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 53 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 saledate 412698 non-null object \n", - " 10 fiModelDesc 412698 non-null object \n", - " 11 fiBaseModel 412698 non-null object \n", - " 12 fiSecondaryDesc 271971 non-null object \n", - " 13 fiModelSeries 58667 non-null object \n", - " 14 fiModelDescriptor 74816 non-null object \n", - " 15 ProductSize 196093 non-null object \n", - " 16 fiProductClassDesc 412698 non-null object \n", - " 17 state 412698 non-null object \n", - " 18 ProductGroup 412698 non-null object \n", - " 19 ProductGroupDesc 412698 non-null object \n", - " 20 Drive_System 107087 non-null object \n", - " 21 Enclosure 412364 non-null object \n", - " 22 Forks 197715 non-null object \n", - " 23 Pad_Type 81096 non-null object \n", - " 24 Ride_Control 152728 non-null object \n", - " 25 Stick 81096 non-null object \n", - " 26 Transmission 188007 non-null object \n", - " 27 Turbocharged 81096 non-null object \n", - " 28 Blade_Extension 25983 non-null object \n", - " 29 Blade_Width 25983 non-null object \n", - " 30 Enclosure_Type 25983 non-null object \n", - " 31 Engine_Horsepower 25983 non-null object \n", - " 32 Hydraulics 330133 non-null 
object \n", - " 33 Pushblock 25983 non-null object \n", - " 34 Ripper 106945 non-null object \n", - " 35 Scarifier 25994 non-null object \n", - " 36 Tip_Control 25983 non-null object \n", - " 37 Tire_Size 97638 non-null object \n", - " 38 Coupler 220679 non-null object \n", - " 39 Coupler_System 44974 non-null object \n", - " 40 Grouser_Tracks 44875 non-null object \n", - " 41 Hydraulics_Flow 44875 non-null object \n", - " 42 Track_Type 102193 non-null object \n", - " 43 Undercarriage_Pad_Width 102916 non-null object \n", - " 44 Stick_Length 102261 non-null object \n", - " 45 Thumb 102332 non-null object \n", - " 46 Pattern_Changer 102261 non-null object \n", - " 47 Grouser_Type 102193 non-null object \n", - " 48 Backhoe_Mounting 80712 non-null object \n", - " 49 Blade_Type 81875 non-null object \n", - " 50 Travel_Controls 81877 non-null object \n", - " 51 Differential_Type 71564 non-null object \n", - " 52 Steering_Controls 71522 non-null object \n", - "dtypes: float64(3), int64(5), object(45)\n", - "memory usage: 166.9+ MB\n" - ] - } - ], - "source": [ - "# Get info about DataFrame\n", - "df.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! 
Over 400,000 entries!\n", - "\n", - "That's a much larger dataset than what we've worked with before.\n", - "\n", - "One thing you might have noticed is that the `saledate` column value is being treated as a Python object (it's okay if you didn't notice, these things take practice).\n", - "\n", - "A `Dtype` of `object` usually means the column holds strings.\n", - "\n", - "However, when we look at it...\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 11/16/2006 0:00\n", - "1 3/26/2004 0:00\n", - "2 2/26/2004 0:00\n", - "3 5/19/2011 0:00\n", - "4 7/23/2009 0:00\n", - "5 12/18/2008 0:00\n", - "6 8/26/2004 0:00\n", - "7 11/17/2005 0:00\n", - "8 8/27/2009 0:00\n", - "9 8/9/2007 0:00\n", - "Name: saledate, dtype: object" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df[\"saledate\"][:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see that these `object` values are in the form of dates.\n", - "\n", - "Since we're working on a **time series** problem (a machine learning problem with a time component), it's probably worth turning these strings into Python `datetime` objects.\n", - "\n", - "Before we do, let's try to visualize our `saledate` column against our `SalePrice` column.\n", - "\n", - "To do so, we can create a scatter plot.\n", - "\n", - "And to prevent our plot from being too big, how about we visualize the first 1000 values?" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df[\"saledate\"][:1000], # visualize the first 1000 values\n", - " y=df[\"SalePrice\"][:1000])\n", - "ax.set_xlabel(\"Sale Date\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... looks like the x-axis is quite crowded.\n", - "\n", - "Maybe we can fix this by turning the `saledate` column into `datetime` format.\n", - "\n", - "The good news is that it looks like our `SalePrice` column is already in `float64` format so we can view its distribution directly from the DataFrame using a histogram plot." - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# View SalePrice distribution \n", - "df.SalePrice.plot.hist(xlabel=\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.1 Parsing dates\n", - "\n", - "When working with time series data, it's a good idea to make sure any date data is the format of a [datetime object](https://docs.python.org/3/library/datetime.html) (a Python data type which encodes specific information about dates).\n", - "\n", - "We can tell pandas which columns to read in as dates by setting the `parse_dates` parameter in [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).\n", - "\n", - "Once we've imported our CSV with the `saledate` column parsed, we can view information about our DataFrame again with `df.info()`. " - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 53 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64 \n", - " 8 UsageBand 73670 non-null object \n", - " 9 saledate 412698 non-null datetime64[ns]\n", - " 10 fiModelDesc 412698 non-null object \n", - " 11 fiBaseModel 412698 non-null object \n", - " 12 fiSecondaryDesc 271971 non-null object \n", - " 13 fiModelSeries 58667 non-null object \n", - " 14 fiModelDescriptor 74816 non-null object \n", - " 15 ProductSize 196093 non-null object \n", - " 16 fiProductClassDesc 412698 non-null object \n", 
- " 17 state 412698 non-null object \n", - " 18 ProductGroup 412698 non-null object \n", - " 19 ProductGroupDesc 412698 non-null object \n", - " 20 Drive_System 107087 non-null object \n", - " 21 Enclosure 412364 non-null object \n", - " 22 Forks 197715 non-null object \n", - " 23 Pad_Type 81096 non-null object \n", - " 24 Ride_Control 152728 non-null object \n", - " 25 Stick 81096 non-null object \n", - " 26 Transmission 188007 non-null object \n", - " 27 Turbocharged 81096 non-null object \n", - " 28 Blade_Extension 25983 non-null object \n", - " 29 Blade_Width 25983 non-null object \n", - " 30 Enclosure_Type 25983 non-null object \n", - " 31 Engine_Horsepower 25983 non-null object \n", - " 32 Hydraulics 330133 non-null object \n", - " 33 Pushblock 25983 non-null object \n", - " 34 Ripper 106945 non-null object \n", - " 35 Scarifier 25994 non-null object \n", - " 36 Tip_Control 25983 non-null object \n", - " 37 Tire_Size 97638 non-null object \n", - " 38 Coupler 220679 non-null object \n", - " 39 Coupler_System 44974 non-null object \n", - " 40 Grouser_Tracks 44875 non-null object \n", - " 41 Hydraulics_Flow 44875 non-null object \n", - " 42 Track_Type 102193 non-null object \n", - " 43 Undercarriage_Pad_Width 102916 non-null object \n", - " 44 Stick_Length 102261 non-null object \n", - " 45 Thumb 102332 non-null object \n", - " 46 Pattern_Changer 102261 non-null object \n", - " 47 Grouser_Type 102193 non-null object \n", - " 48 Backhoe_Mounting 80712 non-null object \n", - " 49 Blade_Type 81875 non-null object \n", - " 50 Travel_Controls 81877 non-null object \n", - " 51 Differential_Type 71564 non-null object \n", - " 52 Steering_Controls 71522 non-null object \n", - "dtypes: datetime64[ns](1), float64(3), int64(5), object(44)\n", - "memory usage: 166.9+ MB\n" - ] - } - ], - "source": [ - "df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\",\n", - " low_memory=False, # set low_memory=False to prevent mixed data types 
warning \n", - " parse_dates=[\"saledate\"]) # can use the parse_dates parameter and specify which column to treat as a date column\n", - "\n", - "# With parse_dates... check dtype of \"saledate\"\n", - "df.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "Looks like our `saledate` column is now of type [`datetime64[ns]`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.datetime64), a NumPy-specific datetime format with nanosecond precision.\n", - "\n", - "Since pandas works well with NumPy, we can keep it in this format.\n", - "\n", - "How about we view a few samples from our `saledate` column again?" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2006-11-16\n", - "1 2004-03-26\n", - "2 2004-02-26\n", - "3 2011-05-19\n", - "4 2009-07-23\n", - "5 2008-12-18\n", - "6 2004-08-26\n", - "7 2005-11-17\n", - "8 2009-08-27\n", - "9 2007-08-09\n", - "Name: saledate, dtype: datetime64[ns]" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df[\"saledate\"][:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! That's looking much better already. \n", - "\n", - "We'll see how having our dates in this format is really helpful later on.\n", - "\n", - "For now, how about we visualize our `saledate` column against our `SalePrice` column again?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAmUAAAGwCAYAAADolBImAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8hTgPZAAAACXBIWXMAAA9hAAAPYQGoP6dpAAC0v0lEQVR4nOydeXgT1frHv0n3Fpq2VJoWWcpOLVBAgcpyFQtFUMHlXjY35MIV6b0sLoDKoqgI/hTwIqKouAHuIALWWyiyWfa1lN0CCg1IV2jpQjO/P+rEJM0kZ7Zkkryf5+HRJicz58ycOfOed9VxHMeBIAiCIAiC8Ch6T3eAIAiCIAiCIKGMIAiCIAhCE5BQRhAEQRAEoQFIKCMIgiAIgtAAJJQRBEEQBEFoABLKCIIgCIIgNAAJZQRBEARBEBog0NMd8CfMZjMuXryIhg0bQqfTebo7BEEQBEEwwHEcrl69ioSEBOj16umzSChzIxcvXkTTpk093Q2CIAiCICTw22+/4eabb1bt+CSUuZGGDRsCqLupkZGRHu4NQRAEQRAslJWVoWnTppb3uFqQUOZGeJNlZGQkCWUEQRAE4WWo7XpEjv4EQRAEQRAagIQygiAIgiAIDUBCGUEQBEEQhAYgoYwgCIIgCEIDkFBGEARBEAShAUgoIwiCIAiC0AAklBEEQRAEQWgAEsoIgiAIgiA0AAllBEEQBEEQGoAy+hMEQVhRa+awO78Il69WonHDUHRPjEGAXt0s3gRBEAAJZQRBEBYycwvw0g95KCittHwWbwjFrHuTMDA53oM9IwjCHyDzJUEQBOoEsvGf77cRyADAVFqJ8Z/vR2ZugYd6RhCEv0BCGUEQfk+tmcNLP+SBc/Ad/9lLP+Sh1uyoBUEQhDJ4VCjbunUr7r33XiQkJECn02HNmjWCbZ988knodDosXLjQ5vOioiKMGjUKkZGRiIqKwpgxY3Dt2jWbNocPH0afPn0QGhqKpk2bYv78+fWO//XXX6N9+/YIDQ1Fx44dsWHDBpvvOY7DzJkzER8fj7CwMKSlpeHUqVOSx04QhHbYnV9UT0NmDQegoLQSu/OL3NcpgiD8Do8KZeXl5ejcuTPeeecdp+1Wr16NnTt3IiEhod53o0aNwtGjR5GVlYV169Zh69atGDdunOX7srIyDBgwAM2bN8e+ffvwxhtvYPbs2Xj//fctbX755ReMGDECY8aMwYEDBzB06FAMHToUubm5ljbz58/H22+/jaVLl2LXrl2IiIhAeno6KiuFF3KCILyDy1fZnmPWdgRBEFLQcRynCX28TqfD6tWrMXToUJvPL1y4gB49euCnn37C4MGDMWnSJEyaNAkAcOzYMSQlJWHPnj249dZbAQCZmZkYNGgQfv/9dyQkJODdd9/FCy+8AJPJhODgYADAtGnTsGbNGhw/fhwAMGzYMJSXl2PdunWW8/bs2RMpKSlYunQpOI5DQkICnn76aTzzzDMAgNLSUsTFxeHjjz/G8OHDHY6pqqoKVVVVlr/LysrQtGlTlJaWIjIyUpHrRhCEfHLOFGLEsp0u260a2xOprRq5oUcEQWiJsrIyGAwG1d/fmvYpM5vNeOSRR/Dss8/illtuqfd9Tk4OoqKiLAIZAKSlpUGv12PXrl2WNn379rUIZACQnp6OEydOoLi42NImLS3N5tjp6enIyckBAOTn58NkMtm0MRgM6NGjh6WNI+bOnQuDwWD517RpUwlXgSAItemeGIN4QyiEEl/oUBeF2T0xxp3dIgjCz9C0UDZv3jwEBgbiP//5j8PvTSYTGjdubPNZYGAgYmJiYDKZLG3i4uJs2vB/u2pj/b317xy1ccT06dNRWlpq+ffbb785HS9BEJ4hQK/DrHuTAK
CeYMb/PeveJMpXRhCEqmg2T9m+ffuwaNEi7N+/Hzqddy6EISEhCAkJ8XQ3CIJgYGByPN59uGu9PGVGylNGEISb0KxQtm3bNly+fBnNmjWzfFZbW4unn34aCxcuxNmzZ2E0GnH58mWb3924cQNFRUUwGo0AAKPRiEuXLtm04f921cb6e/6z+Ph4mzYpKSkKjJYgCC0wMDke/ZOMlNGfIAiPoFnz5SOPPILDhw/j4MGDln8JCQl49tln8dNPPwEAUlNTUVJSgn379ll+l52dDbPZjB49eljabN26FTU1NZY2WVlZaNeuHaKjoy1tNm3aZHP+rKwspKamAgASExNhNBpt2pSVlWHXrl2WNgRB+AYBeh1SWzXCkJQmSG3ViAQygiDchkc1ZdeuXcPp06ctf+fn5+PgwYOIiYlBs2bN0KiRbZRTUFAQjEYj2rVrBwDo0KEDBg4ciLFjx2Lp0qWoqalBRkYGhg8fbkmfMXLkSLz00ksYM2YMpk6ditzcXCxatAgLFiywHHfixIn429/+hjfffBODBw/GF198gb1791rSZuh0OkyaNAmvvPIK2rRpg8TERMyYMQMJCQn1okUJgiAIgiAkwXmQzZs3c6jLy2jz77HHHnPYvnnz5tyCBQtsPissLORGjBjBNWjQgIuMjORGjx7NXb161abNoUOHuN69e3MhISFckyZNuNdff73esb/66iuubdu2XHBwMHfLLbdw69evt/nebDZzM2bM4OLi4riQkBDurrvu4k6cOCFqvKWlpRwArrS0VNTvCIIgCILwHO56f2smT5k/4K48JwRBEARBKAflKSMIgiAIgvAjSCgjCIIgCILQACSUEQRBEARBaAASygiCIAiCIDQACWUEQRAEQRAagIQygiAIgiAIDUBCGUEQBEEQhAYgoYwgCIIgCEIDkFBGEARBEAShAUgoIwiCIAiC0AAklBEEQRAEQWgAEsoIgiAIgiA0AAllBEEQBEEQGoCEMoIgCIIgCA1AQhlBEARBEIQGIKGMIAiCIAhCA5BQRhAEQRAEoQECPd0BgiAIgiCIWjOH3flFuHy1Eo0bhqJ7YgwC9DpPd8utkFBGEARBEIRHycwtwEs/5KGgtNLyWbwhFLPuTcLA5HgP9sy9kPmSIAjNU2vmkHOmEN8fvICcM4WoNXOe7hJBEAqRmVuA8Z/vtxHIAMBUWonxn+9HZm6Bh3rmfkhTRhCEpqEdNEH4LrVmDi/9kAdH2ywOgA7ASz/koX+S0S9MmaQpIwhCs9AOmiB8m935RfWeb2s4AAWlldidX+S+TnkQ0pQRBCPe5ITqTX0VgnbQBOH7XL4qLJBJaeftkFBGEAx4kwnNm/rqDDE76NRWjdzXMYIgFKNxw1BF23k7ZL4kCBd4kwnNm/rqCtpBE4Tv0z0xBvGGUAjpunWo21R2T4xxZ7c8BgllBOEEVyY0oM6EpoVoQG/qKwu0gyYI3ydAr8Ose5MAoJ5gxv89694kv3FRIKGMIJzgTU6o3tRXFmgHTRD+wcDkeLz7cFcYDbYbLKMhFO8+3NWr3C7kQj5lBOEEbzKheVNfWeB30OM/3w8dYKMB9McdNEH4MgOT49E/yej1AUpyIaGMIJzgTSY0b+orK/wO2j5wweiFgQsEQTgnQK/z+6AdEsoIwgm8Cc1UWunQV0uHOgFBCyY0b+qrGGgHTRCEv0A+ZQThBG9yQvWmvoqF30EPSWmC1FaNvHIMBEEQriChjCBc4E1OqN7UV4IgCMIWHcdx3hEf7wOUlZXBYDCgtLQUkZGRnu4OIRJvypLvTX0lCILQOu56f5NPGUEw4k1OqN7UV4IgCKIOEsoIgiAkQhpJgiCUhIQygiAICfhKjVGCILQDOfoTBEGIxJdqjBIEoR1IKCMIghCBr9UYJQhCO3hUKNu6dSvuvfdeJCQkQKfTYc2aNZbvampqMHXqVHTs2BERERFISEjAo48+iosXL9oco6ioCKNGjUJkZCSioqIwZswYXLt2zabN4c
OH0adPH4SGhqJp06aYP39+vb58/fXXaN++PUJDQ9GxY0ds2LDB5nuO4zBz5kzEx8cjLCwMaWlpOHXqlHIXgyAIr8DXaowSBKEdPCqUlZeXo3PnznjnnXfqfVdRUYH9+/djxowZ2L9/P7777jucOHEC9913n027UaNG4ejRo8jKysK6deuwdetWjBs3zvJ9WVkZBgwYgObNm2Pfvn144403MHv2bLz//vuWNr/88gtGjBiBMWPG4MCBAxg6dCiGDh2K3NxcS5v58+fj7bffxtKlS7Fr1y5EREQgPT0dlZXeUUeQqE+tmUPOmUJ8f/ACcs4UkmaDYMLXaowSBKEdNJOnTKfTYfXq1Rg6dKhgmz179qB79+44d+4cmjVrhmPHjiEpKQl79uzBrbfeCgDIzMzEoEGD8PvvvyMhIQHvvvsuXnjhBZhMJgQHBwMApk2bhjVr1uD48eMAgGHDhqG8vBzr1q2znKtnz55ISUnB0qVLwXEcEhIS8PTTT+OZZ54BAJSWliIuLg4ff/wxhg8fzjRGylOmHchJm5BKzplCjFi202W7VWN7UloSgvAR3PX+9iqfstLSUuh0OkRFRQEAcnJyEBUVZRHIACAtLQ16vR67du2ytOnbt69FIAOA9PR0nDhxAsXFxZY2aWlpNudKT09HTk4OACA/Px8mk8mmjcFgQI8ePSxtHFFVVYWysjKbf4TnISdtQg58jVGhxBc61An43lZjlCAIz+M1QlllZSWmTp2KESNGWKRUk8mExo0b27QLDAxETEwMTCaTpU1cXJxNG/5vV22sv7f+naM2jpg7dy4MBoPlX9OmTUWNmVAeb3PSJhOr9vDlGqMEQXgWr8hTVlNTg3/84x/gOA7vvvuup7vDzPTp0zFlyhTL32VlZSSYeRgxTtqeNj2RiVW78DVG7e+Pke4PQRAy0LxQxgtk586dQ3Z2to0t12g04vLlyzbtb9y4gaKiIhiNRkubS5cu2bTh/3bVxvp7/rP4+HibNikpKYJ9DwkJQUhIiJjhEirjLU7avInVXi/Gm1ipuLjnGZgcj/5JRsroTxCEYmjafMkLZKdOncLGjRvRqJGt5iI1NRUlJSXYt2+f5bPs7GyYzWb06NHD0mbr1q2oqamxtMnKykK7du0QHR1tabNp0yabY2dlZSE1NRUAkJiYCKPRaNOmrKwMu3btsrQhvIPGDUMVbacG3mZi9Wf4GqNDUpogtVUjEsgIgpCFR4Wya9eu4eDBgzh48CCAOof6gwcP4vz586ipqcFDDz2EvXv3YsWKFaitrYXJZILJZEJ1dTUAoEOHDhg4cCDGjh2L3bt3Y8eOHcjIyMDw4cORkJAAABg5ciSCg4MxZswYHD16FF9++SUWLVpkY1acOHEiMjMz8eabb+L48eOYPXs29u7di4yMDAB1kaGTJk3CK6+8grVr1+LIkSN49NFHkZCQ4DRalNAe3uCkTXmwCIIg/BOPCmV79+5Fly5d0KVLFwDAlClT0KVLF8ycORMXLlzA2rVr8fvvvyMlJQXx8fGWf7/88ovlGCtWrED79u1x1113YdCgQejdu7dNDjKDwYD//e9/yM/PR7du3fD0009j5syZNrnMbr/9dqxcuRLvv/8+OnfujG+++QZr1qxBcnKypc1zzz2Hf//73xg3bhxuu+02XLt2DZmZmQgN9ZxGhRCPNzhpe4uJlSAIglAWzeQp8wcoT5l20LITPeXBIgiC0Bbuen9r3tGfINRAy07avInVVFrp0K9Mh7ooP8qDRRAE4VuQUEb4LbyTttbgTazjP98PHWAjmGnFxEoQBEEoj6ajLwnCX+HzYBkNtj6LRkMopcMgCILwUUhTRhAaRcsmVoIgCEJ5SCgjCA2jVRMrQRAEoTwklBEEQbig1syRxtID0HUn/A0SygiCIJyg5fQpvgxdd8IfIUd/giAIAfgapPYVFvgapJm5BR7qmW9D153wV0goIwiCcADVIPUMdN0Jf4aEMoIgCAdQDVLPQNed8G
dIKCMIgnAA1SD1DHTdCX+GhDKCIAgHNG4Y6rqRiHYEG3TdCX+GhDKCIPyaWjOHnDOF+P7gBeScKbT4KvE1SIUSMOhQFw1INUiVha474c9QSgyCIPwWV2kXqAap+6Har4Q/Q5oygiD8Epa0C1SD1DPQdSf8FR3HcRRX7CbKyspgMBhQWlqKyMhIT3eH8CPEZkb39UzqtWYOvedlC0b56VAnAGyf2g8Bep3PXw+tQted0Aruen+T+ZIgfByxmdH9IZO6mLQLqa0aUQ1SD0HXnfA3yHxJED6M2Mzo/pJJndIuEAShRUgoIwgfRWxmdH/KpE5pFwiC0CIklBGEjyI2M7o/ZVKntAsEQWgREsoIwkcRa6LzJ5Men3YBQD3BjNIuEAThKUgoIwgfRayJzt9MepR2gSAIrUHRlwTho/AmOlNppUM/MT7tA2+iE9teKlpKczAwOR79k4ya6Q9BEP4NCWUE4aOIzYzujkzqWky3QWkXCILQCmS+JAgfRqyJTk2Tnr+k2yAIgpAKZfR3I5TRn/AUns7oLzaDPkEQhJagjP4EQSiGWBOd0iY9sRn0CYIg/BEyXxIEoTr+lG6DIAhCKiSUEQShOv6WboMgCEIKJJQRBKE6lEGfIAjCNSSUEQShOpRBnyAIwjUklBEE4RYogz5BEIRzKPqSIAi34SyDvpYy/fOI6ZMW+09oA5obBCsklBEE4VYcpdvQYqZ/MX3SYv8JbUBzgxADJY91I5Q8liDqw2f6t1+IeD2CJ0ybYvqkxf4T2oDmhu/grvc3+ZQRBOExas0cXvohz2EBdP6zl37IQ61Z/b1jrZlDzplCrD5wAc+vPuKyT9U3zNhx+gqmfeu6rTv6T2gLLc1twnsg8yVBEB5DK5n+HZmYXPWp59xNKCqvZmpLlQr8D63MbcK7IKGMIAiPoYVM/0ImJle4EsisoUoF/ocW5jbhfXjUfLl161bce++9SEhIgE6nw5o1a2y+5zgOM2fORHx8PMLCwpCWloZTp07ZtCkqKsKoUaMQGRmJqKgojBkzBteuXbNpc/jwYfTp0wehoaFo2rQp5s+fX68vX3/9Ndq3b4/Q0FB07NgRGzZsEN0XgiDE4elM/85MTEpClQr8D0/PbcI78ahQVl5ejs6dO+Odd95x+P38+fPx9ttvY+nSpdi1axciIiKQnp6Oysq/dhajRo3C0aNHkZWVhXXr1mHr1q0YN26c5fuysjIMGDAAzZs3x759+/DGG29g9uzZeP/99y1tfvnlF4wYMQJjxozBgQMHMHToUAwdOhS5ubmi+kIQ/gDve/X9wQvIOVMoyyfG05n+XZmY5EKVCrSLkvPYEZ6e24R3opnoS51Oh9WrV2Po0KEA6jRTCQkJePrpp/HMM88AAEpLSxEXF4ePP/4Yw4cPx7Fjx5CUlIQ9e/bg1ltvBQBkZmZi0KBB+P3335GQkIB3330XL7zwAkwmE4KDgwEA06ZNw5o1a3D8+HEAwLBhw1BeXo5169ZZ+tOzZ0+kpKRg6dKlTH1hgaIvCW9HjfB+3nwIwEZj5Y4Ite8PXsDELw6qcmyKsNMu7kpT4cm5TSiL30df5ufnw2QyIS0tzfKZwWBAjx49kJOTAwDIyclBVFSURSADgLS0NOj1euzatcvSpm/fvhaBDADS09Nx4sQJFBcXW9pYn4dvw5+HpS+OqKqqQllZmc0/gvBW+BeMvWbJVFqJ8Z/vR2ZugaTjejLTv5qmI6pUoE3UmseOoCoWhFg06+hvMpkAAHFxcTafx8XFWb4zmUxo3LixzfeBgYGIiYmxaZOYmFjvGPx30dHRMJlMLs/jqi+OmDt3Ll566SXXgyUIjcMS3j/t2yNoGBqEni0bic5W7izTv5rwJiZTaaVifmVR4UF4Z0RX9Gwl/joQjlEqI76reaxDXZqK/klGxe6dp+Y24Z1oVijzBaZPn44pU6ZY/i4rK0PTpk092COCkAaL71XJ9RqM+mCXZDOQo0z/asMXSh//+X7oUN/EJE
ZQ41+xrz/QEb3axCrWR39HSVOjp9JUeGJuE96JZs2XRqMRAHDp0iWbzy9dumT5zmg04vLlyzbf37hxA0VFRTZtHB3D+hxCbay/d9UXR4SEhCAyMtLmH0F4I2LC9tUwA6mJMxPTkpFdnDpr27fXmklKbWd2tVHa1OiraSq8/T4Tf6FZTVliYiKMRiM2bdqElJQUAHWapl27dmH8+PEAgNTUVJSUlGDfvn3o1q0bACA7Oxtmsxk9evSwtHnhhRdQU1ODoKAgAEBWVhbatWuH6OhoS5tNmzZh0qRJlvNnZWUhNTWVuS8E4cuI8b1SywykJs5MTHq9zqkm7YleLdA/yag5k5S311xUw9Toi2kqvP0+E7Z4VFN27do1HDx4EAcPHgRQ51B/8OBBnD9/HjqdDpMmTcIrr7yCtWvX4siRI3j00UeRkJBgidDs0KEDBg4ciLFjx2L37t3YsWMHMjIyMHz4cCQkJAAARo4cieDgYIwZMwZHjx7Fl19+iUWLFtmYFSdOnIjMzEy8+eabOH78OGbPno29e/ciIyMDAJj6QhC+jKvwfnuszUDeAm9iGpLSBKlW/mDONGlLH+6KmffeYtNeC7jTmV0txJgaWfG1NBW+cJ8JWzyqKdu7dy/uvPNOy9+8oPTYY4/h448/xnPPPYfy8nKMGzcOJSUl6N27NzIzMxEa+tfiuGLFCmRkZOCuu+6CXq/Hgw8+iLffftvyvcFgwP/+9z9MmDAB3bp1Q2xsLGbOnGmTy+z222/HypUr8eKLL+L5559HmzZtsGbNGiQnJ1vasPSFIHwVZ75XzvA2M5AQ3uSs7QlndjVQw9ToyocQAGbdm6T4dVEqUMH+mL5wnwHX10eN66dVNJOnzB+gPGWEtyOmRiQArBrbkxyc3UzOmUKMWLbTZTut3xs1x+FOk59a5/KV++zq+mjFPOuu97dmfcoIwhO4c0fmjbs/XmO080whJqzcj5LrNQ7b6VBn3hNjBvLG66FFfMWZ3VW6EilzjMddmk+huqq8eVFOYIgv3GdX12dc30S8vzVfleunVUgoI4g/8YXdszsI0OvQq00sXn+wo9Ns5WLMQN58PbSGrzizq21qVDtNhdrmRW+/zyy5D5dtqy+Q8d97k3lWDJpNiUEQ7sSdDrO+4pyrVLZyX7keWsGXnNm9OSO+GoEK1nj7fWbJfegss4f19fOllCCkKSMUxRtNUO50mPUl51xAvhnI166HFvCUM7taeFOQhTVqmxe9/T4rZVbdmGfClK8O+oyWnYQyQjG81QTlzizfnsooriZyzEC+eD20AK9hsn8ejV7wPDrCGzPiu8O86M33WSmz6oc7ztb7zJt9zkgo81HcrbFS06FVbdzpMOsLzrlKQtdDPbxVw+QrqBmoYI233meWurN6HcBxjlPw6ADodI5NnN6sZSehzAdxt8bK201Q7nSY9XbnXKWJbRCiaDvCFm/UMPkK7jQveuN9Zrk+Y/vURV8KVdNwltDLW7Xs5OjvY3jCaVpth1a1cafDrLc75yoOqz+u9/rtEn6MNwcquANX12f6oCTB75/o1YLpHN6mZSdNmQ/hKY2Vt5ug3L2j9WbnXKW5Ul6laDuC0Breal50F66uj9D3u/OL8JEDfzJ7vM3qQEKZD+Epp2lfMMm502HWm51zlcYX5g5BuMIbzYvuxNX1cfS9u3z23A0JZT6EpzRWvvJwuHNHS7vnOnxl7hAE4V581epAPmU+hKe0DvzDAaCer5S3PRz8jmxIShOktmqkap/deS6t4ktzhyAI9+KLPntUkNyNqF3QtNbMofe8bJdah+1T+6nykvPWPGWE56G5QxCEVNyRAspdBclJKHMj7ripfPQl4Fidq/buwRsz+hPiUOse09whCEKrkFDmg7jrppLWgVALmlsEQfgjJJT5IO66qQBpHfwBqfdY6u+Eqja4SwtLEFLwt7XQ38brLtz1/qboSx+FQrB9G6kaK6m/8/aqDYR/4m+aXX8bry8iKvry2LFjmD
VrFvr164dWrVohPj4enTp1wmOPPYaVK1eiqooSPBKE2kit2iCn2oO3V20g/A9PVDfxJP42Xl+FSSjbv38/0tLS0KVLF2zfvh09evTApEmTMGfOHDz88MPgOA4vvPACEhISMG/ePBLOCEIlXGmsgDqNVa1dlV6pv+Px9qoNhH8hZr7XmjnknCnE9wcvIOdMoeAzoGXkPt+EdmAyXz744IN49tln8c033yAqKkqwXU5ODhYtWoQ333wTzz//vFJ9JAjiT6RWbZBb7YEy7xPeBOt8X5x9Gl/sOe/15j5PVXPxFL7sN8cklJ08eRJBQUEu26WmpiI1NRU1NTWyO0YQRP3Fx1R6nel39horuZourWbe9+XFmZAO63xfsPFkvc94c583Ba54syZb7DPs635zTEIZi0Ampz1BEPVxtPjERAQz/dZeYyVX06XFkia+vjgT0pGjsfXGwBVv1WSLfYaFIsC9UZAWQnaZpY0bN2LWrFn44YcflOgPQRAQdtotLq92+jsd6hY1e40Vr+kSer0I/c4aLZU0Iadmwhmu5rsrvC1wRYnn292IfYb9xW9OlFD21FNPYcaMGZa/v/32WwwcOBDr16/HsGHD8NZbbyneQYLwN1gWH0e40lgNv62poOnR2e+sGZgcj+1T+2HV2J5YNDwFq8b2xPap/dwqkLlzcfYFJ3B/hKWmKgtaNPc5wttqyLp6hjkAL6zORfUNs+Vzf4kAFyWUbd68GX379rX8/dZbb+G1117D3r178fnnn2PJkiWKd5Ag/A1Xiw9PTIStm4CQxioztwC952VjwcZTDo8jVtPl6ULq7lqc+es2YtlOTPziIEYs24ne87JJC+clONPsTk5rw3QMrZn7nKElTbYrWNa4wvJq9Jy70fK8ebPfnBiYfMpeeuklAMD58+fx/fffIycnBxzHYc+ePejcuTNefvllVFZW4vz583j55ZcBADNnzlSv1wThw7AuKjPuuQXGyFCnDrJCPhg8k9PaIqNfa83soFlQe3GuNXNYnH3aZ5zA/ZmByfHon2S0OJLHRoQAOuDy1SrERAShqNxxUJqnAlfkYj9erQa/sD6bReU1lufNW/3mxMIklD3++OMAgKVLl6J///5ISUnBtm3bYDQaMW3aNHAch/Lycrz99tt4/PHHQZWbCEI6rIuKMTLUaXi7MxMBUPfi+WLPeWT0ay2+kx5EzcU5M7cAs9cehanMca5Fb3QC93d4zW5mbgGe+eaQSw2NFs19YvCGai5in82XfsjDlmfv1GQEuNIwmS+bN2+O5s2bo2fPnnjjjTfwyy+/4L///S/uv/9+NGvWDM2bN0d5eTkSExMtfxMEwYa931K35tGKOO3KMfNp2ZdKLadmXqsoJJDx+Irvij8h5FTuCCXMfVp+frSAmEAM/nnbd67Yq/zmpCKq9uWCBQvwyCOPYNy4cejduzdmzZpl+e69997Dvffeq3gHCcKXEQoJv69zPN7fmi8r/YRUM5/WU02okZ7DlVbREd7uu+IvsGiMYyKC8eLgDjAawmSb+7T+/GgB62eYlctXKzEkpQnefbhrvetr9KHrK0ooa9GiBbZt2+bwuw8++ECRDhGEv+As5877W/Mxrm8i1h4qkLz4SDHzeUseIN6pWanFmTW4whpv913xF1g0xoXl1TAawmSb/bzl+dEC/DP8/OpcFLlI9QP89bx5i9+cVEQJZQRBKIOrkHAdgLWHCrDl2Tux71yxpMWne2IMjJEhguY4ex8Mlj55ypfKUdZvJRdnMVovX/Fd8RfcFbVXfcOM51cf0eTzo1UGJsejX/s49Jy7UVTQhTf4zUmFSSh7/fXXMXHiRISFhblsu2vXLly5cgWDBw+W3TmC8FVY/b32nSuWvPhk5ZlQaZXnxxpHZj6t1s9zZQ5Soi9itV6+4LviL7gjai8zt+BPjY9wiUFfqz+pFMGBerx2f0eLKVML1UI8CZOjf15eHpo1a4annnoKP/74I/744w/Ldzdu3MDhw4exZM
kS3H777Rg2bBgaNmyoWocJwlMo6byr9u6dN6OUVDh+SRjCg+qZUjyVB8jZdXVX5n5Wx+N4DeZ8IpyjdrZ7fo6ymOAA8kV0hDflWFMbJk3Zp59+ikOHDmHx4sUYOXIkysrKEBAQgJCQEFRUVAAAunTpgn/+8594/PHHERpKvhaEb6G0866au3cWp/WwoAD0TzK6rU9COLuu/ZOMbjOnOgse4Jmc1gYZ/dr4zY7dV1CzbquUABHyRXSMr/uKsaLjRCYVM5vNOHz4MM6dO4fr168jNjYWKSkpiI2NVauPPkNZWRkMBgNKS0sRGRnp6e4QjAg57/JLhZSdXK2ZQ+952S5z7myf2k/0opRzphAjlu102W7V2J42ZhQ1++QIV9d1Ulpbhwlc7bEfh9w+UeScb6LGvWV91gDlnx/Cvbjr/S3a0V+v1yMlJQUpKSkqdIcgtIVazu9q7t6lmiHV7JM9LNd1+S/5TMdS0hwkZ7fuKBiBXr7aQQ1NjNi550++UYQ0KPqSIJygpvO70mkdeOSYIdXqkz0s11XIH84epc1BUiK7SMPmHSgdtcc692IigvDa/R1pLhAuIaHMD6AdvHTUdn5n2b2LvX+8Y7PUciTu8O1gvV5RYUEovV6j6bIqlJvKf3H1rAFAo4hg5Ey/C8GBTHF1hJ9DQpmPQzt4ebjD+d3Z7l3K/VPCDKl2HiDW6zW6Vwss3HhKdXOqVLSc241QH5Zn7dX7k0kgI5jR9Eypra3FjBkzkJiYiLCwMLRq1Qpz5syxKXjOcRxmzpyJ+Ph4hIWFIS0tDadOnbI5TlFREUaNGoXIyEhERUVhzJgxuHbtmk2bw4cPo0+fPggNDUXTpk0xf/78ev35+uuv0b59e4SGhqJjx47YsGGDOgNXCHelE/Bl1A6nd4ac+6f1EHPW65rRr42mxyGnvijhG2j9WSO8C8mastOnT+PMmTPo27cvwsLCwHEcdDpld4Lz5s3Du+++i08++QS33HIL9u7di9GjR8NgMOA///kPAGD+/Pl4++238cknnyAxMREzZsxAeno68vLyLKk5Ro0ahYKCAmRlZaGmpgajR4/GuHHjsHLlSgB1URUDBgxAWloali5diiNHjuCJJ55AVFQUxo0bBwD45ZdfMGLECMydOxf33HMPVq5ciaFDh2L//v1ITk5WdNxKQDt417CYBd3p/G7fN7n3T8sh5mKuq5bH4ancboS20PIcJbwL0SkxCgsLMWzYMGRnZ0On0+HUqVNo2bIlnnjiCURHR+PNN99UrHP33HMP4uLi8OGHH1o+e/DBBxEWFobPP/8cHMchISEBTz/9NJ555hkAQGlpKeLi4vDxxx9j+PDhOHbsGJKSkrBnzx7ceuutAIDMzEwMGjQIv//+OxISEvDuu+/ihRdegMlkQnBwMABg2rRpWLNmDY4fPw4AGDZsGMrLy7Fu3TpLX3r27ImUlBQsXbrUYf+rqqpQVfVXiZuysjI0bdrULSkxpKZF8BfEmgXdbQb2l/vn7eZ1f7lPBOHvuCslhmjz5eTJkxEYGIjz588jPDzc8vmwYcOQmZmpaOduv/12bNq0CSdP1uUqOnToELZv3467774bAJCfnw+TyYS0tDTLbwwGA3r06IGcnBwAQE5ODqKioiwCGQCkpaVBr9dj165dljZ9+/a1CGQAkJ6ejhMnTqC4uNjSxvo8fBv+PI6YO3cuDAaD5V/Tpk3lXA5R0A5eGClmwYHJ8dg+tR9Wje2JBf/ojBmDO+C5ge1hCAuWldlfCH+5f9bXddHwFKwa2xPbp/bTjEDmqopD98QYRIUHOT1GVHiQx4MRCILwDkSbL//3v//hp59+ws0332zzeZs2bXDu3DnFOgbUaavKysrQvn17BAQEoLa2Fq+++ipGjRoFADCZTACAuLg4m9/FxcVZvjOZTGjcuLHN94GBgYiJibFpk5iYWO8Y/HfR0dEwmUxOz+OI6dOnY8qUKZa/eU2ZO/BEdnZvQI5ZMECvQ+n1asz/6Y
Tqmh1/un9aLS6slBaPDFgEQbAiWlNWXl5uoyHjKSoqQkhIiCKd4vnqq6+wYsUKrFy5Evv378cnn3yC//u//8Mnn3yi6HnUIiQkBJGRkTb/3IUnHdS1jBzHbHcGTtD98yys93p3fpHLfGrFFTXk6E8QBBOihbI+ffrg008/tfyt0+lgNpsxf/583HnnnYp27tlnn8W0adMwfPhwdOzYEY888ggmT56MuXPnAgCMxrrafZcuXbL53aVLlyzfGY1GXL582eb7GzduoKioyKaNo2NYn0OoDf+9Fhl+WzPB/E6A59MJeAKpZkFXGjagTsOmlCmTd4QH6mta/Pn+uQMx95p1Pu04fUUVM7ev4cpc7C3w41i9/3d8uO1XrD7ANh5fGT8hHdHmy/nz5+Ouu+7C3r17UV1djeeeew5Hjx5FUVERduzYoWjnKioqoNfbyo0BAQEwm80AgMTERBiNRmzatMlS9qmsrAy7du3C+PHjAQCpqakoKSnBvn370K1bNwBAdnY2zGYzevToYWnzwgsvoKamBkFBdf4hWVlZaNeuHaKjoy1tNm3ahEmTJln6kpWVhdTUVEXHrASOzC7WKJ2d3ZuQahZUM7O/EO7Krk/YIuZes86nxZtP49v9v9N9c4K3B33wOFt/tRRMRGgT0Zqy5ORknDx5Er1798aQIUNQXl6OBx54AAcOHECrVq0U7dy9996LV199FevXr8fZs2exevVqvPXWW7j//vsB1GnpJk2ahFdeeQVr167FkSNH8OijjyIhIQFDhw4FAHTo0AEDBw7E2LFjsXv3buzYsQMZGRkYPnw4EhISAAAjR45EcHAwxowZg6NHj+LLL7/EokWLbPzBJk6ciMzMTLz55ps4fvw4Zs+ejb179yIjI0PRMctFyOzCMzmtjaYcqd2NVLOgpxzvPeUI7887djH32tV8sobyAwrjKzkVXa2/BQLj8ZXxE/KRlKfMYDDghRdeULov9fjvf/+LGTNm4KmnnsLly5eRkJCAf/3rX5g5c6alzXPPPYfy8nKMGzcOJSUl6N27NzIzMy05ygBgxYoVyMjIwF133QW9Xo8HH3wQb7/9ts14/ve//2HChAno1q0bYmNjMXPmTEuOMqAuEnTlypV48cUX8fzzz6NNmzZYs2aNpnKUOTO7AHUCxxd7fkNGvzbu7JamkJp3zJOO9+52hPf3HbuYe83Ppyc/3++yPeUHdIyv5FR0tf5aYz0eXxk/oQyi85QtX74cDRo0wN///nebz7/++mtUVFTgscceU7SDvoTaeU4oZxI7YgWPWjOH3vOyXdaT3D61n1cvnEJ1HPkR+UOGcrH3OjO3gEkos4aewb/wlXWLdRw8/Hh8Zfy+jrvylInWlM2dOxfvvfdevc8bN26McePGkVDmQfwlt5USiM3A7anM/u7EG3fsYou1syDmXvPXTCz0DP6F2uuWGnPEEWL7x7endbsOd90nrSNaKDt//ny9nF4A0Lx5c5w/f16RThHS8KfcVkog1izo6473nghmkIOaZlbWe+3qmglBz+BfqLluudMUL7Z/fHtat8llwhrRQlnjxo1x+PBhtGjRwubzQ4cOoVEjzy/U/gzvdOzK7EK5raQjpGED6swX3rzL86Ydu5CZlXeMVsLMyqJNFXst6Bmsj1rrljvmiDX8OFwJ6ToAMRHBMJVeR86ZQnRrHu3Rdbv6hhmf5ZzFuaIKNI8JxyOpLRAcKDoGUDLuvk9aR7RQNmLECPznP/9Bw4YN0bdvXwDAli1bMHHiRAwfPlzxDhLs+IOJTQvYa9h8ZZfnLTt2d5pZXWlTxVwLegYdo8a65QlTvPU4nDlqcwAKy6sx+atDAOrWivs6x+P9rfluX7fnbsjDsm35sA6ufnXDMYztk4jpg5IUP5893ugyoTaixeE5c+agR48euOuuuxAWFoawsDAMGDAA/fr1w2uvvaZGHwkR8GYXo8H2ZWE0hPrdjsMd+FIoO0sdx2gN1HGUU5VBacSkxKBnUBil1y
1PzRF+HPEGdmHdVFqJ97fmY1zfRLeu23M35OG9rbYCGQCYOeC9rfmYu0G8r6RYtPQsawXRmrLg4GB8+eWXmDNnDg4dOoSwsDB07NgRzZs3V6N/hATEOrET0vDHXZ6zYG2tOVS7w8zqTMvDM6ZXC6QlGekZdIGS65Yn54j1OEyl11FUXg1DWBBeWpeHq5U36rXn14q1hwqw5dk7se9cserPUPUNM5Zty3faZtm2fDw9oL2qpkwtPctaQVKeMgBo27Yt2rZtq2RfCAXRapFnX8LbHONdwVLHseT6DSzOPo2Jaba57rToUO0uM6tQUIA3mrA9jVLrlqfniPU4MnML8PzqIw4FMh5+rdh3rtgta8VnOWfracjsMXN17cb0aalaPzx9n7QIk1A2ZcoUzJkzBxERETZZ7h3x1ltvKdIxgvAEYrQ9vrbLY+3ngo0n0c7YwCJsiHXUlatR00pAi/043KXlIFzDMkfiIkNg5jh8f/CCavdL6NkQwl1rxbmiCkXbSUUrz7KWYBLKDhw4gJqauh30/v37odM5nrhCnxOENyBW2+Nruzwx/eTNsvz/s5pwldCoaSGgxdk4hqQ0Ue28BBuu5ggHoPKGGaM+2GX5XGnNppgM/zzuWiuax4Qr2k4qripicPC/wBgmY/HmzZsRFRUFAPj555+xefNmh/+ys7PV7CtBqIYUh32pdTS1Cj8eFnizrBgTrpJBEZ4MaPGl4A5fRmiOGP4MZrE31St9/8TksHP3WvFIagu4knP0urp2hHsR5VNWU1ODsLAwHDx4UFM1HwlCDlId9rWgsVESMXUcAXGmFlPpdcz/6YTkoAhHJk9PBLSImSsAKNhGAHcFhdjPkdiIEDz99SEA9X0nlQzOqTVz2HH6iqjfuHOtCA7UY2yfRLy3VdjZf2yfRNXzlbmqiOGLwVKuECWUBQUFoVmzZqitrVWrPwThduQ47Ptalv+ByfGYnNYGCzaectlWjKmlqLxa8jV2ZfJ0ZxAF61xZnH0aX+w5T47/DnB3Xj9rp/ucM4UwlakbnONofM5oFBGMV+9Pdvu84POQ2ecp0+vgtjxlvhYspQSioy9feOEFPP/88/jss88QE+MdZhmCHX+sPybXYV+uxkZr1zyjXxus2v2b4MvL3vmWxVE3pkEI07ntr7HWsn2LCYawx18zlFvj6fupdnCOWMf+mIgg5Ey/y6KRcvdaMH1QEp4e0N5jGf19LVhKCUQLZYsXL8bp06eRkJCA5s2bIyIiwub7/fvZTB+E9vCVzPRiUcJhX2oovxaveYBeh9n31ZllAddmWRYTriEsmOnc1tdYi3ng5Dhi+2ruOla0cD/VDM4R49jPj+61+ztaBCBPrQXBgXpV0144w9eCpZRAtFA2ZMgQirL0QTy9g/UkngrL1vI1F2OWZWlba+ZEX2MtmTZ4DYaprBIRIQEor5LmwuGP5hgeLdxPNZ91MY799s+RltcCNeGriDjLj6iFKiLuRLRQNnv2bBW6QXgSLexgPYknHPa94ZqLMcu6aivlGmvFtCHWR4gFfzLH8Gjhfqr5rLP2O+POVpjcv53lHN6wFngSMSlFfAFmw3F5eTnGjx+PJk2a4KabbsLw4cPxxx9/qNk3wk1Q/TH3p1jwlmvOm2WHpDRBaqtGTl8KrtqKvcZaMG0Ipb+Qiz+ZY3i0cD8B9Z511n73an2TzbPhLWuBGjBVEamo8cmxC8GsKZsxYwY+++wzjBo1CqGhoVi1ahXGjRuH1atXq9k/QiWsHUpPXbrG9Btf3927M8WCFrQGnkDMNfZ0VnYpyT9d4Y8Zynm0lL1djWdd6vi8YS1QKwDBG8bubpiFstWrV2P58uX4+9//DgB49NFH0bNnT9y4cQOBgZJLaBIeQKo5xh929+6qGaoVrYEnYL3Gns7KLsZHyBG+kLtOSbSW10/pZ13q+LS+FqgZgKD1sXsCZvPl77//jl69eln+7tatG4KCgnDx4kVVOkaogxRzjLdlpvcG5FQDqDVzyD
lTiO8PXkDOmULUuqos7MV4Miu7lN05f9+WjPRMtQGt48lKDO5Ayvi0XBlE7eoV3ZpHIyZCODLbH989zCous9mMoKAg2x8HBlIiWS9CijnG33f3aiF1V63FFBpq46ms7GJ359b3bWByPNKT3VttwFvwRCUGdyJ2fFqt/6h2AAK/lhWVVzv83l/fPcxCGcdxuOuuu2xMlRUVFbj33nsRHPyXpEt5yrSLFHOMt2am9wbEVgPw17B5wP1Z2QHXPkL22N83d5nCvRFfvzZix3fgfLHL7939bKuZwoQlya6/vnuYhbJZs2bV+2zIkCGKdoZQFzEh223iGvrcDlaLDEyOR7/2cS4zaquxa3VH9nA1zsE6j50JbiywaDMnpbVFi9hwzT0rWqsSoWU8fa2qb5ixbJtwDUqgrhTS0wPauy3TPqCeEz6LxSYmIghbnr3TrePVCrKEMsK7EBOy7cu7WC3hyBz5wfb8ejtEpXet7jCDqnUO1nk8Z91RhAXpZZ3LG2ub+qOJWypauFaf5ZyFK7dQM1fXzp2Z99Vywmex2BSV12DfuWK/fA/5nxjqx2jZodQbUNrBXowTrZK7VrWdd9U+B58F3BVF5TWiziV0fwcmx2P71H5YNbYnFg1PwaqxPbF9aj+XL21PBGS44976CmpeKzH3/lxRBdMxWdsphav3BQBEhQXBzHGi5jbrWrbj9B9+EcxkD+Wy8CO0FpLuTSi9oxZrjlRq1+qO7OFay1DOci5X91esj5AnNDBau+5aRs1rJfbeN48JZzouazulcPa+4Cm5XoNRH+wSNbdZ17LFm89Y/t+fNL2kKfMzfD0kXQ3U2FGLzeKtlJbTHdnD1T4HSxZwMedS+v56Slvlz5nhxaLWtZJy7x9JbQFXcp9eV9fO3Qi9L+wRM7dZNHByju/tkFDmh0g1x/gjrnbUQN2OWqx6Xaw5kt+1Aqi3mInRcrojg7ba55DyO6HfKH1/1ZovLCh13WvNHHacuoL/++k4/u+nE9hx+orPmY/UmKNS731woB5j+yQ6PfbYPomSnd7lmtH598WKMT0QFebYbUDM3Ha2lgmh9rOjJWSZLysrKxEa6j+Zdn0JXw9JVwq1wsKlmCOVcDp3RwZttc8h5XdCv1H6/qqZRsAVSlz3zNwCTPvuiI0mcvHm04gKD8LrD3T0mY2bGnNUzr2fPqhOSFm2Ld/G6V+vqxPI+O/FopQZPUCvg16vQ8l1YQ21mLkttJY5Q81nR0uIFsrMZjNeffVVLF26FJcuXcLJkyfRsmVLzJgxAy1atMCYMWPU6CdBeAQx6RdyzhQyh9XzzurOzHBR4UH1zJFyE2+6o/6g2ucQkz/M1bnkaEwcpVLwZC0/udc9M7dAMIFpSUUNnvx8P5b6gItDrZmDmeMQFRYkKGRImaNy7/30QUl4ekB7l+lxWFEyr2GtmcOO038wtWW9DvZr2alL17B482nFju+tiBbKXnnlFXzyySeYP38+xo4da/k8OTkZCxcuJKGM8CliG4QwtXv5h6MothKwlHBMFRKz5Gg53RHsofY5WByQWc8lVWMipIEYflszScdTAjnXvdbMYfbaoy7P4e2BAix1f6XOUSW0b8GBekXSXigZyCC2VrKYuW2fFJpFKPP1OpiiRfBPP/0U77//PkaNGoWAgADL5507d8bx48cV7RxBeBxG94VikTUYWZzViytqVHHKdkewh7NzvDOyCwxhwbLC3VkckFnGIyWAwpkz98KNJxEVHuSxtDNS7+3u/CKYyqpcHt+bAwVY6/5KfQ60lHJIqUAGMbWS5Y5PS9fPk4jWlF24cAGtW7eu97nZbEZNDVtEFEF4C1fKXb+oHOFqN+pJMxfgnvqDjs5RXF6NOesd+7iI7Y+jmpjQAVeuVdX7vVDWdrHaJRYNhM7q/z2RdkbKvRUzz7RkPmLNxs+SRT4qPAjvjOiKnq0aSbo/Wko5pMT6IqZWstoacH9K2SRaKEtKSsK2bdvQvHlzm8+/+eYbdOnSRb
GOEYQWkKMqd+aY6g6He1e4I9jD+hyZuQWYsNKxj8uTn++v52PHYgJmGYMrZ2cxARQsGojiihpMTmuDL/b85rEqAGLvrZh5phXzkRgndpYs8iUVNdD/KahLRSsVIJRYX8TUSlZqfFq5fp5EtFA2c+ZMPPbYY7hw4QLMZjO+++47nDhxAp9++inWrVunRh8JL8LTdeSUhsUh3xWOdqNqOcNr9fqzpAuwv8ZKFFpndXYW0i7Vmjl8uO1Xi+O1UEoAe1rERmD71H6avBeO6J4YA2NkiEsTplbMR0L3teBPAX/JyK4Y1OmvOeNOzbQ7tNCukLK+2K8drLVjM+5sjcn92yo2Pi1cP08iWigbMmQIfvjhB7z88suIiIjAzJkz0bVrV/zwww/o37+/Gn0kvAQt1JHTIo52o2qo6rV8/cXsunnkZlYX6+xsr12auyGvXooCHWMXGjcM9aq0MwF6HWbfd4tg9CWPFsxHLGa1jFX7sRhdMKhTAgD3a6Y9fe/Fri+O1o6YCLYNSK/WsYrPCU9fP08iKda2T58+yMrKwuXLl1FRUYHt27djwIABSveN8CJ8teaemOzx9rhyTFXS4V7r11+qBkJOFno5zs5zN+Thva359QpFcy4cbLzZGXlgcjyWPtzVYV3R6PAgzaTDYBHwzRzw1MoDlnnvj07krOuL0NpRVO583fPFa6YFqPYlIRtfqrknVYVvD6u2SwlVvTdcf7kaCCWz+Ltqd726Fu9vyxd9Pl9wRubn484zhcj59QqAOo1Fz5bSnN/VQMxcsJ73/uhE7mp9YXXm96dr5mmYNGXR0dGIiYlh+qc0Fy5cwMMPP4xGjRohLCwMHTt2xN69ey3fcxyHmTNnIj4+HmFhYUhLS8OpU6dsjlFUVIRRo0YhMjISUVFRGDNmDK5du2bT5vDhw+jTpw9CQ0PRtGlTzJ8/v15fvv76a7Rv3x6hoaHo2LEjNmzYoPh4vRFfqbmXmVuA3vOyMWLZTkz84iBGLNuJOetc524CgJiIYJu/xWi7eFX9kJQmSJUQ+eUN119KvTtrlMzi76xdZm4Bur+W5VIjBgANQwNs/vaV+rEBeh16tYnFM+nt8Ux6O1XMU3IQMxes572/1v11tr6wuhVEy1jfCHEwacoWLlyocjccU1xcjF69euHOO+/Ejz/+iJtuugmnTp1CdHS0pc38+fPx9ttv45NPPkFiYiJmzJiB9PR05OXlWUpAjRo1CgUFBcjKykJNTQ1Gjx6NcePGYeXKlQCAsrIyDBgwAGlpaVi6dCmOHDmCJ554AlFRURg3bhwA4JdffsGIESMwd+5c3HPPPVi5ciWGDh2K/fv3Izk52f0XR0N4Or2DEgg5DrOo8I2GUGx59k7sO1fsEcdUb7j+rAlfHSHVRCLW2VloDggxNOVmDOoY75fOyEojJkCFv6+sPorW895dTuRaDbixh3VNmDG4A4yGMM2PxxdgEsoee+wxtfvhkHnz5qFp06ZYvny55bPExL8Kt3Ich4ULF+LFF1/EkCFDANQlt42Li8OaNWswfPhwHDt2DJmZmdizZw9uvfVWAMB///tfDBo0CP/3f/+HhIQErFixAtXV1fjoo48QHByMW265BQcPHsRbb71lEcoWLVqEgQMH4tlnnwUAzJkzB1lZWVi8eDGWLl3qrkuiSbSQ3kEOSqjwgwP1HnNMZb2uZ69UqNwT50ipdwdIN5HwgqCQ8zpndWwxOZl4WjQK91tnZCURG6Di6r7aY/98qO1EruWAG3tY1w6jIYzmupuQVlTrTyorK1FWVmbzT0nWrl2LW2+9FX//+9/RuHFjdOnSBcuWLbN8n5+fD5PJhLS0NMtnBoMBPXr0QE5ODgAgJycHUVFRFoEMANLS0qDX67Fr1y5Lm759+yI4+C8VbXp6Ok6cOIHi4mJLG+vz8G348ziiqqpK1eujFbzdidbbVfispsGFG0963OF/YHI8tk/thxX/7OEyvYReBywZ6Z7rKzY6VK8DHkltoV6H/A
SpASoDk+OxZGRXOJPVPbHuaD3gxh5vX7t9EdFCWXl5OTIyMtC4cWNEREQgOjra5p+S/Prrr3j33XfRpk0b/PTTTxg/fjz+85//4JNPPgEAmEwmAEBcXJzN7+Li4izfmUwmNG7c2Ob7wMBAxMTE2LRxdAzrcwi14b93xNy5c2EwGCz/mjZtKmr83gK/cwXq12v0BodQMSr8VWN7YtHwFKwa2xPbp/bzuEAG/HX9WbQ8L/2QJ6mskZIE6HXo1ToWrz/Y0ZIB3xGLR3SxyTUlFl77JQQfAFFr5kSbdsf2SZRcKJqogyV3nbP5OqhTPBaPcJyw3BPrjqvxcACmfXcEO05d8fgzyOPta7cvInpVee6555CdnY13330XISEh+OCDD/DSSy8hISEBn376qaKdM5vN6Nq1K1577TV06dIF48aNw9ixY73GXDh9+nSUlpZa/v3222+e7pJqeLMTrVgVvlSHfDUZmByPyWltnLbRgsO/NUJzJt4QiqUPd7XkmJKKmAAI1jmg1wH/6puI6YOSZPWNUCZAZVCnBCx9uCviNbDusFYNGPXhLvSel60ZrZk3r92+iOiUGD/88AM+/fRT3HHHHRg9ejT69OmD1q1bo3nz5lixYgVGjRqlWOfi4+ORlGS7+HXo0AHffvstAMBoNAIALl26hPj4vybOpUuXkJKSYmlz+fJlm2PcuHEDRUVFlt8bjUZcunTJpg3/t6s2/PeOCAkJQUhICNNY3YHazqfemolZrez6SiDmnrWIjWA6ppYCLtScM2ICIO7plOB0DgB10Za7n++PsOAAgRaEGJQKUGFJ+1BXcL0SRdeqEBMRDKMhTPG1ScxzpUS1CiWxvoam0usoKq9GTIMQGMKCLVo9b1vXvRXRQllRURFatmwJAIiMjERRUd0upnfv3hg/fryinevVqxdOnDhh89nJkyctdTcTExNhNBqxadMmixBWVlaGXbt2WfqSmpqKkpIS7Nu3D926dQMAZGdnw2w2o0ePHpY2L7zwAmpqahAUVOfnkpWVhXbt2llMsqmpqdi0aRMmTZpk6UtWVhZSU1MVHbNauMv51BszMWs1h5HYexbbgG0DkP9HuWJ9VAK15oyYABSWOfDGQ51JIFMQ1sATlvsoNIccPUM8Sq9/YgKZtJI/0JoAvQ6l16sx/6cTNteLTyYsti4tIQ3R5suWLVsiP78usWL79u3x1VdfAajToEVFRSnaucmTJ2Pnzp147bXXcPr0aaxcuRLvv/8+JkyYAADQ6XSYNGkSXnnlFaxduxZHjhzBo48+ioSEBAwdOhRAnWZt4MCBGDt2LHbv3o0dO3YgIyMDw4cPR0JCnXlk5MiRCA4OxpgxY3D06FF8+eWXWLRoEaZMmWLpy8SJE5GZmYk333wTx48fx+zZs7F3715kZGQoOmY18DbnU0+gNRW+pHvG6KaycNMpv7jnYp2YtTYHfJnM3AIs3HjSZbvo8CDJGmqhZ4inQOH1T2wuPq25Ewhdr5KKGsG6tP6wjrgbHcexpEn8iwULFiAgIAD/+c9/sHHjRtx7773gOA41NTV46623MHHiREU7uG7dOkyfPh2nTp1CYmIipkyZgrFjx1q+5zgOs2bNwvvvv4+SkhL07t0bS5YsQdu2bS1tioqKkJGRgR9++AF6vR4PPvgg3n77bTRo0MDS5vDhw5gwYQL27NmD2NhY/Pvf/8bUqVNt+vL111/jxRdfxNmzZ9GmTRvMnz8fgwYNYh5LWVkZDAYDSktLERkZKeOqsFNr5tB7XrbgwsSb5rZP7aeJ3ZqnYTEXOmujhIlY7D3jz/ljbgE+zTnHdI54Be+5lnMy8S8awLH2y5GwpeXx+AKu5rc1UeFB2Pdif8WfIWuUfBaE5pszFg1PwZCUJrLPLQcx14vH394d7np/ixbK7Dl79iz279+P1q1bo1OnTkr1yyfxhFCWc6YQI5btdNlu1dieXmd29ATOTIoAFDERi7lnpderRef9sv693HvuDTmZvKGP/g
Tr/OaRMk/dcQ4hHBWyd9e5pSL2elmjhf67A3e9v2XXvmzRogVatGihQFcINfCGbO/eglDGd1NppWAiSykOvaz3IivPhOU7zopKeCrlPEIIXY8CNzoxs2i1KIu7ttiYJ5xGyBFq1j2Vcw5HZOYW4P2t+UzPpCcDiOyRM356dygLs1CWk5ODwsJC3HPPPZbPPv30U8yaNQvl5eUYOnQo/vvf/2oq2pDw/mz7WoElp5IjpDj0st6LNQcvShbIxJzHEa4y4HNQ34lZjAaMsrhrg1ozh9UHL4j6jZp1T+Wcwx4xVSG0lgNMzvjp3aEszI7+L7/8Mo4e/as485EjRzBmzBikpaVh2rRp+OGHHzB37lxVOklIhzI2K4PYjO/WiHXoZblnMRFBKCqvltQfJe45y/VQ04nZXcErtWYOOWcK8f3BC8g5U+gw6ScF0rCzO7/IZT1Za+TWPWURd5Ra/8SsEVoLHhEbpADQu0MtmIWygwcP4q677rL8/cUXX6BHjx5YtmwZpkyZgrffftsSiUloB8rYrAxKqOhZj8Fyz+6X4RhsXfNRKqbS64q2E4PcTPCsZOYWoPe8bIxYthMTvziIEct21kv66a6++ApiniMd5Nc9VfMc9rCOLePO1pqpBsLjbM0RQol1hKgPs1BWXFxsU2Zoy5YtuPvuuy1/33bbbT6dsd6boVB/+SihohdzDFf3LC1JOGmxO2DV0knV5jlDiUzwrmDVfrmjL74E6zMQExEke23inyH7bP888Qqvf6xj69U6VpOCjNCaQ7gXZp+yuLg45Ofno2nTpqiursb+/fvx0ksvWb6/evWqJfEqoS1qzRwMYcF4bmB7VTNa+zKusv47Q6pDrzMH9VozJ7k/ADB77VEmfy8h5/UYxkS1rO3EnJ9VI/HVnvP4MbcAzWPC8UhqC+ZaldU3zHh+da6g9svaR5ACacTB8hw1ighGzvS7FKktapOpXuWM/ixjiwoLgpnjUGvmNLn22l+vOeuOCpqbtZb81ldgFsoGDRqEadOmYd68eVizZg3Cw8PRp08fy/eHDx9Gq1atVOkkIR1nDsj0ILHjKuM75+D/+b8BNjW/kADkyEHdWX9YMJVVYXH2aUx0Ui/T2dwxRjLWC2VsJ+b8w29ryvT71QcvWv7/1Q3HMLaP65qVmbkFeH71Ead+T1JqZpIzdB0slRNevT9Z0WLv7qoywo9NKBIbAEqu12DUB7s0HQTCX6+cM4XMz4E/pMRwF8wzf86cOQgMDMTf/vY3LFu2DMuWLUNwcLDl+48++ggDBgxQpZOENMgBWVmcmRSXPtwVS2WYiFn8l1j7w5dFccWCjScFj+9q7hSXVwmahXiEnIDlOs8v2HgKESLLHZk54L2t+Zi7IU+wDX9OVkf0y1crKZBGAlpxp2CZh2rhDWswaYE9g+jksaWlpWjQoAECAmwXxaKiIjRo0MBGUCNscWfyWMrkrx5KZ/QXyvflLOu8s/6YzRxGfbiLaSyOspmzzp0Zg5MwYaW4bPksqSNYsotL0Q4CgF4HHJ9zdz1NjJSM5nzSTClVAwjP5nVTI4WJ2Dmk9TWYEo/b4q73t2gdscFgqCeQAUBMTAwJZBqCHJDVg1fvD0lpgtRWjWwWVGffOUKJ6D37c/Zs1cilFovH0RxgnTvREcGiNB5KOc/zfZCCmQM+yzlb73Mx6QyoZqYyiH1WlEItC4LYtDlaX4O7J8a41LpHyahNSjhGdkZ/QpuQ6tk7ECM8s+5GA/Q6zBjcAU+tPMDU3n4OiJk7Q1KaOHSkNoQF2zgzuxI+pTjPS+VcUYXDsYjB3kfQXVUDCHmImYdi753UeevpNViOxlLt2e2PVTJIKPNRyAHZO1BDeM7MLcCc9ceY25+6dBU5ZwotC97ZK+VMv+PnToBeh9Lr1ZifeVzQHCRG+FR7TjaPCa/3Ges5G0UE49X7kx1qv9zlUE5IR41NEI/UeevJNdiZGdcQFoySCuf+lcUVNao5+vtrlQwSynwUXvXs7K
Ei1bPnUVp4FvJPc8bizWewePMZxBtCcV/neLy3Nd9pe/sUH85qgvI1MKtumJn6cvlqJe7plOAytYBeB3CceDOmXgc8ktrC5rNaMwezmUNUWBBKrgs/LzERQYqlapCLt2oQ+H6rnZ5CCDUtCGLT5ni69qWr5/aJXi2YjqOGpo9lTfFVwYyEMj9G+0u47+NqIRezcIupvecIU2mlS4GMhzffsZqD/u+hzkzHbdwwlCltwtg+iXh/a75oh/+xfRJthCpHu3F7+HO+dn9HTQhk3qpBcHat3dV/NS0ILCkxeDxdTYXluWWtUaq0pk9NE7M34PkVhlCF3flFzKpnwnMoWQZLTn1OgF24mZTW1vLyZDUHQQdRqSNcOc9PH5Tk8Pt4Qyj6JzWG/eXS64B/9bXNUybk8G2Plhz2vTXNjatrXeCm/mslhYmn5xTLc1tUXoOYiGC3Xyt/D1IjTZmPwqpS/vHPRdBbzB++CC+A2GsRjCK1B+5yGG4R+5dPFus5r1yrcqn9Eus87+z76htmfJZzFueKKhxm9GfRKkaFBeGdUV3Rs6VwZKCr8yiJt2oQWDW4HJTtv5CJV+w8FHO+l34QzoMHsM0pd8D63A5NScDyHWcVv1ZK9M3TARJqQUKZj8KqUv405xw+zTnnFeYPX0aJ6D13OQxbn0eMOSi1VSPRwqcr53mh74MD9RjTp6Xg71i0iiXXa6DX6QTvwdwNeVi2LR/W2UpYKwdIQU0ndTXghaIdp/9g1uAq1X9XJl4lNkH2KDGnlMSZ3yHrc9s/yYjuiTGKXytn+HuQGgllPopYp1N/cKCUi9rO1XKj91j802IiglEoo0i4fXCIWJ84raSOELsbt7/32ccvYdm2+v53fOUAAIoLZt6kQWDx1RNCbv9dOYm/M7ILoiNCFK8FrKX740ooFRMIFqDXufWZVdLP1hshocxHEVsbUcvmDy3gDc7VLKaZOUOSMWd9nuRC5vazwpVzM4f6Jg4tpI4QsxuXImAs25aPpwe0V9SU6S0aBCkRwNbI6T9LMuaMVQdstJtK1QLWyv1hiVzsn2R0eRzrq+HOZ1ZNE7M3QI7+PoyQs7QQvu5AKRVvcq525SA/qFO8YGABC74SHMLq8F1cXs0UDGCPUOUAOWjFSd0ZciOA5fafxYRoXxxDqedYC/eHtULIzjOFmg4E8+cqGaQp83GszUU/5hbg05xzLn+jBfOHVmBZ5KZ9dwQNQ4LQU2KpGCXNorVmDoawYKemGSGfGlfmDB7r+eHKuVmM9tWdubdYduMzBnfAnPXSBQxHlQPk4A0aBDkRwDrI77+UtUspK4H1/RFC7fvD6neY8+sVpuN58l2gFVcHd0NCmR9grXpmEco8bf7QEkzOuxU1GPXhLknmTCXNos6OZb+QOVrwWAuZn73yl7ChlPO5J8zDrhy+DWHBslKMOKocIBe1nNSVQupLXKl7bT03xaBUkMTA5HiM65tYLwBEr6vLkaf2/WG//myCjaffBVpwdXA3JJT5CCxaBlYHSrOZw/cHL/jNzsQZYl4yjoIlnN0XJbNWCx2roLQST36+H0tGdsGgTgk239kveLVmjik4ZOHGk2hnbICByfGKODe76vvktLbI6NfakqxWyZ2ztXBqX7vTVCZdIHNUOUAptKxBYH2JT7izFWLCgxXN6L/hcAEWbDwp6xim0uuyfp+ZW4D3t+bXm8vcnwEgNbWcJaJRjfvFev1TWzXCt/t/17QzvbdWrZALCWU+AKuWwZX5gwNwvabWRluiNWd2dyNmp2hvBsnKMwnel/5JRsVyTrH48WSsOoDF0GFQJ+H7KCYjOd+32AYhLtsCEGzH0vcFG09i1e5zGJKSgLWHChTXpgnV7oyJCJJ8TPvKAUqjVQ0C68ZvSv92ir5gNxy+iIxVB2QfZ876YwgLDpA0n1hcHT7acRYf7Tir2rrKev17tmykaVO4NwRWqQU5+ns5Yp3QhRwoo8LrXkD2PkVadGZ3J66cd+3hzSCLs087vS
+Ls08rlrWa1bn5qZXK3EebvrE6XAm0Y/VBMpVV4b2t+aoEWwg9Q0Xlrv3rhOjSLFryb70ZJStUsJKZW4CnVh6o58AvBT6wQ8p8EuNPp9a6Kub6a9WZ3psCq9SAhDIvhjXSptZutRqYHI/tU/th1dieWDQ8BSv+2QMhArt6Z8fxB6wXOTEs31HfhAH8dT2X/8JWY5LFPCjGzObsPrJkJLfv25XyKqa2Qu3kOhJzf/6b9u0R7Dh9RfQclRst6Ahey+mPzwvg3si56htmPL86V7HjyVnvxDyHYs9Ta+aQc6YQ3x+8gJwzhU5/I+b6278LVo3tie1T+3lEIKs1c9hx6gqmfXtE9DvNlyDzpRcjx8na2vyRc6YQpjLhl6vWMoW7m4HJ8ZiU1laUv0rJdWEtC4f6GkkhXJlPM3MLMGfdUeZ+ObuPYiPnxJh2hdoq5Uhccr0Goz4QH2wht16oI/z9eQHc4/eWmVuA51cfkaXRdITU+1d0jW2DIvY8Ukx5Yq6/FkzhrLkA/eHZIqHMi1Eqg7SWMlFrFet6j66ICAlAeVWty3ZRYUEovV4j2dFWapJOofso5v5a51uSk31bbOUJV4gNklBzTvvz8wKo+7KXm6CWBbH3LyYiWPZ57J3bi8urMGHlAUnBQGpefyWd8KXcy415JhLKCO2hVAZprWSi1jJixs4ikAHA6F4tsHDjKdGOtrVmDjt/LRRU87tCCa2Vdd/EZvS3hiW3kxisTRwsQRJqzml/fl7URA2TsyPE3j+jIUzSeWIj6oJgHGmL9DrH7pierMCipBO+1Hu5+uAFPD/YN7P6k0+ZF6NUBmktZKLWOmId/p3BX8+Mfm1E+95k5hag97xsjPpgl1MTqbPzutJaORujXgcsGdlFUZ8Tiw9MpHJCDGuQBF8DUEnoeVEXNUzO1ki9f/zzI+WEQs7tzlynPFGBRWknfKn3sqjcNyqLOIKEMi9GqUgnluMMv60Z1h2+6NLJ1FeR6vBvj6MIKFZHW6EFUcp5HeFsHvA8dnsLREeEWOYAa0Z/V3NmYHI8dkzrh8lpbZy2E4OcPGNS0UI6AW9CjAM7337H6T9U64+c+8c/P2Lv+uWrVbI0f+4yk0sNLHOGnL77qnsAmS+9HKUyfAsdJzw4ANDBxsndX/LF2CPF4d8eR/eFxfdDrsmGdT4IzQO9rm7XvnzHWSy3yrPkKuu9GMfcAL0OE9Paop2xoegC4I5gcbzenV/EHHTBglYy63sDYs1gYgvDN4oIxqOpLUQ9r3Lvn9Dz44yia1Wy5rq7zORKVe+wRk7ffdU9gIQyH0CpSCf+OIuzT2P5jnyUXK9BeXV9/ygpGed9gVozh5paNn8xR2Tc2QqTJSbNlKLmj4kIwox7boExUtx8sJ5PG/NM+HDHWcEizqN7tWA6pphdLX/+BVknsXjzaebf2cPieK3UbvuJXi1UzdTua4itZiHWGTwmIgg50+9CgF6Hj3b8itLrN1z+5qk7WuHpAfKT2g5Mjsff2jbGK+uP4ut9F1B9w+ywHR8EIzVAQK2s+0JO/GoEhEkJ9NFCtQE1IaHMR1Aq0iYrz4SFG086fUA86WTqKcTu0h3Rq/VNkq+VGFMcf4bX7u8oWWgO0OvQPTEGU7466PB7fg58tfd3puOJ3dUG6HXo1TpWllDG4nit1G77x1wTXvBRx2OlcWUGs19bpGiJzRyQffwSAOAGozntiz2/odPNBtkbzbkb8urVvnQEHwRjCJMmlAHqJOIV0l6qERDmrMqMI/zBPYB8yggLYhY/TziZego5vlw8el1dtnCpiMmBpFSSThZzxbUq1xqI6PAgSbtaOcEVUYznVCqAw1+eBSUQYwZjae+I0ooaPPn5fjz5+X7maGg52fx55m7Iw3tbXQtk1kidg+P6Klvg3JUTf3F5lSoBYc6qzNgH4Xi62oA7IE0ZAaBOIPt4R/0yNq7wVWdLHqXC780cMGHlfryrl7agsJo4hnaOx/y/p9Sruyglr5BS91bqtR
O7i7aG9QUn5xz2yC1m7S+INYNJmYdS7qNcC0D1DTOWbWOr1AG7c0mZg2sPFeC5gR1ka4xqzRx2nhFOscNflznrj2HG4CRMWKl8vUxHLjjdmkdjT34Rcn69AqDOEtSzZSOf1ZDxkKaMsKRZmLP+mOjf+qqzJY+YXbohLBA6F+uF1BIhrDmQ1hwqwN/e2Gyz2+fv74hlOzHxi4MYsWwnes/LdqkRUCzbfoX08HWhXbQrikWcU+gcjUT6+sxZf8zn6/IpgVgzmDvXGDkWgM9y6vtesp5LyjxXQjtrSbHzofMUO3xfoyOCVSuhxbvgDElpgtLr1fjbG5sx6sNdWLz5DBZvPo1nvj6ErDyT5ON7C6Qp83OkZsb2dWdLHtZdesadrZHaqhFGfbBLsI1QdBKLFos3cTAV77Zylgbg8P4WlFbiyc/3Y3JaG2T0a+Nw96lktn05Wjf7XfSpS1exePMZl7/bcfoKs2bQ+hym0usoKq9GdHgwXtmQx1zGhzd/KWFeUTJjutZgmctR4UEwmznUmjnFqz6wIGW+niuqkHUu6zn4Y24BPs05x/xbKUhZ+y9frcSQlCaqltASGwTia3iVpuz111+HTqfDpEmTLJ9VVlZiwoQJaNSoERo0aIAHH3wQly5dsvnd+fPnMXjwYISHh6Nx48Z49tlnceOGrS/Mzz//jK5duyIkJAStW7fGxx9/XO/877zzDlq0aIHQ0FD06NEDu3fvVmOYbkOuac6XnS15WHfpvVrH4gqj35f1QsqqxRKTA8k6Z9DstUed3t8FG0+h1+uOtWb8OZV4EcrVdljvonu1vonpN4s3nxalGQzQ61B6vRrzfzqBOeuPYcrXh0TVVVSqYLJUzaa3EKDX4b7Ozl+qJRU1GPXhLvSel42sPJPL/HlKI2W+No9hL8UmdC5+nt/NKHRIfa6krv2xDeqqD1g/j6mtlDMpqpELzdvwGqFsz549eO+999CpUyebzydPnowffvgBX3/9NbZs2YKLFy/igQcesHxfW1uLwYMHo7q6Gr/88gs++eQTfPzxx5g5c6alTX5+PgYPHow777wTBw8exKRJk/DPf/4TP/30k6XNl19+iSlTpmDWrFnYv38/OnfujPT0dFy+fFn9wauE1GzK8X7gbMkjptqBWLOM2OzYvImDJWs4r5VzVmjecr4y4WzcZgUWv0YRwTCVXlcs8bAUx2iWjONKBHTw1/3jHfmSxqp0xnQtkplbgPe3svle8eMG4NBspvSeUE41hkdSW4iak87OJafKCktCXslVEVSWhcQGgfgiXiGUXbt2DaNGjcKyZcsQHR1t+by0tBQffvgh3nrrLfTr1w/dunXD8uXL8csvv2Dnzp0AgP/973/Iy8vD559/jpSUFNx9992YM2cO3nnnHVRX10XDLV26FImJiXjzzTfRoUMHZGRk4KGHHsKCBQss53rrrbcwduxYjB49GklJSVi6dCnCw8Px0UcfufdiKIgY1XejiGA80auF04zzvoiYqgliFlKpO0K+AkDGna2kDcgJ9ufbcLgAGasOyD5uYXk1Jn91SDGND0vlAXtc7bKVrqc4Z/0x0WP1By2B2OtsX8vUvvrF28NSFO+jVAtAgF5Xl2ybAVeO8VKrtbBqWaWaPa+Us0eBS0GNXGjehlcIZRMmTMDgwYORlpZm8/m+fftQU1Nj83n79u3RrFkz5OTkAABycnLQsWNHxMXFWdqkp6ejrKwMR48etbSxP3Z6errlGNXV1di3b59NG71ej7S0NEsbR1RVVaGsrMzmn5Zg1ezMGNwBu19Iw8x7b1FUVe0tCDnh2ju3illI5ewI63J4sZnwWLE/X2ZuAZ5auV+U4zILSml8pDhGO7umatRTFDtWf9ASSLnO1uO2N5s1UjgIYFJaW8kbzt35RQ6TbTuCxTGedd3hEaNllWr2VDPootbM4cpVNqHPlwPMNO/o/8UXX2D//v3Ys2dPve9MJhOCg4MRFRVl83lcXBxMJpOljbVAxn/Pf+
esTVlZGa5fv47i4mLU1tY6bHP8+HHBvs+dOxcvvfQS20A9gCsHWt6Z//FeiaoLYlp3bGatmsBa9krujpD13nEch0tlVcyaictXK13Ws2QhIiTAYX4oJRMPSw0AcHRN1aiTyV/zad8dQcOQIPR0saHxBy2B0rUOlb5vLWKl+YUB7H2ZcEcrTGGsHMC67ohNyCs2eELtwC7W5Nz+EGCmaaHst99+w8SJE5GVlYXQUO+TjKdPn44pU6ZY/i4rK0PTpk092CNbnOVocmfmZLE18DwFa9UEloVUbnZs1nsHAE/+6ZPDej4ltEbOEnZKqZEnhPU9yTlTyCSUObqmYpLzioV3Wnc1p9XImK41lK51qPR9k9M/1r7ERASLWlNZ1h2xdSn59YN1beCrD6jxLmCNAvWHbP6Axs2X+/btw+XLl9G1a1cEBgYiMDAQW7Zswdtvv43AwEDExcWhuroaJSUlNr+7dOkSjEYjAMBoNNaLxuT/dtUmMjISYWFhiI2NRUBAgMM2/DEcERISgsjISJt/WkOsilxpvMmxmcWBlodfSO/plAAAWHf4os1vWP3PujWPFjyn0veOz4IvR5uh+/M4LEg5j7N70K15tEvHb72urp09UusPisHVnJbj3O0tSAnScDZuJe+bMTJE1rVl7UthebXifoFStKwDk+PxBGPt2id6tVDlXSDGxzAuMgST0tqg6oZZsaAhLaJpTdldd92FI0eO2Hw2evRotG/fHlOnTkXTpk0RFBSETZs24cEHHwQAnDhxAufPn0dqaioAIDU1Fa+++iouX76Mxo0bAwCysrIQGRmJpKQkS5sNGzbYnCcrK8tyjODgYHTr1g2bNm3C0KFDAQBmsxmbNm1CRkaGauN3F0oVNBeLWJW7J5GizXP1G1earvs6x+Nvb2x2ek5n906sGZI/b0y49ALJADD69kQs2HjSZXuxWglX13PfuWKXPnBmDth3rrie5oE1Oa8cXM1prWiu1UTpWodK3rcR3ZvJurasfVny8xmsPnBBUUuAVC1r/yQjPtpx1uXv+icJKx/kwKqVf6hrE2w/XYgFG09ZPtOiNUUJNK0pa9iwIZKTk23+RUREoFGjRkhOTobBYMCYMWMwZcoUbN68Gfv27cPo0aORmpqKnj17AgAGDBiApKQkPPLIIzh06BB++uknvPjii5gwYQJCQupyrjz55JP49ddf8dxzz+H48eNYsmQJvvrqK0yePNnSlylTpmDZsmX45JNPcOzYMYwfPx7l5eUYPXq0R66N0qiVd8YZ3uLYLEWbx/IbZ5qucX0T8f7W+mWvHJ1T6N6JNUPyWfCPm64y/8a+3+8+3BUZ/VorrvERup4FVtdDjk8Wr8FRG1dz2tOaa3cgNMZoCbUOlbxvLWIjZP2+e2IMIhijL5W2BEjVsnpaO8v6zH6z/0I9nz0tWlOUQNOaMhYWLFgAvV6PBx98EFVVVUhPT8eSJUss3wcEBGDdunUYP348UlNTERERgcceewwvv/yypU1iYiLWr1+PyZMnY9GiRbj55pvxwQcfID093dJm2LBh+OOPPzBz5kyYTCakpKQgMzOznvO/N+LKyV4JJ3xHx/AGx2Yp2jwxvxGq+fa3NzbL1iBKuW6mskrsOVvI3D4mIggz7rkFxkjbeeFMG8IBGH4bu2+lKxMHB2Dat0cw4c7WTMdzpFWw1uC4wyji7N54SnOtBKxrhf0YYyNCAB1w+WoViq5VISYiuO4+6YAr16qQc6bQ4bHE+kY5IzYiRNbva80cKhijL5W2BEjVsnpaOyvnmmvNmqIUOo7jfNMwq0HKyspgMBhQWlqqGf8yVyYhR98bI0MxonsztIgNZ3phCJ1j+G1NbdTRQqwa21O2Q7hUcs4UYsSynS7bWfdRym/kntMRrMexJiYiSFQWe2f9cBVRxWp+EDMOvQ5OTZjxhlBsn9pPcL466jOvvSmpEHddnOHJOa0WUgN2WK+5s2Mt2niSaS1xxop/9kCv1rGSf//htl8l1Q
9Wci7M3ZCHZdvybZ4BvQ4Y2ycR0wclCf5OiWArKZv3HaevOC1Nx8qqsT3RPTFG1Y2Mu97fXq8pI6TjqsYYb0Kr931ZpY3PkLOH19k5Fmw8hajwIJRW1DhN6+BJx2Yp2jzW3/z4p9rdfvFQSoPYPTEGxsgQpqz+PGIFMgC4WFKXrd9+MeS1IYuzTzv0MWOtZWcqvc7cF1c+Zfd1jmeugWk9HgCWuphXrlXjrayTuF7DphWxRgtzWg2k1isU+p0jAdjZsZpJLHFkDWuZNCHk1r6UC18pwf5amjng/a356NIs2uE9qDVzMIQF47n0digqr0ZMg5B6mm+Wc0sR6uRec56NeSZM+eqg5iP4WSChzE9hyR6+bFv9B9wRQoslixmPR6uOzVIcaFkd5T/NOYdPc87VWzxYf++qXVaeCZU3zC6Pw+Jw7YxnvjkEa327/Xi+2HPe4e9YzQ9F5dUyemfL2kMFeG5gB6dzSigFgfVnFdU3JGlm1Ewt4CmkBuxIye7v6FiZuQWSNFT2yDVfKlH7Uios19LRPXAmTIkRyKQWEFcqvcuHDoIVvLWAuaYd/Qn1YHECZ404FioBw+LIX1JRg0lpbTXr2CzFEfa4SVzlhoLSSjz5+X5sOHxR1O+dteMXShaTm9y0AvYOEH+Np0CRYI6YBvJelvZ9UyJwpE3jhgr0xjeQeo/lZvcH/prnigjuMuXkR1JbQCfiGEo60Uu5B0qkI5JbGoxlfXUmGzr73ltLk5FQ5qco7Tzv6KFnPUeL2PB6Ne20Ul9TSg2634rZzW3WZKw6gA2HC5h/L9SOZdccExGMBf/ojFVje+LFwR0k9NY1Gav2439H2SKj+EoCjnKQGSOVjYqUO/drzRzmrJdW8YDX9HjTS0II/n79yBj9Zn3da80cdpy+Ivnc1pUnlLqSck1pAXodwoKUqX0pFrEuD0rVWZW76WKpYzu2TyJ0Dr7nNfzOuqiVCH4xkPnST1ErK7j14iDG9MeaLd8TsJZO4pFqxjBzwFMr9+Ohrk2Y2gudh0UDUVReDaMhDKmtGiHnDHu0pRjMHLD8l3NMbc9eqUDvedkOzSj9k4yIN4QqVptS7tyXU/FAyWoGnoS1LI41/HWX8ltHx1K6XqkS84I1+lJo7ZCKWDcLsRUAhFDC/3Vgcnyd//K2/Hp+FGHBAejSLBrvPhztcP29O5ktz5o3lSYjoczLkZquQmztM1asFwfW+oze4PQsJk3BI6kt8OqGY5KLeW88ftmln5deV3ceR4hdKNWaCzx6XZ2JU2gOGMKDsHDjSac+KXzYPiDd/02J+VZ9w4yvBHzkxOBNLwl7WMvi8Fhfd7G/dXasdX+a++XCH5OvniE1eo/1nmbc2RqT+7dV1K9Q7FqrVDCREqXBMnML8N7WfIffVVTX4snP9+NvbWPxRK9EtDc2RFFFteX+7M4vYhLKvKk0GQllXoycMGaxmbVZsPeP8HQOHKVh1eYFB+oxtk+i4ELjipKKGgzuaMT6IybBNmP7JCI40LH3gdiFUo25YA0vnDqaA/zfrpzEt0/t51BbyYoS881RugGpeNNLwhqxJkPr6w5AlrnR/h7Kdczn4QAkN4l0WT3DFWevsEVf9modq/ia5ypfm32AiVJ1VrsnxiAqPEjQd9XVRqjWzGH22qMu+7Hl5BVsOXnFkt5jyKAmTOcH/iof5y2QT5mXooSTJm+Wi1aoflxyk8h6i40/ZCh3xPRBSfhX30SXtRiFMHNw+Hu9ru5zZzmHpAQnCN0npd4dT/Rq4XAOTE5r43RBtTajDEyOt/E9nJzWBsZI2xdzvCEU/+qbWC/Lu9z5NndDHt7bKl8gUztDutqINRlaX3cxv2XK7q+gXJOVd1nWWpqZW4CFLkqLaeneK5XJPyvP5PL5dbYR2p1fJCplj5kD3tuaj7kbxJeP8xZIU+aFsDhpTvv2CBqGBqFnS+clkwYmx+Nq5Q08+81h2f3adOwyqm+Y62lwvDlDuVRqzRzuaBeHdnGROP
", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df[\"saledate\"][:1000], # visualize the first 1000 values\n", - " y=df[\"SalePrice\"][:1000])\n", - "ax.set_xlabel(\"Sale Date\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.2 Sorting our DataFrame by saledate\n", - "\n", - "Now we've formatted our `saledate` column to be NumPy `datetime64[ns]` objects, we can use built-in pandas methods such as `sort_values` to sort our DataFrame by date.\n", - "\n", - "And considering this is a time series problem, sorting our DataFrame by date has the added benefit of making sure our data is sequential.\n", - "\n", - "In other words, we want to use examples from the past (example sale prices from previous dates) to try and predict future bulldozer sale prices. \n", - "\n", - "Let's use the [`pandas.DataFrame.sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort our DataFrame by `saledate` in ascending order." 
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(205615 1989-01-17\n", - " 274835 1989-01-31\n", - " 141296 1989-01-31\n", - " 212552 1989-01-31\n", - " 62755 1989-01-31\n", - " 54653 1989-01-31\n", - " 81383 1989-01-31\n", - " 204924 1989-01-31\n", - " 135376 1989-01-31\n", - " 113390 1989-01-31\n", - " Name: saledate, dtype: datetime64[ns],\n", - " 409202 2012-04-28\n", - " 408976 2012-04-28\n", - " 411695 2012-04-28\n", - " 411319 2012-04-28\n", - " 408889 2012-04-28\n", - " 410879 2012-04-28\n", - " 412476 2012-04-28\n", - " 411927 2012-04-28\n", - " 407124 2012-04-28\n", - " 409203 2012-04-28\n", - " Name: saledate, dtype: datetime64[ns])" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Sort DataFrame in date order\n", - "df.sort_values(by=[\"saledate\"], inplace=True, ascending=True)\n", - "df.saledate.head(10), df.saledate.tail(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "Looks like our older samples are now coming first and the newer samples are towards the end of the DataFrame." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.3 Adding extra features to our DataFrame\n", - "\n", - "One way to potentially increase the predictive power of our data is to enhance it with more features.\n", - "\n", - "This practice is known as [**feature engineering**](https://en.wikipedia.org/wiki/Feature_engineering), taking existing features and using them to create more/different features. 
\n", - "\n", - "There is no set-in-stone way to do feature engineering, and it often takes quite a bit of practice, exploration and experimentation to figure out what might work and what won't.\n", - "\n", - "For now, we'll use our `saledate` column to add extra features such as:\n", - "\n", - "* Year of sale\n", - "* Month of sale\n", - "* Day of sale\n", - "* Day of week of sale (e.g. Monday = 0, Tuesday = 1)\n", - "* Day of year of sale (e.g. January 1st = 1, January 2nd = 2)\n", - "\n", - "Since we're going to be manipulating the data, we'll make a copy of the original DataFrame and perform our changes there.\n", - "\n", - "This will keep the original DataFrame intact in case we need it again." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "# Make a copy of the original DataFrame to perform edits on\n", - "df_tmp = df.copy()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because we imported the data using `read_csv()` and asked pandas to parse the dates using `parse_dates=[\"saledate\"]`, we can now access the [different datetime attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) of the `saledate` column through the `.dt` accessor.\n", - "\n", - "Let's use these attributes to add a series of different feature columns to our dataset. \n", - "\n", - "After we've added these extra columns, we can remove the original `saledate` column as its information will be dispersed across these new columns."
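A quick way to see what these attributes return is to try them on a couple of known dates. The sketch below is an illustration only: it uses the earliest and latest sale dates shown in the sorted output (1989-01-17 and 2012-04-28) in a small stand-alone Series, and the `features` DataFrame is a hypothetical name. pandas counts `dayofweek` from Monday = 0 to Sunday = 6, while `dayofyear` starts at 1.

```python
import pandas as pd

# Two known dates: 1989-01-17 was a Tuesday, 2012-04-28 was a Saturday
dates = pd.Series(pd.to_datetime(["1989-01-17", "2012-04-28"]))

# The .dt accessor exposes datetime attributes for every value in the Series
features = pd.DataFrame({
    "saleYear": dates.dt.year,
    "saleMonth": dates.dt.month,
    "saleDay": dates.dt.day,
    "saleDayofweek": dates.dt.dayofweek,  # Monday = 0 ... Sunday = 6
    "saleDayofyear": dates.dt.dayofyear,  # January 1st = 1
})
print(features)
```

Checking a known date like this helps avoid an off-by-one misreading of the day-of-week feature later on.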
- ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [], - "source": [ - "# Add datetime features derived from saledate\n", - "df_tmp[\"saleYear\"] = df_tmp.saledate.dt.year\n", - "df_tmp[\"saleMonth\"] = df_tmp.saledate.dt.month\n", - "df_tmp[\"saleDay\"] = df_tmp.saledate.dt.day\n", - "df_tmp[\"saleDayofweek\"] = df_tmp.saledate.dt.dayofweek\n", - "df_tmp[\"saleDayofyear\"] = df_tmp.saledate.dt.dayofyear\n", - "\n", - "# Drop original saledate column\n", - "df_tmp.drop(\"saledate\", axis=1, inplace=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We could add more columns in this style, such as whether the sale fell at the start or end of a quarter (a sale at the end of a quarter may be influenced by things such as quarterly budgets), but these will do for now.\n", - "\n", - "> **Challenge:** See what other [datetime attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) you can add to `df_tmp` using a similar technique to what we've used above. Hint: check the bottom of the [`pandas.DatetimeIndex` docs](https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html).\n", - "\n", - "How about we view some of our newly created columns?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
<div>\n", - "<table>\n", - "  <thead>\n", - "
<tr><th></th><th>SalePrice</th><th>saleYear</th><th>saleMonth</th><th>saleDay</th><th>saleDayofweek</th><th>saleDayofyear</th></tr>\n", - "  </thead>\n", - "  <tbody>\n", - "
<tr><th>205615</th><td>9500.0</td><td>1989</td><td>1</td><td>17</td><td>1</td><td>17</td></tr>\n", - "
<tr><th>274835</th><td>14000.0</td><td>1989</td><td>1</td><td>31</td><td>1</td><td>31</td></tr>\n", - "
<tr><th>141296</th><td>50000.0</td><td>1989</td><td>1</td><td>31</td><td>1</td><td>31</td></tr>\n", - "
<tr><th>212552</th><td>16000.0</td><td>1989</td><td>1</td><td>31</td><td>1</td><td>31</td></tr>\n", - "
<tr><th>62755</th><td>22000.0</td><td>1989</td><td>1</td><td>31</td><td>1</td><td>31</td></tr>\n", - "  </tbody>\n", - "</table>\n", - "</div>
" - ], - "text/plain": [ - " SalePrice saleYear saleMonth saleDay saleDayofweek saleDayofyear\n", - "205615 9500.0 1989 1 17 1 17\n", - "274835 14000.0 1989 1 31 1 31\n", - "141296 50000.0 1989 1 31 1 31\n", - "212552 16000.0 1989 1 31 1 31\n", - "62755 22000.0 1989 1 31 1 31" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# View newly created columns\n", - "df_tmp[[\"SalePrice\", \"saleYear\", \"saleMonth\", \"saleDay\", \"saleDayofweek\", \"saleDayofyear\"]].head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Cool!\n", - "\n", - "Now we've broken our `saledate` column into columns/features, we can perform further exploratory analysis such as visualizing the `SalePrice` against the `saleMonth`.\n", - "\n", - "How about we view the first 10,000 samples (we could also randomly select 10,000 samples too) to see if reveals anything about which month has the highest sales?" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# View 10,000 samples SalePrice against saleMonth\n", - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df_tmp[\"saleMonth\"][:10000], # visualize the first 10000 values\n", - " y=df_tmp[\"SalePrice\"][:10000])\n", - "ax.set_xlabel(\"Sale Month\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... doesn't look like there's too much conclusive evidence here about which month has the highest sales value.\n", - "\n", - "How about we plot the median sale price of each month?\n", - "\n", - "We can do so by grouping on the `saleMonth` column with [`pandas.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and then getting the median of the `SalePrice` column." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Group DataFrame by saleMonth and then find the median SalePrice\n", - "df_tmp.groupby([\"saleMonth\"])[\"SalePrice\"].median().plot()\n", - "plt.xlabel(\"Month\")\n", - "plt.ylabel(\"Median Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ohhh it looks like the median sale prices of January and February (months 1 and 2) are quite a bit higher than the other months of the year.\n", - "\n", - "Could this be because of New Year budget spending?\n", - "\n", - "Perhaps... but this would take a bit more investigation.\n", - "\n", - "In the meantime, there are many other values we could look further into." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.4 Inspect values of other columns\n", - "\n", - "When first exploring a new problem, it's often a good idea to become as familiar with the data as you can.\n", - "\n", - "Of course, with a dataset that has over 400,000 samples, it's unlikely you'll ever get through every sample.\n", - "\n", - "But that's where the power of data analysis and machine learning can help.\n", - "\n", - "We can use pandas to aggregate thousands of samples into smaller more managable pieces.\n", - "\n", - "And as we'll see later on, we can use machine learning models to model the data and then later inspect which features the model thought were most important.\n", - "\n", - "How about we see which states sell the most bulldozers?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "state\n", - "Florida 67320\n", - "Texas 53110\n", - "California 29761\n", - "Washington 16222\n", - "Georgia 14633\n", - "Maryland 13322\n", - "Mississippi 13240\n", - "Ohio 12369\n", - "Illinois 11540\n", - "Colorado 11529\n", - "Name: count, dtype: int64" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check the different values of different columns\n", - "df_tmp.state.value_counts()[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! Looks like Flordia sells a fair few bulldozers.\n", - "\n", - "How about we go even further and group our samples by `state` and then find the median `SalePrice` per state?\n", - "\n", - "We also compare this to the median `SalePrice` for all samples." - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Group DataFrame by saleMonth and then find the median SalePrice per state as well as across the whole dataset\n", - "median_prices_by_state = df_tmp.groupby([\"state\"])[\"SalePrice\"].median() # this will return a pandas Series rather than a DataFrame\n", - "median_sale_price = df_tmp[\"SalePrice\"].median()\n", - "\n", - "# Create a plot comparing median sale price per state to median sale price overall\n", - "plt.figure(figsize=(10, 7))\n", - "plt.bar(x=median_prices_by_state.index, # Because we're working with a Series, we can use the index (state names) as the x values\n", - " height=median_prices_by_state.values)\n", - "plt.xlabel(\"State\")\n", - "plt.ylabel(\"Median Sale Price ($)\")\n", - "plt.xticks(rotation=90, fontsize=7);\n", - "plt.axhline(y=median_sale_price, \n", - " color=\"red\", \n", - " linestyle=\"--\", \n", - " label=f\"Median Sale Price: ${median_sale_price:,.0f}\")\n", - "plt.legend();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that's a nice looking figure!\n", - "\n", - "Interestingly Florida has the most sales and the median sale price is above the overall median of all other states.\n", - "\n", - "And if you had a bulldozer and were chasing the highest sale price, the data would reveal that perhaps selling in South Dakota would be your best bet.\n", - "\n", - "Perhaps bulldozers are in higher demand in South Dakota because of a building or mining boom?\n", - "\n", - "Answering this would require a bit more research.\n", - "\n", - "But what we're doing here is slowly building up a mental model of our data. \n", - "\n", - "So that if we saw an example in the future, we could compare its values to the ones we've already seen." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Model driven data exploration\n", - "\n", - "We've performed a small Exploratory Data Analysis (EDA) and enriched our data with some `datetime` attributes, so now let's try to model it.\n", - "\n", - "Why model so early?\n", - "\n", - "Well, we know the evaluation metric (root mean squared log error or RMSLE) we're heading towards. \n", - "\n", - "We could spend more time doing EDA, finding more out about the data ourselves but what we'll do instead is use a machine learning model to help us do EDA whilst simultaneously working towards the best evaluation metric we can get. \n", - "\n", - "Remember, one of the biggest goals of starting any new machine learning project is reducing the time between experiments.\n", - "\n", - "Following the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/machine_learning_map.html) and taking into account the fact we've got over 100,000 examples, we find a [`sklearn.linear_model.SGDRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) or a [`sklearn.ensemble.RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor) model might be a good candidate.\n", - "\n", - "Since we've worked with the Random Forest algorithm before (on the [heart disease classification problem](https://dev.mrdbourke.com/zero-to-mastery-ml/end-to-end-heart-disease-classification/)), let's try it out on our regression problem.\n", - "\n", - "> **Note:** We're trying just one model here for now. But you can try many other kinds of models from the Scikit-Learn library, they mostly work with a similar API. There are even libraries such as [`LazyPredict`](https://github.com/shankarpandala/lazypredict) which will try many models simultaneously and return a table with the results." 
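To make the "similar API" point from the note concrete, here is a minimal sketch that fits the two candidate models on a small synthetic, fully numeric dataset. The data, the variable names (`X`, `y`, `models`, `scores`) and the hyperparameters are assumptions made for this illustration only, not the bulldozer data or the settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor

# Synthetic numeric data (made up): y depends linearly on the first two features
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Both estimators expose the same fit/score interface
models = {
    "SGDRegressor": SGDRegressor(random_state=42),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=20, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # R^2 on the training data (optimistic)
    print(f"{name}: training R^2 = {scores[name]:.3f}")
```

Because the interface is shared, swapping one model for another is mostly a one-line change. Keep in mind the scores here are on the training data, so they say nothing about how well either model generalizes.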
- ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "could not convert string to float: 'Low'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/2824176890.py\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# This won't work since we've got missing numbers and categories\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mensemble\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m model.fit(X=df_tmp.drop(\"SalePrice\", axis=1), # use all columns except SalePrice as X input\n\u001b[0m\u001b[1;32m 6\u001b[0m y=df_tmp.SalePrice) # use SalePrice column as y input\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1469\u001b[0m skip_parameter_validation=(\n\u001b[1;32m 1470\u001b[0m \u001b[0mprefer_skip_nested_validation\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mglobal_skip_validation\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1471\u001b[0m )\n\u001b[1;32m 1472\u001b[0m ):\n\u001b[0;32m-> 
1473\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfit_method\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0;31m# Validate or convert input data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 360\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 361\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"sparse multilabel-indicator for y is not supported.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 362\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 363\u001b[0;31m X, y = self._validate_data(\n\u001b[0m\u001b[1;32m 364\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 365\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 366\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 646\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"estimator\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 647\u001b[0m \u001b[0mcheck_y_params\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m**\u001b[0m\u001b[0mdefault_check_params\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 648\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"y\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 649\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 650\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 651\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 652\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 653\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mcheck_params\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"ensure_2d\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - 
"\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 1297\u001b[0m raise ValueError(\n\u001b[1;32m 1298\u001b[0m \u001b[0;34mf\"{estimator_name} requires y to be passed, but the target y is None\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1299\u001b[0m )\n\u001b[1;32m 1300\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1301\u001b[0;31m X = check_array(\n\u001b[0m\u001b[1;32m 1302\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1303\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1304\u001b[0m \u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m )\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not supported\\n{}\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m ) from complex_warning\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[0;31m# Use NumPy API to support order\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 748\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 749\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 750\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 751\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 753\u001b[0m \u001b[0;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, dtype, copy)\u001b[0m\n\u001b[1;32m 2149\u001b[0m def __array__(\n\u001b[1;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2151\u001b[0m ) -> np.ndarray:\n\u001b[1;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2153\u001b[0;31m \u001b[0marr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2154\u001b[0m if (\n\u001b[1;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2156\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0musing_copy_on_write\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Low'" - ] - } - ], - "source": [ - "# This won't work since we've got missing numbers and categories\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "\n", - "model = RandomForestRegressor(n_jobs=-1)\n", - "model.fit(X=df_tmp.drop(\"SalePrice\", axis=1), # use all columns except SalePrice as X input\n", - " y=df_tmp.SalePrice) # use SalePrice column as y input" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Oh no!\n", - "\n", - "When we try to fit our model to the data, we get a value error similar to:\n", - "\n", - "> `ValueError: could not convert string to float: 'Low'`\n", - "\n", - "The problem here is that some of the features of our data are in string format and machine learning models love numbers.\n", - "\n", - "Not to mention some of our samples have missing values.\n", - "\n", - "And typically, machine learning models require all data to 
be in numerical format as well as all missing values to be filled.\n", - "\n", - "Let's start to fix this by inspecting the different datatypes in our DataFrame.\n", - "\n", - "We can do so using the [`pandas.DataFrame.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method, which will give us the datatype of each column as well as how many non-null values (a null value is generally a missing value) each column of our `df_tmp` DataFrame has.\n", - "\n", - "> **Note:** There are some ML models such as [`sklearn.ensemble.HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html), [CatBoost](https://catboost.ai/) and [XGBoost](https://xgboost.ai/) which can handle missing values, however, I'll leave exploring each of these as extra-curriculum/extensions." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Index: 412698 entries, 205615 to 409203\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 fiModelDesc 412698 non-null object \n", - " 10 fiBaseModel 412698 non-null object \n", - " 11 fiSecondaryDesc 271971 non-null object \n", - " 12 fiModelSeries 58667 non-null object \n", - " 13 fiModelDescriptor 74816 non-null object \n", - " 14 ProductSize 196093 non-null object \n", - " 15 fiProductClassDesc 412698 non-null object \n", - " 16 state 412698 non-null object \n", - " 17 
ProductGroup 412698 
non-null object \n", - " 18 ProductGroupDesc 412698 non-null object \n", - " 19 Drive_System 107087 non-null object \n", - " 20 Enclosure 412364 non-null object \n", - " 21 Forks 197715 non-null object \n", - " 22 Pad_Type 81096 non-null object \n", - " 23 Ride_Control 152728 non-null object \n", - " 24 Stick 81096 non-null object \n", - " 25 Transmission 188007 non-null object \n", - " 26 Turbocharged 81096 non-null object \n", - " 27 Blade_Extension 25983 non-null object \n", - " 28 Blade_Width 25983 non-null object \n", - " 29 Enclosure_Type 25983 non-null object \n", - " 30 Engine_Horsepower 25983 non-null object \n", - " 31 Hydraulics 330133 non-null object \n", - " 32 Pushblock 25983 non-null object \n", - " 33 Ripper 106945 non-null object \n", - " 34 Scarifier 25994 non-null object \n", - " 35 Tip_Control 25983 non-null object \n", - " 36 Tire_Size 97638 non-null object \n", - " 37 Coupler 220679 non-null object \n", - " 38 Coupler_System 44974 non-null object \n", - " 39 Grouser_Tracks 44875 non-null object \n", - " 40 Hydraulics_Flow 44875 non-null object \n", - " 41 Track_Type 102193 non-null object \n", - " 42 Undercarriage_Pad_Width 102916 non-null object \n", - " 43 Stick_Length 102261 non-null object \n", - " 44 Thumb 102332 non-null object \n", - " 45 Pattern_Changer 102261 non-null object \n", - " 46 Grouser_Type 102193 non-null object \n", - " 47 Backhoe_Mounting 80712 non-null object \n", - " 48 Blade_Type 81875 non-null object \n", - " 49 Travel_Controls 81877 non-null object \n", - " 50 Differential_Type 71564 non-null object \n", - " 51 Steering_Controls 71522 non-null object \n", - " 52 saleYear 412698 non-null int32 \n", - " 53 saleMonth 412698 non-null int32 \n", - " 54 saleDay 412698 non-null int32 \n", - " 55 saleDayofweek 412698 non-null int32 \n", - " 56 saleDayofyear 412698 non-null int32 \n", - "dtypes: float64(3), int32(5), int64(5), object(44)\n", - "memory usage: 174.7+ MB\n" - ] - } - ], - "source": [ - "# Check for missing values 
and different datatypes \n", - "df_tmp.info();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, it seems as though we've got a fair few different datatypes. \n", - "\n", - "There are `int64` types such as `MachineID`.\n", - "\n", - "There are `float64` types such as `SalePrice`.\n", - "\n", - "And there are `object` (the `object` dtype can hold any Python object, including strings) types such as `UsageBand`.\n", - "\n", - "> **Resource:** You can see a list of all the [pandas dtypes in the pandas user guide](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes).\n", - "\n", - "How about we find out how many missing values are in each column?\n", - "\n", - "We can do so using [`pandas.DataFrame.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) (`isna` stands for 'is null or NaN') which will return a boolean `True`/`False` for each value (`True` if missing, `False` if not). \n", - "\n", - "Let's start by checking the missing values in the head of our DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
SalesIDSalePriceMachineIDModelIDdatasourceauctioneerIDYearMadeMachineHoursCurrentMeterUsageBandfiModelDesc...Backhoe_MountingBlade_TypeTravel_ControlsDifferential_TypeSteering_ControlssaleYearsaleMonthsaleDaysaleDayofweeksaleDayofyear
205615FalseFalseFalseFalseFalseFalseFalseTrueTrueFalse...FalseFalseFalseTrueTrueFalseFalseFalseFalseFalse
274835FalseFalseFalseFalseFalseFalseFalseTrueTrueFalse...TrueTrueTrueFalseFalseFalseFalseFalseFalseFalse
141296FalseFalseFalseFalseFalseFalseFalseTrueTrueFalse...FalseFalseFalseTrueTrueFalseFalseFalseFalseFalse
212552FalseFalseFalseFalseFalseFalseFalseTrueTrueFalse...TrueTrueTrueFalseFalseFalseFalseFalseFalseFalse
62755FalseFalseFalseFalseFalseFalseFalseTrueTrueFalse...FalseFalseFalseTrueTrueFalseFalseFalseFalseFalse
\n", - "

5 rows × 57 columns

\n", - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "205615 False False False False False False \n", - "274835 False False False False False False \n", - "141296 False False False False False False \n", - "212552 False False False False False False \n", - "62755 False False False False False False \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "205615 False True True False ... \n", - "274835 False True True False ... \n", - "141296 False True True False ... \n", - "212552 False True True False ... \n", - "62755 False True True False ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "205615 False False False True \n", - "274835 True True True False \n", - "141296 False False False True \n", - "212552 True True True False \n", - "62755 False False False True \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek \\\n", - "205615 True False False False False \n", - "274835 False False False False False \n", - "141296 True False False False False \n", - "212552 False False False False False \n", - "62755 True False False False False \n", - "\n", - " saleDayofyear \n", - "205615 False \n", - "274835 False \n", - "141296 False \n", - "212552 False \n", - "62755 False \n", - "\n", - "[5 rows x 57 columns]" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find missing values in the head of our DataFrame \n", - "df_tmp.head().isna()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Alright it seems as though we've got some missing values in the `MachineHoursCurrentMeter` as well as the `UsageBand` and a few other columns.\n", - "\n", - "But so far we've only viewed the first few rows.\n", - "\n", - "It'll be very time consuming to go through each row one by one so how about we get the total missing values per column?\n", - "\n", - "We can 
do so by calling `.isna()` on the whole DataFrame and then chaining it together with `.sum()`.\n", - "\n", - "Doing so will give us the total number of `True` (missing) values in each column (when summing, `True` = 1, `False` = 0)." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "SalesID 0\n", - "SalePrice 0\n", - "MachineID 0\n", - "ModelID 0\n", - "datasource 0\n", - "auctioneerID 20136\n", - "YearMade 0\n", - "MachineHoursCurrentMeter 265194\n", - "UsageBand 339028\n", - "fiModelDesc 0\n", - "fiBaseModel 0\n", - "fiSecondaryDesc 140727\n", - "fiModelSeries 354031\n", - "fiModelDescriptor 337882\n", - "ProductSize 216605\n", - "fiProductClassDesc 0\n", - "state 0\n", - "ProductGroup 0\n", - "ProductGroupDesc 0\n", - "Drive_System 305611\n", - "Enclosure 334\n", - "Forks 214983\n", - "Pad_Type 331602\n", - "Ride_Control 259970\n", - "Stick 331602\n", - "Transmission 224691\n", - "Turbocharged 331602\n", - "Blade_Extension 386715\n", - "Blade_Width 386715\n", - "Enclosure_Type 386715\n", - "Engine_Horsepower 386715\n", - "Hydraulics 82565\n", - "Pushblock 386715\n", - "Ripper 305753\n", - "Scarifier 386704\n", - "Tip_Control 386715\n", - "Tire_Size 315060\n", - "Coupler 192019\n", - "Coupler_System 367724\n", - "Grouser_Tracks 367823\n", - "Hydraulics_Flow 367823\n", - "Track_Type 310505\n", - "Undercarriage_Pad_Width 309782\n", - "Stick_Length 310437\n", - "Thumb 310366\n", - "Pattern_Changer 310437\n", - "Grouser_Type 310505\n", - "Backhoe_Mounting 331986\n", - "Blade_Type 330823\n", - "Travel_Controls 330821\n", - "Differential_Type 341134\n", - "Steering_Controls 341176\n", - "saleYear 0\n", - "saleMonth 0\n", - "saleDay 0\n", - "saleDayofweek 0\n", - "saleDayofyear 0\n", - "dtype: int64" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check for total missing values per column\n", - "df_tmp.isna().sum()" - ] - }, - { 
- 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! It looks like our DataFrame has quite a few missing values.\n", - "\n", - "Not to worry, we can work on fixing this later on.\n", - "\n", - "How about we start by tring to turn all of our data in numbers? " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.1 Inspecting the datatypes in our DataFrame \n", - "\n", - "One way to help turn all of our data into numbers is to convert the columns with the `object` datatype into a `category` datatype using [`pandas.CategoricalDtype`](https://pandas.pydata.org/docs/reference/api/pandas.CategoricalDtype.html).\n", - "\n", - "> **Note:** There are many different ways to convert values into numbers. And often the best way will be specific to the value you're trying to convert. The method we're going to use, converting all objects (that are mostly strings) to categories is one of the faster methods as it makes a quick assumption that each unique value is its own number. \n", - "\n", - "We can check the datatype of an individual column using the `.dtype` attribute and we can get its full name using `.dtype.name`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(dtype('O'), 'object')" - ] - }, - "execution_count": 33, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the dtype of a given column\n", - "df_tmp[\"UsageBand\"].dtype, df_tmp[\"UsageBand\"].dtype.name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "Now we've got a way to check a column's datatype individually.\n", - "\n", - "There's also another group of methods to check a column's datatype directly.\n", - "\n", - "For example, using [`pd.api.types.is_object_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_object_dtype.html) we can get a boolean response as to whether the input is an object or not.\n", - "\n", - "> **Note:** There are many more of these checks you can perform for other datatypes such as strings under a similar name space `pd.api.types.is_XYZ_dtype`. See the [pandas documentation](https://pandas.pydata.org/docs/reference/arrays.html) for more.\n", - "\n", - "Let's see how it works on our `df_tmp[\"UsageBand\"]` column." - ] - }, - { - "cell_type": "code", - "execution_count": 34, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 34, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check whether a column is an object\n", - "pd.api.types.is_object_dtype(df_tmp[\"UsageBand\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also check whether a column is a string with [`pd.api.types.is_string_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_string_dtype.html). 
" - ] - }, - { - "cell_type": "code", - "execution_count": 35, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 35, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check whether a column is a string\n", - "pd.api.types.is_string_dtype(df_tmp[\"state\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "We can even loop through the items (columns and their labels) in our DataFrame using [`pandas.DataFrame.items()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html) (in Python dictionary terms, calling `.items()` on a DataFrame will treat the column names as the keys and the column values as the values) and print out samples of columns which have the `string` datatype.\n", - "\n", - "As an extra check, passing the sample to [`pd.api.types.infer_dtype()`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.infer_dtype.html) will return the datatype of the sample.\n", - "\n", - "This will be a good way to keep exploring our data." 
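A minimal sketch of what `pd.api.types.infer_dtype()` reports, using a toy DataFrame whose column names and values are made up for illustration (they are not taken from the bulldozer dataset). Missing values are skipped by default during inference, so a sample containing only `NaN` is reported as `empty`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, all stored with the object datatype
toy = pd.DataFrame({
    "all_strings": pd.Series(["Low", "High", "Medium"], dtype="object"),
    "strings_with_nan": pd.Series(["Low", np.nan, "High"], dtype="object"),
    "only_nan": pd.Series([np.nan, np.nan, np.nan], dtype="object"),
})

# Loop through (column name, column values) pairs, just like dict.items()
inferred = {}
for label, content in toy.items():
    inferred[label] = pd.api.types.infer_dtype(content)
    print(f"Column name: {label} | Column dtype: {content.dtype.name} | Inferred dtype: {inferred[label]}")
```

All three columns share the `object` datatype, yet the inferred datatype depends on the values each one actually holds.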
- ] - }, - { - "cell_type": "code", - "execution_count": 36, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This is a key: key1\n", - "This is a value: hello\n", - "This is a key: key2\n", - "This is a value: world!\n" - ] - } - ], - "source": [ - "# Quick example of calling .items() on a dictionary\n", - "random_dict = {\"key1\": \"hello\",\n", - "               \"key2\": \"world!\"}\n", - "\n", - "for key, value in random_dict.items():\n", - "    print(f\"This is a key: {key}\")\n", - "    print(f\"This is a value: {value}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: fiModelDesc | Column dtype: object | Example value: ['35ZTS'] | Example value dtype: string\n", - "Column name: fiBaseModel | Column dtype: object | Example value: ['PC75'] | Example value dtype: string\n", - "Column name: fiProductClassDesc | Column dtype: object | Example value: ['Backhoe Loader - 14.0 to 15.0 Ft Standard Digging Depth'] | Example value dtype: string\n", - "Column name: state | Column dtype: object | Example value: ['Florida'] | Example value dtype: string\n", - "Column name: ProductGroup | Column dtype: object | Example value: ['TTT'] | Example value dtype: string\n", - "Column name: ProductGroupDesc | Column dtype: object | Example value: ['Track Excavators'] | Example value dtype: string\n" - ] - } - ], - "source": [ - "# Print column names and example content of columns which contain strings\n", - "for label, content in df_tmp.items():\n", - "    if pd.api.types.is_string_dtype(content):\n", - "        # Check datatype of target column\n", - "        column_datatype = df_tmp[label].dtype.name\n", - "\n", - "        # Get random sample from column values\n", - "        example_value = content.sample(1).values\n", - "\n", - "        # Infer random sample datatype\n", - "        example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - "        print(f\"Column
name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... it seems that there are many more columns in the `df_tmp` with the `object` type that didn't display when checking for the string datatype (we know there are many `object` datatype columns in our DataFrame from using `df_tmp.info()`).\n", - "\n", - "How about we try the same as above, except this time instead of `pd.api.types.is_string_dtype`, we use `pd.api.types.is_object_dtype`?\n", - "\n", - "Let's try it." - ] - }, - { - "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: UsageBand | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiModelDesc | Column dtype: object | Example value: ['590SUPER MII'] | Example value dtype: string\n", - "Column name: fiBaseModel | Column dtype: object | Example value: ['580'] | Example value dtype: string\n", - "Column name: fiSecondaryDesc | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiModelSeries | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiModelDescriptor | Column dtype: object | Example value: ['H'] | Example value dtype: string\n", - "Column name: ProductSize | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiProductClassDesc | Column dtype: object | Example value: ['Track Type Tractor, Dozer - 75.0 to 85.0 Horsepower'] | Example value dtype: string\n", - "Column name: state | Column dtype: object | Example value: ['Florida'] | Example value dtype: string\n", - "Column name: ProductGroup | Column dtype: object | Example value: ['TEX'] | Example value dtype: string\n", - "Column name: ProductGroupDesc | Column dtype: object 
| Example value: ['Skid Steer Loaders'] | Example value dtype: string\n", - "Column name: Drive_System | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Enclosure | Column dtype: object | Example value: ['EROPS'] | Example value dtype: string\n", - "Column name: Forks | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Pad_Type | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Ride_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Stick | Column dtype: object | Example value: ['Extended'] | Example value dtype: string\n", - "Column name: Transmission | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Turbocharged | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Blade_Extension | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Blade_Width | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Enclosure_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Engine_Horsepower | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Hydraulics | Column dtype: object | Example value: ['Standard'] | Example value dtype: string\n", - "Column name: Pushblock | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Ripper | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Scarifier | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Tip_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Tire_Size | Column dtype: object | 
Example value: [nan] | Example value dtype: empty\n", - "Column name: Coupler | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Coupler_System | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Grouser_Tracks | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Hydraulics_Flow | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Track_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Undercarriage_Pad_Width | Column dtype: object | Example value: ['20 inch'] | Example value dtype: string\n", - "Column name: Stick_Length | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Thumb | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Pattern_Changer | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Grouser_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Backhoe_Mounting | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Blade_Type | Column dtype: object | Example value: ['Straight'] | Example value dtype: string\n", - "Column name: Travel_Controls | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Differential_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Steering_Controls | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "\n", - "[INFO] Total number of object type columns: 44\n" - ] - } - ], - "source": [ - "# Start a count of how many object type columns there are\n", - "number_of_object_type_columns = 0\n", - "\n", - "for 
label, content in df_tmp.items():\n", - "    # Check to see if column is of object type (this will include the string columns)\n", - "    if pd.api.types.is_object_dtype(content):\n", - "        # Check datatype of target column\n", - "        column_datatype = df_tmp[label].dtype.name\n", - "\n", - "        # Get random sample from column values\n", - "        example_value = content.sample(1).values\n", - "\n", - "        # Infer random sample datatype\n", - "        example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - "        print(f\"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")\n", - "\n", - "        number_of_object_type_columns += 1\n", - "\n", - "print(f\"\\n[INFO] Total number of object type columns: {number_of_object_type_columns}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful, looks like we've got sample outputs from all of the columns with the `object` datatype.\n", - "\n", - "It also looks like many of the random samples are missing values." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.2 Converting strings to categories with pandas \n", - "\n", - "In pandas, one way to convert object/string values to numerical values is to convert them to categories or more specifically, the `pd.CategoricalDtype` datatype.\n", - "\n", - "This datatype keeps the underlying data the same (e.g.
doesn't change the string) but enables easy conversion to a numeric code using [`.cat.codes`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.codes.html).\n", - "\n", - "For example, the column `state` might have the values `'Alabama', 'Alaska', 'Arizona'...` and these could be mapped to numeric values `0, 1, 2...` respectively (category codes start at 0).\n", - "\n", - "To see this in action, let's first convert the object datatype columns to `\"category\"` datatype.\n", - "\n", - "We can do so by looping through the `.items()` of our DataFrame and reassigning each object datatype column using [`pandas.Series.astype(dtype=\"category\")`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html)." - ] - }, - { - "cell_type": "code", - "execution_count": 39, - "metadata": {}, - "outputs": [], - "source": [ - "# This will turn all of the object columns into category values\n", - "for label, content in df_tmp.items():\n", - "    if pd.api.types.is_object_dtype(content):\n", - "        df_tmp[label] = df_tmp[label].astype(\"category\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful!\n", - "\n", - "Now let's check if it worked by calling `.info()` on our DataFrame."
- ] - }, - { - "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Index: 412698 entries, 205615 to 409203\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64 \n", - " 8 UsageBand 73670 non-null category\n", - " 9 fiModelDesc 412698 non-null category\n", - " 10 fiBaseModel 412698 non-null category\n", - " 11 fiSecondaryDesc 271971 non-null category\n", - " 12 fiModelSeries 58667 non-null category\n", - " 13 fiModelDescriptor 74816 non-null category\n", - " 14 ProductSize 196093 non-null category\n", - " 15 fiProductClassDesc 412698 non-null category\n", - " 16 state 412698 non-null category\n", - " 17 ProductGroup 412698 non-null category\n", - " 18 ProductGroupDesc 412698 non-null category\n", - " 19 Drive_System 107087 non-null category\n", - " 20 Enclosure 412364 non-null category\n", - " 21 Forks 197715 non-null category\n", - " 22 Pad_Type 81096 non-null category\n", - " 23 Ride_Control 152728 non-null category\n", - " 24 Stick 81096 non-null category\n", - " 25 Transmission 188007 non-null category\n", - " 26 Turbocharged 81096 non-null category\n", - " 27 Blade_Extension 25983 non-null category\n", - " 28 Blade_Width 25983 non-null category\n", - " 29 Enclosure_Type 25983 non-null category\n", - " 30 Engine_Horsepower 25983 non-null category\n", - " 31 Hydraulics 330133 non-null category\n", - " 32 Pushblock 25983 non-null category\n", - " 33 Ripper 106945 non-null category\n", - " 34 Scarifier 25994 non-null category\n", - " 35 
Tip_Control 25983 non-null category\n", - " 36 Tire_Size 97638 non-null category\n", - " 37 Coupler 220679 non-null category\n", - " 38 Coupler_System 44974 non-null category\n", - " 39 Grouser_Tracks 44875 non-null category\n", - " 40 Hydraulics_Flow 44875 non-null category\n", - " 41 Track_Type 102193 non-null category\n", - " 42 Undercarriage_Pad_Width 102916 non-null category\n", - " 43 Stick_Length 102261 non-null category\n", - " 44 Thumb 102332 non-null category\n", - " 45 Pattern_Changer 102261 non-null category\n", - " 46 Grouser_Type 102193 non-null category\n", - " 47 Backhoe_Mounting 80712 non-null category\n", - " 48 Blade_Type 81875 non-null category\n", - " 49 Travel_Controls 81877 non-null category\n", - " 50 Differential_Type 71564 non-null category\n", - " 51 Steering_Controls 71522 non-null category\n", - " 52 saleYear 412698 non-null int32 \n", - " 53 saleMonth 412698 non-null int32 \n", - " 54 saleDay 412698 non-null int32 \n", - " 55 saleDayofweek 412698 non-null int32 \n", - " 56 saleDayofyear 412698 non-null int32 \n", - "dtypes: category(44), float64(3), int32(5), int64(5)\n", - "memory usage: 55.4 MB\n" - ] - } - ], - "source": [ - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It looks like it worked!\n", - "\n", - "All of the object datatype columns now have the category datatype.\n", - "\n", - "We can inspect this on a single column using `pandas.Series.dtype`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "CategoricalDtype(categories=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',\n", - " 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',\n", - " 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',\n", - " 'Kentucky', 'Louisiana', 'Maine', 'Maryland',\n", - " 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',\n", - " 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',\n", - " 'New Jersey', 'New Mexico', 'New York', 'North Carolina',\n", - " 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',\n", - " 'Puerto Rico', 'Rhode Island', 'South Carolina',\n", - " 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah',\n", - " 'Vermont', 'Virginia', 'Washington', 'Washington DC',\n", - " 'West Virginia', 'Wisconsin', 'Wyoming'],\n", - ", ordered=False, categories_dtype=object)" - ] - }, - "execution_count": 41, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check the datatype of a single column\n", - "df_tmp.state.dtype" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent, notice how the column is now of type `pd.CategoricalDtype`.\n", - "\n", - "We can also access these categories using [`pandas.Series.cat.categories`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.categories.html)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',\n", - " 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',\n", - " 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',\n", - " 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',\n", - " 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',\n", - " 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',\n", - " 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',\n", - " 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',\n", - " 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah', 'Vermont',\n", - " 'Virginia', 'Washington', 'Washington DC', 'West Virginia', 'Wisconsin',\n", - " 'Wyoming'],\n", - " dtype='object')" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the category names of a given column\n", - "df_tmp.state.cat.categories" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we can get the category codes (the numeric values representing the category) using [`pandas.Series.cat.codes`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.codes.html)." 
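As a small self-contained illustration of category codes, the sketch below uses a made-up toy column (not the real `state` column). It assumes nothing beyond pandas itself: categories created from strings are sorted, codes start at 0, and a missing value gets the code `-1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
states = pd.Series(["Texas", "Alabama", np.nan, "Florida", "Alabama"],
                   dtype="object").astype("category")

# Category labels (sorted) and the integer code for every row
categories = list(states.cat.categories)
codes = states.cat.codes.tolist()

# Map each code back to its label; -1 marks a missing value
decoded = [categories[code] if code != -1 else None for code in codes]

print(f"Categories: {categories}")
print(f"Codes: {codes}")
print(f"Decoded: {decoded}")
```

The `-1` code is worth remembering when the categories are later turned into numbers, since missing values will show up as `-1` rather than as `NaN`.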
- ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "205615 43\n", - "274835 8\n", - "141296 8\n", - "212552 8\n", - "62755 8\n", - " ..\n", - "410879 4\n", - "412476 4\n", - "411927 4\n", - "407124 4\n", - "409203 4\n", - "Length: 412698, dtype: int8" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Inspect the category codes\n", - "df_tmp.state.cat.codes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This gives us a numeric representation of our object/string datatype columns." - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Target state category number 43 maps to: Texas\n" - ] - } - ], - "source": [ - "# Get example string using category number\n", - "target_state_cat_number = 43\n", - "target_state_cat_value = df_tmp.state.cat.categories[target_state_cat_number]\n", - "print(f\"[INFO] Target state category number {target_state_cat_number} maps to: {target_state_cat_value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Epic! \n", - "\n", - "All of our object columns are now categorical, so we can turn the categories into numbers. However, our data is still missing values. Not to worry though, we'll get to these shortly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.3 Saving our preprocessed data (part 1)\n", - "\n", - "Before we start doing any further preprocessing steps on our DataFrame, how about we save our current DataFrame to file so we can import it again later if necessary.\n", - "\n", - "Saving and updating your dataset as you go is common practice in machine learning problems.
As your problem changes and evolves, the dataset you're working with will likely change too.\n", - "\n", - "Making checkpoints of your dataset is similar to making checkpoints of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [], - "source": [ - "# Save preprocessed data to file\n", - "df_tmp.to_csv(\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv\",\n", - " index=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we've saved our preprocessed data to file, we can re-import it and make sure it's in the same format." - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1646770 9500.0 1126363 8434 132 18.0 1974 \n", - "1 1821514 14000.0 1194089 10150 132 99.0 1980 \n", - "2 1505138 50000.0 1473654 4139 132 99.0 1978 \n", - "3 1671174 16000.0 1327630 8591 132 99.0 1980 \n", - "4 1329056 22000.0 1336053 4089 132 99.0 1984 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc ... Backhoe_Mounting \\\n", - "0 NaN NaN TD20 ... None or Unspecified \n", - "1 NaN NaN A66 ... NaN \n", - "2 NaN NaN D7G ... None or Unspecified \n", - "3 NaN NaN A62 ... NaN \n", - "4 NaN NaN D3B ... None or Unspecified \n", - "\n", - " Blade_Type Travel_Controls Differential_Type Steering_Controls \\\n", - "0 Straight None or Unspecified NaN NaN \n", - "1 NaN NaN Standard Conventional \n", - "2 Straight None or Unspecified NaN NaN \n", - "3 NaN NaN Standard Conventional \n", - "4 PAT Lever NaN NaN \n", - "\n", - " saleYear saleMonth saleDay saleDayofweek saleDayofyear \n", - "0 1989 1 17 1 17 \n", - "1 1989 1 31 1 31 \n", - "2 1989 1 31 1 31 \n", - "3 1989 1 31 1 31 \n", - "4 1989 1 31 1 31 \n", - "\n", - "[5 rows x 57 columns]" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Import preprocessed data from file\n", - "df_tmp = pd.read_csv(\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv\",\n", - "                     low_memory=False)\n", - "\n", - "df_tmp.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent, looking at the tail end (the far right side), our processed DataFrame has the columns we added to it (the extra date features) but it's still missing values.\n", - "\n", - "But if we check `df_tmp.info()`..."
- ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 fiModelDesc 412698 non-null object \n", - " 10 fiBaseModel 412698 non-null object \n", - " 11 fiSecondaryDesc 271971 non-null object \n", - " 12 fiModelSeries 58667 non-null object \n", - " 13 fiModelDescriptor 74816 non-null object \n", - " 14 ProductSize 196093 non-null object \n", - " 15 fiProductClassDesc 412698 non-null object \n", - " 16 state 412698 non-null object \n", - " 17 ProductGroup 412698 non-null object \n", - " 18 ProductGroupDesc 412698 non-null object \n", - " 19 Drive_System 107087 non-null object \n", - " 20 Enclosure 412364 non-null object \n", - " 21 Forks 197715 non-null object \n", - " 22 Pad_Type 81096 non-null object \n", - " 23 Ride_Control 152728 non-null object \n", - " 24 Stick 81096 non-null object \n", - " 25 Transmission 188007 non-null object \n", - " 26 Turbocharged 81096 non-null object \n", - " 27 Blade_Extension 25983 non-null object \n", - " 28 Blade_Width 25983 non-null object \n", - " 29 Enclosure_Type 25983 non-null object \n", - " 30 Engine_Horsepower 25983 non-null object \n", - " 31 Hydraulics 330133 non-null object \n", - " 32 Pushblock 25983 non-null object \n", - " 33 Ripper 106945 non-null object \n", - " 34 Scarifier 25994 non-null object \n", - " 35 Tip_Control 25983 non-null object \n", 
- " 36 Tire_Size 97638 non-null object \n", - " 37 Coupler 220679 non-null object \n", - " 38 Coupler_System 44974 non-null object \n", - " 39 Grouser_Tracks 44875 non-null object \n", - " 40 Hydraulics_Flow 44875 non-null object \n", - " 41 Track_Type 102193 non-null object \n", - " 42 Undercarriage_Pad_Width 102916 non-null object \n", - " 43 Stick_Length 102261 non-null object \n", - " 44 Thumb 102332 non-null object \n", - " 45 Pattern_Changer 102261 non-null object \n", - " 46 Grouser_Type 102193 non-null object \n", - " 47 Backhoe_Mounting 80712 non-null object \n", - " 48 Blade_Type 81875 non-null object \n", - " 49 Travel_Controls 81877 non-null object \n", - " 50 Differential_Type 71564 non-null object \n", - " 51 Steering_Controls 71522 non-null object \n", - " 52 saleYear 412698 non-null int64 \n", - " 53 saleMonth 412698 non-null int64 \n", - " 54 saleDay 412698 non-null int64 \n", - " 55 saleDayofweek 412698 non-null int64 \n", - " 56 saleDayofyear 412698 non-null int64 \n", - "dtypes: float64(3), int64(10), object(44)\n", - "memory usage: 179.5+ MB\n" - ] - } - ], - "source": [ - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... what happened here? 
\n", - "\n", - "Notice that all of the `category` datatype columns are back to the `object` datatype.\n", - "\n", - "This is strange since we already converted the `object` datatype columns to `category`.\n", - "\n", - "Well then why did they change back?\n", - "\n", - "This happens because of a limitation of the CSV (`.csv`) file format: it doesn't preserve datatypes and instead stores all of the values as plain text.\n", - "\n", - "So when we read in a CSV, pandas has to infer the datatypes again, and by default text columns come back as the `object` datatype.\n", - "\n", - "Not to worry though, we can easily convert them to the `category` datatype as we did before.\n", - "\n", - "> **Note:** If you'd like to retain the datatypes when saving your data, you can use file formats such as [`parquet`](https://pandas.pydata.org/docs/user_guide/io.html#parquet) (Apache Parquet) and [`feather`](https://pandas.pydata.org/docs/user_guide/io.html#feather). These filetypes have several advantages over CSV in terms of processing speeds and storage size. However, data stored in these formats is not human-readable, so you won't be able to open the files and inspect them without specific tools. For more on different file formats in pandas, see the [IO tools documentation page](https://pandas.pydata.org/docs/user_guide/io.html#)."
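The CSV round trip described above can be reproduced on a toy DataFrame. This is a minimal sketch with made-up values and an in-memory buffer instead of a file on disk. It also shows the `dtype` parameter of `pd.read_csv`, which is an alternative to converting the columns after loading and not the approach this notebook takes.

```python
import io

import pandas as pd

# Hypothetical toy data with one category column
toy = pd.DataFrame({
    "state": pd.Series(["Texas", "Florida", "Texas"], dtype="category"),
    "SalePrice": [9500.0, 14000.0, 50000.0],
})

# Calling to_csv() without a path returns the CSV contents as a string
csv_text = toy.to_csv(index=False)

# Default read: the category datatype is lost
reloaded_default = pd.read_csv(io.StringIO(csv_text))

# Explicit read: ask pandas to parse the column as a category
reloaded_as_category = pd.read_csv(io.StringIO(csv_text), dtype={"state": "category"})

default_is_categorical = isinstance(reloaded_default["state"].dtype, pd.CategoricalDtype)
explicit_is_categorical = isinstance(reloaded_as_category["state"].dtype, pd.CategoricalDtype)

print(f"Category preserved by default: {default_is_categorical}")
print(f"Category preserved with dtype argument: {explicit_is_categorical}")
```

Listing every column in `dtype` gets tedious for a DataFrame with 44 text columns, which is one reason a loop over `.items()` or a format such as `parquet` can be more convenient.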
- ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [], - "source": [ - "for label, content in df_tmp.items():\n", - "    if pd.api.types.is_object_dtype(content):\n", - "        # Turn object columns into category datatype\n", - "        df_tmp[label] = df_tmp[label].astype(\"category\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now if we want to preserve the datatypes of our data, we can save to `parquet` or `feather` format.\n", - "\n", - "Let's try using `parquet` format.\n", - "\n", - "To do so, we can use the [`pandas.DataFrame.to_parquet()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html) method.\n", - "\n", - "Files in the `parquet` format typically have the file extension of `.parquet`." - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [], - "source": [ - "# To save to parquet format requires pyarrow or fastparquet (or both)\n", - "# Can install via `pip install pyarrow fastparquet`\n", - "df_tmp.to_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet\",\n", - "                  engine=\"auto\") # \"auto\" will automatically use pyarrow or fastparquet, defaulting to pyarrow first" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful! Now let's try importing our DataFrame from the `parquet` format and check it using `df_tmp.info()`."
- ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64 \n", - " 8 UsageBand 73670 non-null category\n", - " 9 fiModelDesc 412698 non-null category\n", - " 10 fiBaseModel 412698 non-null category\n", - " 11 fiSecondaryDesc 271971 non-null category\n", - " 12 fiModelSeries 58667 non-null category\n", - " 13 fiModelDescriptor 74816 non-null category\n", - " 14 ProductSize 196093 non-null category\n", - " 15 fiProductClassDesc 412698 non-null category\n", - " 16 state 412698 non-null category\n", - " 17 ProductGroup 412698 non-null category\n", - " 18 ProductGroupDesc 412698 non-null category\n", - " 19 Drive_System 107087 non-null category\n", - " 20 Enclosure 412364 non-null category\n", - " 21 Forks 197715 non-null category\n", - " 22 Pad_Type 81096 non-null category\n", - " 23 Ride_Control 152728 non-null category\n", - " 24 Stick 81096 non-null category\n", - " 25 Transmission 188007 non-null category\n", - " 26 Turbocharged 81096 non-null category\n", - " 27 Blade_Extension 25983 non-null category\n", - " 28 Blade_Width 25983 non-null category\n", - " 29 Enclosure_Type 25983 non-null category\n", - " 30 Engine_Horsepower 25983 non-null category\n", - " 31 Hydraulics 330133 non-null category\n", - " 32 Pushblock 25983 non-null category\n", - " 33 Ripper 106945 non-null category\n", - " 34 Scarifier 25994 non-null category\n", - " 35 
Tip_Control 25983 non-null category\n", - " 36 Tire_Size 97638 non-null category\n", - " 37 Coupler 220679 non-null category\n", - " 38 Coupler_System 44974 non-null category\n", - " 39 Grouser_Tracks 44875 non-null category\n", - " 40 Hydraulics_Flow 44875 non-null category\n", - " 41 Track_Type 102193 non-null category\n", - " 42 Undercarriage_Pad_Width 102916 non-null category\n", - " 43 Stick_Length 102261 non-null category\n", - " 44 Thumb 102332 non-null category\n", - " 45 Pattern_Changer 102261 non-null category\n", - " 46 Grouser_Type 102193 non-null category\n", - " 47 Backhoe_Mounting 80712 non-null category\n", - " 48 Blade_Type 81875 non-null category\n", - " 49 Travel_Controls 81877 non-null category\n", - " 50 Differential_Type 71564 non-null category\n", - " 51 Steering_Controls 71522 non-null category\n", - " 52 saleYear 412698 non-null int64 \n", - " 53 saleMonth 412698 non-null int64 \n", - " 54 saleDay 412698 non-null int64 \n", - " 55 saleDayofweek 412698 non-null int64 \n", - " 56 saleDayofyear 412698 non-null int64 \n", - "dtypes: category(44), float64(3), int64(10)\n", - "memory usage: 60.1 MB\n" - ] - } - ], - "source": [ - "# Read in df_tmp from parquet format\n", - "df_tmp = pd.read_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet\",\n", - " engine=\"auto\")\n", - "\n", - "# Using parquet format, datatypes are preserved\n", - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice! Looks like using the `parquet` format preserved all of our datatypes.\n", - "\n", - "For more on the `parquet` and `feather` formats, be sure to check out the [pandas IO (input/output) documentation](https://pandas.pydata.org/docs/user_guide/io.html#parquet)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.4 Finding and filling missing values\n", - "\n", - "Let's remind ourselves of the missing values by getting the top 20 columns with the most missing values.\n", - "\n", - "We do so by summing the results of [`pandas.DataFrame.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) (which gives a count of missing values per column) and then using [`sort_values(ascending=False)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) to list the columns with the most missing values first." - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Blade_Width 386715\n", - "Engine_Horsepower 386715\n", - "Tip_Control 386715\n", - "Pushblock 386715\n", - "Blade_Extension 386715\n", - "Enclosure_Type 386715\n", - "Scarifier 386704\n", - "Hydraulics_Flow 367823\n", - "Grouser_Tracks 367823\n", - "Coupler_System 367724\n", - "fiModelSeries 354031\n", - "Steering_Controls 341176\n", - "Differential_Type 341134\n", - "UsageBand 339028\n", - "fiModelDescriptor 337882\n", - "Backhoe_Mounting 331986\n", - "Stick 331602\n", - "Turbocharged 331602\n", - "Pad_Type 331602\n", - "Blade_Type 330823\n", - "dtype: int64" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check missing values\n", - "df_tmp.isna().sum().sort_values(ascending=False)[:20]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, it seems like there are a fair few columns with missing values and there are several datatypes across these columns (numerical, categorical).\n", - "\n", - "How about we break the problem down and work on filling each datatype separately?" 
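As a rough sketch of what breaking the problem down by datatype can look like, the toy example below counts missing values per column and then splits those counts into numeric and non-numeric groups. The DataFrame values are invented for illustration, and using `select_dtypes` for the split is an assumption of this sketch rather than the approach taken in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with hypothetical values (not the bulldozer data)
toy_df = pd.DataFrame({
    "MachineHoursCurrentMeter": [68.0, np.nan, np.nan, 1500.0],
    "YearMade": [1998, 2005, 2010, 1987],
    "UsageBand": pd.Series(["Low", None, "High", "Medium"], dtype="category"),
})

# Count missing values per column, largest count first
missing_counts = toy_df.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Split the counts by datatype so each group can be handled separately
numeric_columns = toy_df.select_dtypes(include="number").columns
other_columns = toy_df.columns.difference(numeric_columns)
numeric_missing = missing_counts.loc[numeric_columns]
other_missing = missing_counts.loc[other_columns]
print(f"Numeric columns:\n{numeric_missing}")
print(f"Non-numeric columns:\n{other_missing}")
```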
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.5 Filling missing numerical values\n", - "\n", - "There's no set way to fill missing values in your dataset.\n", - "\n", - "And unless you're filling the missing samples with newly discovered actual data, every way you fill your dataset's missing values will introduce some sort of noise or bias. \n", - "\n", - "We'll start by filling the missing numerical values in our dataset.\n", - "\n", - "To do this, we'll first find the numeric datatype columns.\n", - "\n", - "We can do so by looping through the columns in our DataFrame and calling [`pd.api.types.is_numeric_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_numeric_dtype.html) on them." - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Column dtype: int64 | Example value: [1748586] | Example value dtype: integer\n", - "Column name: SalePrice | Column dtype: float64 | Example value: [13000.] | Example value dtype: floating\n", - "Column name: MachineID | Column dtype: int64 | Example value: [1441940] | Example value dtype: integer\n", - "Column name: ModelID | Column dtype: int64 | Example value: [1333] | Example value dtype: integer\n", - "Column name: datasource | Column dtype: int64 | Example value: [132] | Example value dtype: integer\n", - "Column name: auctioneerID | Column dtype: float64 | Example value: [2.] 
| Example value dtype: floating\n", - "Column name: YearMade | Column dtype: int64 | Example value: [1000] | Example value dtype: integer\n", - "Column name: MachineHoursCurrentMeter | Column dtype: float64 | Example value: [nan] | Example value dtype: floating\n", - "Column name: saleYear | Column dtype: int64 | Example value: [2010] | Example value dtype: integer\n", - "Column name: saleMonth | Column dtype: int64 | Example value: [6] | Example value dtype: integer\n", - "Column name: saleDay | Column dtype: int64 | Example value: [16] | Example value dtype: integer\n", - "Column name: saleDayofweek | Column dtype: int64 | Example value: [3] | Example value dtype: integer\n", - "Column name: saleDayofyear | Column dtype: int64 | Example value: [285] | Example value dtype: integer\n" - ] - } - ], - "source": [ - "# Find numeric columns \n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " # Check datatype of target column\n", - " column_datatype = df_tmp[label].dtype.name\n", - "\n", - " # Get random sample from column values\n", - " example_value = content.sample(1).values\n", - "\n", - " # Infer random sample datatype\n", - " example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - " print(f\"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! Looks like we've got a mixture of `int64` and `float64` numerical datatypes.\n", - "\n", - "Now how about we find out which numeric columns are missing values?\n", - "\n", - "We can do so by using `pandas.isnull(obj).sum()` to detect and sum the missing values in a given array-like object (in our case, the data in a target column).\n", - "\n", - "Let's loop through our DataFrame columns, find the numeric datatypes and check if they have any missing values." 
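For comparison, the same check can be written without an explicit loop. The sketch below is one possible vectorized version on a toy DataFrame (hypothetical values); it assumes `select_dtypes(include="number")` is an acceptable stand-in for the `is_numeric_dtype` test used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with hypothetical values (not the bulldozer data)
toy_df = pd.DataFrame({
    "auctioneerID": [1.0, np.nan, 3.0],
    "YearMade": [1998, 2005, 2010],
    "state": ["Alabama", None, "Texas"],
})

# Keep only numeric columns, then flag the ones containing any missing value
numeric_has_missing = toy_df.select_dtypes(include="number").isna().any()
print(numeric_has_missing)

# Names of the numeric columns that need filling
columns_with_missing = numeric_has_missing[numeric_has_missing].index.tolist()
print(columns_with_missing)
```

The explicit loop has the advantage of printing one readable line per column, so which version is preferable is mostly a matter of taste.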
- ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Has missing values: False\n", - "Column name: SalePrice | Has missing values: False\n", - "Column name: MachineID | Has missing values: False\n", - "Column name: ModelID | Has missing values: False\n", - "Column name: datasource | Has missing values: False\n", - "Column name: auctioneerID | Has missing values: True\n", - "Column name: YearMade | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter | Has missing values: True\n", - "Column name: saleYear | Has missing values: False\n", - "Column name: saleMonth | Has missing values: False\n", - "Column name: saleDay | Has missing values: False\n", - "Column name: saleDayofweek | Has missing values: False\n", - "Column name: saleDayofyear | Has missing values: False\n" - ] - } - ], - "source": [ - "# Check for which numeric columns have null values\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " print(f\"Column name: {label} | Has missing values: {True}\")\n", - " else:\n", - " print(f\"Column name: {label} | Has missing values: {False}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay, it looks like our `auctioneerID` and `MachineHoursCurrentMeter` columns have missing numeric values.\n", - "\n", - "Let's have a look at how we might handle these." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.6 Discussing possible ways to handle missing values\n", - "\n", - "As previously discussed, there are many ways to fill missing values.\n", - "\n", - "For missing numeric values, some potential options are:\n", - "\n", - "| **Method** | **Pros** | **Cons** |\n", - "|-----|-----|-----|\n", - "| **Fill with mean of column** | - Easy to calculate/implement <br> - Preserves the column mean | - Averages out variation <br> - Affected by outliers (e.g. if one value is much higher/lower than others) |\n", - "| **Fill with median of column** | - Easy to calculate/implement <br> - Robust to outliers <br> - Preserves center of data | - Ignores data distribution shape |\n", - "| **Fill with mode of column** | - Easy to calculate/implement <br> - More useful for categorical-like data | - May not make sense for continuous/numerical data |\n", - "| **Fill with 0 (or another constant)** | - Simple to implement <br> - Useful in certain contexts like counts | - Introduces bias (e.g. if 0 was a value that meant something) <br> - Skews data (e.g. if many missing values, replacing all with 0 makes it look like that's the most common value) |\n", - "| **Forward/Backward fill (use previous/future values to fill future/previous values)** | - Maintains temporal continuity (for time series) | - Assumes data is continuous, which may not be valid |\n", - "| **Use a calculation from other columns** | - Takes existing information and reinterprets it | - Can result in unlikely outputs if calculations are not continuous | \n", - "| **Interpolate (e.g. 
like dragging a cell in Excel/Google Sheets)** | - Captures trends <br> - Suitable for ordered data | - Can introduce errors <br> - May assume linearity (data continues in a straight line) |\n", - "| **Drop missing values** | - Ensures complete data (only use samples with all information) <br> - Useful when only a small fraction of samples have missing values | - Can result in data loss (e.g. if many missing values are scattered across columns, data size can be dramatically reduced) <br> - Reduces dataset size |\n", - "\n", - "Which method you choose will be dataset and problem dependent and will likely require several phases of experimentation to see what works and what doesn't.\n", - "\n", - "For now, we'll fill our missing numeric values with the median value of the column being filled.\n", - "\n", - "We'll also add a binary column (0 or 1) with rows reflecting whether or not a value was missing.\n", - "\n", - "For example, `MachineHoursCurrentMeter_is_missing` will be a column with rows which have a value of `0` if that row's `MachineHoursCurrentMeter` value was *not* missing and `1` if it was.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [], - "source": [ - "# Fill missing numeric values with the median of the column being filled\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " \n", - " # Add a binary column which tells if the data was missing or not\n", - " df_tmp[label+\"_is_missing\"] = pd.isnull(content).astype(int) # this adds a 0 or 1 value to every row (0 = not missing, 1 = missing)\n", - "\n", - " # Fill missing numeric values with median since it's more robust to outliers than the mean\n", - " df_tmp[label] = content.fillna(content.median())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Why add a binary column indicating whether the data was missing or not?\n", - "\n", - "We can easily fill all of 
the missing numeric values in our dataset with the median. \n", - "\n", - "However, a numeric value may be missing for a reason. \n", - "\n", - "Adding a binary column which indicates whether the value was missing or not helps to retain this information. It also means we can inspect these rows later on." - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML rendering of the DataFrame sample omitted: 5 rows × 59 columns. The same rows appear in the text/plain output that follows.]
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "150110 1631531 35000.0 1267456 4794 132 2.0 \n", - "111297 1327658 15000.0 1185021 4112 132 99.0 \n", - "177121 1432179 52000.0 788654 1263 132 1.0 \n", - "138512 1440179 27000.0 790577 3547 132 7.0 \n", - "69375 1473901 67500.0 196530 3823 132 6.0 \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "150110 1998 0.0 NaN 710D ... \n", - "111297 1980 0.0 NaN D5B ... \n", - "177121 1997 0.0 NaN 330BL ... \n", - "138512 1999 0.0 NaN 426C ... \n", - "69375 1991 0.0 NaN 950F ... \n", - "\n", - " Travel_Controls Differential_Type Steering_Controls saleYear \\\n", - "150110 NaN NaN NaN 2003 \n", - "111297 None or Unspecified NaN NaN 2001 \n", - "177121 NaN NaN NaN 2005 \n", - "138512 NaN NaN NaN 2002 \n", - "69375 NaN Standard Conventional 1998 \n", - "\n", - " saleMonth saleDay saleDayofweek saleDayofyear auctioneerID_is_missing \\\n", - "150110 9 11 3 254 0 \n", - "111297 5 8 1 128 0 \n", - "177121 2 15 1 46 0 \n", - "138512 12 6 4 340 0 \n", - "69375 8 27 3 239 0 \n", - "\n", - " MachineHoursCurrentMeter_is_missing \n", - "150110 1 \n", - "111297 1 \n", - "177121 1 \n", - "138512 1 \n", - "69375 1 \n", - "\n", - "[5 rows x 59 columns]" - ] - }, - "execution_count": 55, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show rows where MachineHoursCurrentMeter_is_missing == 1\n", - "df_tmp[df_tmp[\"MachineHoursCurrentMeter_is_missing\"] == 1].sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Missing numeric values filled!\n", - "\n", - "How about we check again whether or not the numeric columns have missing values?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Has missing values: False\n", - "Column name: SalePrice | Has missing values: False\n", - "Column name: MachineID | Has missing values: False\n", - "Column name: ModelID | Has missing values: False\n", - "Column name: datasource | Has missing values: False\n", - "Column name: auctioneerID | Has missing values: False\n", - "Column name: YearMade | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter | Has missing values: False\n", - "Column name: saleYear | Has missing values: False\n", - "Column name: saleMonth | Has missing values: False\n", - "Column name: saleDay | Has missing values: False\n", - "Column name: saleDayofweek | Has missing values: False\n", - "Column name: saleDayofyear | Has missing values: False\n", - "Column name: auctioneerID_is_missing | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter_is_missing | Has missing values: False\n" - ] - } - ], - "source": [ - "# Check for which numeric columns have null values\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " print(f\"Column name: {label} | Has missing values: {True}\")\n", - " else:\n", - " print(f\"Column name: {label} | Has missing values: {False}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woohoo! Numeric missing values filled!\n", - "\n", - "And thanks to our binary `_is_missing` columns, we can even check how many were missing." 
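To make the median-plus-indicator idea concrete, here is a minimal sketch on a five-row toy column (hypothetical values, including one deliberately large outlier). It records which rows were missing, fills them with the median, and counts the flags in the same way as the next cell.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one large outlier (hypothetical values)
column = "MachineHoursCurrentMeter"
toy_df = pd.DataFrame({column: [100.0, 200.0, np.nan, 300.0, 10000.0]})

# Record which rows were missing *before* filling them
toy_df[column + "_is_missing"] = toy_df[column].isna().astype(int)

# The median is pulled far less by the outlier than the mean is
column_median = toy_df[column].median()
column_mean = toy_df[column].mean()
print(f"Median: {column_median} | Mean: {column_mean}")

# Fill the missing value with the median
toy_df[column] = toy_df[column].fillna(column_median)

# The indicator column still tells us how many values were originally missing
missing_counts = toy_df[column + "_is_missing"].value_counts()
print(missing_counts)
```

Note that the indicator column has to be created before the fill, otherwise the information about which rows were missing is already gone.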
- ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "auctioneerID_is_missing\n", - "0 392562\n", - "1 20136\n", - "Name: count, dtype: int64" - ] - }, - "execution_count": 57, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check to see how many examples in the auctioneerID column were missing\n", - "df_tmp.auctioneerID_is_missing.value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.7 Filling missing categorical values with pandas\n", - "\n", - "Now that we've filled the missing numeric values, we'll do the same with the categorical values whilst ensuring that they are all numerical too.\n", - "\n", - "Let's first investigate the columns which *aren't* numeric (we've already handled the numeric ones). " - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Columns which are not numeric:\n", - "Column name: UsageBand | Column dtype: category\n", - "Column name: fiModelDesc | Column dtype: category\n", - "Column name: fiBaseModel | Column dtype: category\n", - "Column name: fiSecondaryDesc | Column dtype: category\n", - "Column name: fiModelSeries | Column dtype: category\n", - "Column name: fiModelDescriptor | Column dtype: category\n", - "Column name: ProductSize | Column dtype: category\n", - "Column name: fiProductClassDesc | Column dtype: category\n", - "Column name: state | Column dtype: category\n", - "Column name: ProductGroup | Column dtype: category\n", - "Column name: ProductGroupDesc | Column dtype: category\n", - "Column name: Drive_System | Column dtype: category\n", - "Column name: Enclosure | Column dtype: category\n", - "Column name: Forks | Column dtype: category\n", - "Column name: Pad_Type | Column dtype: category\n", - "Column name: Ride_Control | Column dtype: category\n", - "Column name: Stick | 
Column dtype: 
category\n", - "Column name: Transmission | Column dtype: category\n", - "Column name: Turbocharged | Column dtype: category\n", - "Column name: Blade_Extension | Column dtype: category\n", - "Column name: Blade_Width | Column dtype: category\n", - "Column name: Enclosure_Type | Column dtype: category\n", - "Column name: Engine_Horsepower | Column dtype: category\n", - "Column name: Hydraulics | Column dtype: category\n", - "Column name: Pushblock | Column dtype: category\n", - "Column name: Ripper | Column dtype: category\n", - "Column name: Scarifier | Column dtype: category\n", - "Column name: Tip_Control | Column dtype: category\n", - "Column name: Tire_Size | Column dtype: category\n", - "Column name: Coupler | Column dtype: category\n", - "Column name: Coupler_System | Column dtype: category\n", - "Column name: Grouser_Tracks | Column dtype: category\n", - "Column name: Hydraulics_Flow | Column dtype: category\n", - "Column name: Track_Type | Column dtype: category\n", - "Column name: Undercarriage_Pad_Width | Column dtype: category\n", - "Column name: Stick_Length | Column dtype: category\n", - "Column name: Thumb | Column dtype: category\n", - "Column name: Pattern_Changer | Column dtype: category\n", - "Column name: Grouser_Type | Column dtype: category\n", - "Column name: Backhoe_Mounting | Column dtype: category\n", - "Column name: Blade_Type | Column dtype: category\n", - "Column name: Travel_Controls | Column dtype: category\n", - "Column name: Differential_Type | Column dtype: category\n", - "Column name: Steering_Controls | Column dtype: category\n" - ] - } - ], - "source": [ - "# Check columns which aren't numeric\n", - "print(f\"[INFO] Columns which are not numeric:\")\n", - "for label, content in df_tmp.items():\n", - " if not pd.api.types.is_numeric_dtype(content):\n", - " print(f\"Column name: {label} | Column dtype: {df_tmp[label].dtype.name}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay, we've got plenty 
of category type columns.\n", - "\n", - "Let's now write some code to fill the missing categorical values as well as ensure they are numerical (non-string). \n", - "\n", - "To do so, we'll:\n", - "\n", - "1. Create a blank column to category dictionary, we'll use this to store categorical value names (e.g. their string name) as well as their categorical code. We'll end with a dictionary of dictionaries in the form `{\"column_name\": {category_code: \"category_value\"...}...}`.\n", - "2. Loop through the items in the DataFrame.\n", - "3. Check whether the column is *not* numeric (these are the columns we still need to convert).\n", - "4. Add a binary column in the form `ORIGINAL_COLUMN_NAME_is_missing` with a `0` or `1` value for if the row had a missing value.\n", - "5. Ensure the column values are in the `pd.Categorical` datatype and get their category codes via the `codes` attribute (we'll add `1` to these values since pandas defaults to assigning `-1` to `NaN` values, we'll use `0` instead).\n", - "6. Turn the column category codes and column categories from 5 into a dictionary with Python's [`dict(zip(category_codes, category_names))`](https://docs.python.org/3.3/library/functions.html#zip) and save this to the blank dictionary from 1 with the target column name as key.\n", - "7. Set the target column value to the numerical category values from 5.\n", - "\n", - "Phew!\n", - "\n", - "That's a fair few steps but nothing we can't handle.\n", - "\n", - "Let's do it!" - ] - }, - { - "cell_type": "code", - "execution_count": 59, - "metadata": {}, - "outputs": [], - "source": [ - "# 1. Create a dictionary to store column to category values (e.g. we turn our category types into numbers but we keep a record so we can go back)\n", - "column_to_category_dict = {} \n", - "\n", - "# 2. Turn categorical variables into numbers\n", - "for label, content in df_tmp.items():\n", - "\n", - " # 3. Check columns which *aren't* numeric\n", - " if not pd.api.types.is_numeric_dtype(content):\n", - "\n", - " # 4. 
Add binary column to indicate whether sample had missing value\n", - " df_tmp[label+\"_is_missing\"] = pd.isnull(content).astype(int)\n", - "\n", - " # 5. Ensure content is categorical and get its category codes\n", - " content_categories = pd.Categorical(content)\n", - " content_category_codes = content_categories.codes + 1 # prevents -1 (the default for NaN values) from being used for missing values (we'll treat missing values as 0)\n", - "\n", - " # 6. Add column key to dictionary with code: category mapping per column\n", - " column_to_category_dict[label] = dict(zip(content_category_codes, content_categories))\n", - " \n", - " # 7. Set the column to the numerical values (the category code value) \n", - " df_tmp[label] = content_category_codes " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ho ho! No errors!\n", - "\n", - "Let's check out a few random samples of our DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML rendering of the DataFrame sample omitted: 5 rows × 103 columns. The same rows appear in the text/plain output that follows.]
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "232167 2412660 53000.0 1144729 607 136 1.0 \n", - "398100 1221746 18500.0 1047245 2759 121 3.0 \n", - "363820 2502559 31000.0 1333542 3172 149 1.0 \n", - "322230 2432752 7500.0 1537457 36033 136 1.0 \n", - "10401 1356581 26000.0 1394933 4090 132 1.0 \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "232167 2000 0.0 0 4823 ... \n", - "398100 1000 319.0 2 2224 ... \n", - "363820 2007 1149.0 3 1081 ... \n", - "322230 2003 0.0 0 4259 ... \n", - "10401 1988 0.0 0 2121 ... \n", - "\n", - " Undercarriage_Pad_Width_is_missing Stick_Length_is_missing \\\n", - "232167 1 1 \n", - "398100 1 1 \n", - "363820 1 1 \n", - "322230 1 1 \n", - "10401 1 1 \n", - "\n", - " Thumb_is_missing Pattern_Changer_is_missing Grouser_Type_is_missing \\\n", - "232167 1 1 1 \n", - "398100 1 1 1 \n", - "363820 1 1 1 \n", - "322230 1 1 1 \n", - "10401 1 1 1 \n", - "\n", - " Backhoe_Mounting_is_missing Blade_Type_is_missing \\\n", - "232167 1 1 \n", - "398100 0 0 \n", - "363820 1 1 \n", - "322230 1 1 \n", - "10401 0 0 \n", - "\n", - " Travel_Controls_is_missing Differential_Type_is_missing \\\n", - "232167 1 0 \n", - "398100 0 1 \n", - "363820 1 1 \n", - "322230 1 1 \n", - "10401 0 1 \n", - "\n", - " Steering_Controls_is_missing \n", - "232167 0 \n", - "398100 1 \n", - "363820 1 \n", - "322230 1 \n", - "10401 1 \n", - "\n", - "[5 rows x 103 columns]" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_tmp.sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! Looks like our data is all in numerical form.\n", - "\n", - "How about we investigate an item from our `column_to_category_dict`?\n", - "\n", - "This will show the mapping from numerical value to category (most likely a string) value." 
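Before looking at the real mapping, here is a small standalone sketch of how category codes and a code-to-category mapping relate, using a toy `UsageBand`-like Series (hypothetical values). Building the mapping from the `categories` attribute with `enumerate` is an assumption of this sketch; the cell above builds it by zipping codes with values instead.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical values)
usage_band = pd.Series(["Low", np.nan, "High", "Medium", "Low"])

# pandas orders string categories alphabetically: High, Low, Medium
categories = pd.Categorical(usage_band)
print(list(categories.categories))

# Raw codes use -1 for missing values, so adding 1 shifts missing to 0
codes = categories.codes + 1
print(codes.tolist())

# Build the code -> category mapping from the unique categories only
code_to_category = dict(enumerate(categories.categories, start=1))
code_to_category[0] = np.nan  # 0 is reserved for missing values

# The mapping lets us go from numbers back to the original values
decoded = [code_to_category[code] for code in codes]
print(decoded)
```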
- ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0 -> nan\n", - "1 -> High\n", - "2 -> Low\n", - "3 -> Medium\n" - ] - } - ], - "source": [ - "# Check the UsageBand (measure of bulldozer usage)\n", - "for key, value in sorted(column_to_category_dict[\"UsageBand\"].items()): # note: calling sorted() on dictionary.items() returns the (key, value) pairs sorted by key \n", - " print(f\"{key} -> {value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> **Note:** Categorical values do not necessarily have order. They are strictly a mapping from number to value. In this case, pandas has assigned the codes in alphabetical order of the category names, not in order of meaning. If you feel that the order of a value may influence a model in a negative way (e.g. `1 -> High` is *lower* than `3 -> Medium` but should be *higher*), you may want to look into ordering the values in a particular way or using a different numerical encoding technique such as [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).\n", - "\n", - "And we can do the same for the `state` column values." - ] - }, - { - "cell_type": "code", - "execution_count": 62, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1 -> Alabama\n", - "2 -> Alaska\n", - "3 -> Arizona\n", - "4 -> Arkansas\n", - "5 -> California\n", - "6 -> Colorado\n", - "7 -> Connecticut\n", - "8 -> Delaware\n", - "9 -> Florida\n", - "10 -> Georgia\n" - ] - } - ], - "source": [ - "# Check the first 10 state column values\n", - "for key, value in sorted(column_to_category_dict[\"state\"].items())[:10]:\n", - " print(f\"{key} -> {value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "How about we check to see whether all of the missing values have been filled?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Total missing values: 0 - Woohoo! Let's build a model!\n" - ] - } - ], - "source": [ - "# Check total number of missing values\n", - "total_missing_values = df_tmp.isna().sum().sum()\n", - "\n", - "if total_missing_values == 0:\n", - "    print(f\"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!\")\n", - "else:\n", - "    print(f\"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.8 Saving our preprocessed data (part 2)\n", - "\n", - "One more step before we train a new model!\n", - "\n", - "Let's save our work so far so we can re-import our preprocessed dataset later if we want to.\n", - "\n", - "We'll save it to the `parquet` format again, this time with a suffix to show we've filled the missing values." - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [], - "source": [ - "# Save preprocessed data with object values as categories as well as missing values filled\n", - "df_tmp.to_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet\",\n", - "                  engine=\"auto\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And to make sure it worked, we can re-import it." - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [], - "source": [ - "# Read in preprocessed dataset\n", - "df_tmp = pd.read_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet\",\n", - "                         engine=\"auto\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Does it have any missing values?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Total missing values: 0 - Woohoo! Let's build a model!\n" - ] - } - ], - "source": [ - "# Check total number of missing values\n", - "total_missing_values = df_tmp.isna().sum().sum()\n", - "\n", - "if total_missing_values == 0:\n", - "    print(f\"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!\")\n", - "else:\n", - "    print(f\"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Checkpoint reached!\n", - "\n", - "We've turned all of our data into numbers and filled the missing values. Time to try fitting a model to it again." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.9 Fitting a machine learning model to our preprocessed data\n", - "\n", - "Now that all of our data is numeric and there are no missing values, we should be able to fit a machine learning model to it!\n", - "\n", - "Let's reinstantiate our trusty [`sklearn.ensemble.RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model.\n", - "\n", - "Since our dataset has a substantial number of rows (~400k+), let's first make sure the model will work on a smaller sample of 1000 or so.\n", - "\n", - "> **Note:** It's common practice on machine learning problems to see if your experiments will work on smaller scale problems (e.g. smaller amounts of data) before scaling them up to the full dataset. This practice enables you to try many different kinds of experiments with faster runtimes. 
The benefit of this is that you can figure out what doesn't work before spending more time on what does.\n", - "\n", - "Our `X` values (features) will be every column except the `SalePrice` column.\n", - "\n", - "And our `y` values (labels) will be the entirety of the `SalePrice` column.\n", - "\n", - "\n", - "We'll time how long our smaller experiment takes using the [magic function `%%time`](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html) and placing it at the top of the notebook cell.\n", - "\n", - "> **Note:** You can find out more about the `%%time` magic command by typing `%%time?` (note the question mark on the end) in a notebook cell.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 1.06 s, sys: 2.37 s, total: 3.43 s\n", - "Wall time: 976 ms\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - "RandomForestRegressor(n_jobs=-1)" - ] - }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Sample 1000 samples with random state 42 for reproducibility\n", - "df_tmp_sample_1k = df_tmp.sample(n=1000, random_state=42)\n", - "\n", - "# Instantiate a model\n", - "model = RandomForestRegressor(n_jobs=-1) # use -1 to utilise all available processors\n", - "\n", - "# Create features and labels\n", - "X_sample_1k = df_tmp_sample_1k.drop(\"SalePrice\", axis=1) # use all columns except SalePrice as X values\n", - "y_sample_1k = df_tmp_sample_1k[\"SalePrice\"] # use SalePrice as y values (target variable)\n", - "\n", - "# Fit the model to the sample data\n", - "model.fit(X=X_sample_1k, \n", - " y=y_sample_1k) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! It looks like things worked!\n", - "\n", - "And quite quick too (since we're only using a relatively small number of rows).\n", - "\n", - "How about we score our model?\n", - "\n", - "We can do so using the built-in method `score()`. \n", - "\n", - "By default, `sklearn.ensemble.RandomForestRegressor` uses [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) ($R^2$ or R-squared) as the evaluation metric (higher is better, with a score of 1.0 being perfect)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model score on 1000 samples: 0.9563062437082765\n" - ] - } - ], - "source": [ - "# Evaluate the model\n", - "model_sample_1k_score = model.score(X=X_sample_1k,\n", - " y=y_sample_1k)\n", - "\n", - "print(f\"[INFO] Model score on {len(df_tmp_sample_1k)} samples: {model_sample_1k_score}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wow, it looks like our model got a pretty good score on only 1000 samples (the best possible score it could achieve would've been 1.0). \n", - "\n", - "How about we try our model on the whole dataset?" - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 10min 21s, sys: 8min 31s, total: 18min 53s\n", - "Wall time: 3min 24s\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - "RandomForestRegressor(n_jobs=-1)" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Instantiate model\n", - "model = RandomForestRegressor(n_jobs=-1) # note: this could take quite a while depending on your machine (it took ~1.5 minutes on my MacBook Pro M1 Pro with 10 cores)\n", - "\n", - "# Create features and labels with entire dataset\n", - "X_all = df_tmp.drop(\"SalePrice\", axis=1)\n", - "y_all = df_tmp[\"SalePrice\"]\n", - "\n", - "# Fit the model\n", - "model.fit(X=X_all, \n", - " y=y_all)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, that took a little bit longer than fitting on 1000 samples (but that's to be expected, as many more calculations had to be made).\n", - "\n", - "There's a reason we used `n_jobs=-1` too.\n", - "\n", - "If we stuck with the default of `n_jobs=None` (the same as `n_jobs=1`), it would've taken much longer.\n", - "\n", - "| Configuration (MacBook Pro M1 Pro, 10 Cores) | CPU Times (User) | CPU Times (Sys) | CPU Times (Total) | Wall Time |\n", - "|-----|-----|-----|-----|-----|\n", - "| `n_jobs=-1` (all cores) | 9min 14s | 3.85s | 9min 18s | 1min 15s |\n", - "| `n_jobs=None` (default) | 7min 14s | 1.75s | 7min 16s | 7min 25s |\n", - "\n", - "And as we've discussed many times, one of the main goals when starting a machine learning project is to reduce your time between experiments.\n", - "\n", - "How about we score the model trained on all of the data?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model score on 412698 samples: 0.9875710658782831\n" - ] - } - ], - "source": [ - "# Evaluate the model\n", - "model_sample_all_score = model.score(X=X_all,\n", - " y=y_all)\n", - "\n", - "print(f\"[INFO] Model score on {len(df_tmp)} samples: {model_sample_all_score}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "An even better score!\n", - "\n", - "Oh wait...\n", - "\n", - "Oh no...\n", - "\n", - "I think we've got an error... (you might've noticed it already)\n", - "\n", - "Why might this metric be unreliable?\n", - "\n", - "Hint: Compare the data we trained on versus the data we evaluated on." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.10 A big (but fixable) mistake \n", - "\n", - "One of the hard things about bugs in machine learning projects is that they are often silent.\n", - "\n", - "For example, our model seems to have fit the data with no issues and then evaluated with a good score.\n", - "\n", - "So what's wrong?\n", - "\n", - "It seems we've stumbled across one of the most common bugs in machine learning and that's **data leakage** (data from the training set leaking into the validation/testing sets).\n", - "\n", - "We've evaluated our model on the same data it was trained on.\n", - "\n", - "This isn't the model's fault either.\n", - "\n", - "It's our fault.\n", - "\n", - "Right back at the start we imported a file called `TrainAndValid.csv`, this file contains both the training and validation data.\n", - "\n", - "And while we preprocessed it to make sure there were no missing values and the samples were all numeric, we never split the data into separate training and validation splits.\n", - "\n", - "The right workflow would've been to train the model on the training split and then evaluate it on the *unseen* and 
*separate* validation split.\n", - "\n", - "Our evaluation scores above are quite good but they can't necessarily be trusted to be replicated on unseen data (data in the real world) because they've been obtained by evaluating the model on data it's already seen during training. \n", - "\n", - "This would be the equivalent of a final exam at university containing all of the same questions as the practice exam without any changes. You may get a good grade, but does that good grade translate to the real world?\n", - "\n", - "Not to worry, we can fix this!\n", - "\n", - "How?\n", - "\n", - "We can import the training and validation datasets separately via `Train.csv` and `Valid.csv` respectively.\n", - "\n", - "Or we could import `TrainAndValid.csv` and perform the appropriate splits according to the original [Kaggle competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/data) (training data includes all samples prior to 2012 and validation data includes samples from January 1, 2012 to April 30, 2012).\n", - "\n", - "With either method, we'll have to perform preprocessing steps similar to the ones we've done so far.\n", - "\n", - "However, because the validation data is supposed to remain *unseen*, we'll only use information from the training set to preprocess the validation set (and not mix the two). \n", - "\n", - "We'll work on this in the subsequent sections.\n", - "\n", - "The takeaway?\n", - "\n", - "Always (if possible) **create appropriate data splits at the start of a project**.\n", - "\n", - "Because it's one thing to train a machine learning model but if you can't evaluate it properly (on unseen data), how can you know how it'll perform (or may perform) in the real world on new and unseen data?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. 
Splitting data into the right train/validation sets\n", - "\n", - "The bad news is, we evaluated our model on the same data we trained it on.\n", - "\n", - "The good news is, we get to practice importing and preprocessing our data again. \n", - "\n", - "This time we'll make sure we've got separate training and validation splits. \n", - "\n", - "Previously, we used pandas to ensure our data was all numeric and had no missing values. \n", - "\n", - "And we can still use pandas for things such as creating/altering date-related columns.\n", - "\n", - "But using pandas for all of our data preprocessing can be an issue with larger scale datasets or when new data is introduced. \n", - "\n", - "How about this time we add Scikit-Learn to the mix and make a reproducible pipeline for our data preprocessing needs?\n", - "\n", - "> **Note:** Scikit-Learn has a fantastic guide on [data transformations](https://scikit-learn.org/1.5/data_transforms.html) and in particular [data preprocessing](https://scikit-learn.org/1.5/modules/preprocessing.html). I'd highly recommend spending an hour or so reading through this documentation, even if it doesn't make a lot of sense to begin with. 
Rest assured, with practice and experimentation you'll start to get the hang of it.\n", - "\n", - "According to the [Kaggle data page](https://www.kaggle.com/c/bluebook-for-bulldozers/data), the train, validation and test sets are split according to dates.\n", - "\n", - "This makes sense since we're working on a time series problem (using past sale prices to try and predict future sale prices).\n", - "\n", - "Knowing this, randomly splitting our data into train, validation and test sets using something like [`sklearn.model_selection.train_test_split()`](https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html) wouldn't work as this would mix samples from different dates in an unintended way.\n", - "\n", - "Instead, we split our data into training, validation and test sets using the date each sample occurred.\n", - "\n", - "In our case:\n", - "\n", - "* Training data (`Train.csv`) = all samples up until the end of 2011.\n", - "* Validation data (`Valid.csv`) = all samples from January 1, 2012 - April 30, 2012.\n", - "* Testing data (`Test.csv`) = all samples from May 1, 2012 - November 2012.\n", - "\n", - "Previously we imported `TrainAndValid.csv` which is a combination of `Train.csv` and `Valid.csv` in one file.\n", - "\n", - "We could split this based on the `saledate` column.\n", - "\n", - "However, we could also import the `Train.csv` and `Valid.csv` files separately (we'll import `Test.csv` later on when we've trained a model).\n", - "\n", - "We'll also import `ValidSolution.csv` which contains the `SalePrice` of `Valid.csv` and make sure we match the columns based on the `SalesID` key.\n", - "\n", - "> **Note:** For more on making good training, validation and test sets, check out the post [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) by Rachel Thomas as well as [The importance of a test set](https://www.learnml.io/posts/the-importance-of-a-test-set/) by Daniel Bourke." 
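- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before importing the files separately, here's a rough sketch of the other option mentioned above: splitting a combined DataFrame on its `saledate` column.\n",
- "\n",
- "> **Note:** The toy rows, the `split_by_sale_date()` helper and the `VALID_START` cutoff below are illustrative assumptions (they are not part of the Kaggle data or the rest of this notebook). Treat this as a sketch of date-based splitting rather than the exact competition split."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import pandas as pd\n",
- "\n",
- "# Toy stand-in for TrainAndValid.csv (made-up rows, for illustration only)\n",
- "toy_df = pd.DataFrame({\n",
- "    'SalesID': [1, 2, 3, 4],\n",
- "    'saledate': pd.to_datetime(['2010-06-01', '2011-12-31', '2012-01-01', '2012-04-30']),\n",
- "    'SalePrice': [10000, 20000, 30000, 40000]\n",
- "})\n",
- "\n",
- "# Assumed cutoff: validation samples start on January 1, 2012\n",
- "VALID_START = pd.Timestamp('2012-01-01')\n",
- "\n",
- "def split_by_sale_date(df, date_column='saledate', valid_start=VALID_START):\n",
- "    # Sort by date, then send rows before the cutoff to train and the rest to validation\n",
- "    df = df.sort_values(by=date_column, ascending=True)\n",
- "    train_split = df[df[date_column] < valid_start]\n",
- "    valid_split = df[df[date_column] >= valid_start]\n",
- "    return train_split, valid_split\n",
- "\n",
- "toy_train_df, toy_valid_df = split_by_sale_date(df=toy_df)\n",
- "print(f'[INFO] Training samples: {len(toy_train_df)} | Validation samples: {len(toy_valid_df)}')"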
- ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Number of samples in training DataFrame: 401125\n", - "[INFO] Number of samples in validation DataFrame: 11573\n" - ] - } - ], - "source": [ - "# Import train samples (making sure to parse dates and then sort by them)\n", - "train_df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/Train.csv\",\n", - " parse_dates=[\"saledate\"],\n", - " low_memory=False).sort_values(by=\"saledate\", ascending=True)\n", - "\n", - "# Import validation samples (making sure to parse dates and then sort by them)\n", - "valid_df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/Valid.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "\n", - "# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv\n", - "valid_solution = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/ValidSolution.csv\")\n", - "\n", - "# Map valid_solution to valid_df\n", - "valid_df[\"SalePrice\"] = valid_df[\"SalesID\"].map(valid_solution.set_index(\"SalesID\")[\"SalePrice\"])\n", - "\n", - "# Make sure valid_df is sorted by saledate still\n", - "valid_df = valid_df.sort_values(\"saledate\", ascending=True).reset_index(drop=True)\n", - "\n", - "# How many samples are in each DataFrame?\n", - "print(f\"[INFO] Number of samples in training DataFrame: {len(train_df)}\")\n", - "print(f\"[INFO] Number of samples in validation DataFrame: {len(valid_df)}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "118276 1457333 58000 1434146 4147 132 1.0 \n", - "149220 1522457 37000 1473616 4199 132 2.0 \n", - "118159 1457054 19250 1503681 4147 132 99.0 \n", - "28240 1258591 8250 1459934 6788 132 6.0 \n", - "396348 6282780 41600 1916198 14272 149 99.0 \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand saledate ... \\\n", - "118276 1980 NaN NaN 1990-05-03 ... \n", - "149220 1992 NaN NaN 1996-11-16 ... \n", - "118159 1979 NaN NaN 2009-10-01 ... \n", - "28240 1971 NaN NaN 1990-04-18 ... \n", - "396348 2003 NaN NaN 2011-11-09 ... \n", - "\n", - " Undercarriage_Pad_Width Stick_Length Thumb \\\n", - "118276 NaN NaN NaN \n", - "149220 None or Unspecified None or Unspecified None or Unspecified \n", - "118159 NaN NaN NaN \n", - "28240 NaN NaN NaN \n", - "396348 None or Unspecified None or Unspecified None or Unspecified \n", - "\n", - " Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type \\\n", - "118276 NaN NaN None or Unspecified Straight \n", - "149220 None or Unspecified Double NaN NaN \n", - "118159 NaN NaN None or Unspecified Semi U \n", - "28240 NaN NaN NaN NaN \n", - "396348 None or Unspecified Double NaN NaN \n", - "\n", - " Travel_Controls Differential_Type Steering_Controls \n", - "118276 None or Unspecified NaN NaN \n", - "149220 NaN NaN NaN \n", - "118159 None or Unspecified NaN NaN \n", - "28240 NaN NaN NaN \n", - "396348 NaN NaN NaN \n", - "\n", - "[5 rows x 53 columns]" - ] - }, - "execution_count": 72, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Let's check out the training DataFrame\n", - "train_df.sample(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "7504 4325208 2306826 4840 172 1 1987 \n", - "4853 6282223 1482307 28842 149 0 2004 \n", - "28 6269818 1791122 7257 149 99 1978 \n", - "7451 1226478 205898 1269 121 3 2004 \n", - "693 1223479 143807 3538 121 3 1997 \n", - "\n", - " MachineHoursCurrentMeter UsageBand saledate fiModelDesc ... \\\n", - "7504 9540.0 Low 2012-03-22 850B ... \n", - "4853 NaN NaN 2012-02-23 80C ... \n", - "28 48.0 Low 2012-01-11 910 ... \n", - "7451 9333.0 Medium 2012-03-22 330CL ... \n", - "693 2154.0 Low 2012-01-27 416C ... \n", - "\n", - " Stick_Length Thumb Pattern_Changer \\\n", - "7504 NaN NaN NaN \n", - "4853 None or Unspecified None or Unspecified None or Unspecified \n", - "28 NaN NaN NaN \n", - "7451 None or Unspecified Hydraulic Yes \n", - "693 NaN NaN NaN \n", - "\n", - " Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls \\\n", - "7504 NaN None or Unspecified Straight None or Unspecified \n", - "4853 Double NaN NaN NaN \n", - "28 NaN NaN NaN NaN \n", - "7451 Double NaN NaN NaN \n", - "693 NaN NaN NaN NaN \n", - "\n", - " Differential_Type Steering_Controls SalePrice \n", - "7504 NaN NaN 10000.0 \n", - "4853 NaN NaN 27000.0 \n", - "28 Standard Conventional 10600.0 \n", - "7451 NaN NaN 90000.0 \n", - "693 NaN NaN 19500.0 \n", - "\n", - "[5 rows x 53 columns]" - ] - }, - "execution_count": 73, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# And how about the validation DataFrame?\n", - "valid_df.sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice! \n", - "\n", - "We've now got separate training and validation datasets imported.\n", - "\n", - "In a previous section, we created a function to decompose the `saledate` column into multiple features such as `saleYear`, `saleMonth`, `saleDay` and more.\n", - "\n", - "Let's now replicate that function here and apply it to our `train_df` and `valid_df`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 74, - "metadata": {}, - "outputs": [], - "source": [ - "# Make a function to add date columns\n", - "def add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - "    # Add datetime features derived from the date column\n", - "    df[\"saleYear\"] = df[date_column].dt.year\n", - "    df[\"saleMonth\"] = df[date_column].dt.month\n", - "    df[\"saleDay\"] = df[date_column].dt.day\n", - "    df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - "    df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - "\n", - "    # Drop the original date column (use date_column rather than a hard-coded name)\n", - "    df.drop(date_column, axis=1, inplace=True)\n", - "\n", - "    return df\n", - "\n", - "train_df = add_datetime_features_to_df(df=train_df)\n", - "valid_df = add_datetime_features_to_df(df=valid_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful, now let's make sure it worked by inspecting the last 5 columns of `train_df`." - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " saleYear saleMonth saleDay saleDayofweek saleDayofyear\n", - "319998 2010 4 22 3 112\n", - "133509 2003 2 20 3 51\n", - "291200 2008 5 23 4 144\n", - "280146 2006 3 18 5 77\n", - "335509 2008 4 29 1 120" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Display the last 5 columns (the recently added datetime breakdowns)\n", - "train_df.iloc[:, -5:].sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Perfect! How about we try and fit a model?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.1 Trying to fit a model on our training data\n", - "\n", - "I'm a big fan of trying to fit a model on your dataset as early as possible.\n", - "\n", - "If it works, you'll have to inspect and check its results.\n", - "\n", - "And if it doesn't work, you'll get some insights into what you may have to do to your dataset to prepare it.\n", - "\n", - "Let's turn our DataFrames into features (`X`) by dropping the `SalePrice` column (this is the value we're trying to predict) and labels (`y`) by extracting the `SalePrice` column.\n", - "\n", - "Then we'll create a model using `sklearn.ensemble.RandomForestRegressor` and finally we'll try to fit it to only the training data." 
- ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "could not convert string to float: 'Medium'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/150598518.py\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 9\u001b[0m \u001b[0;31m# Create a model\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;31m# Fit a model to the training data only\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 13\u001b[0;31m model.fit(X=X_train,\n\u001b[0m\u001b[1;32m 14\u001b[0m y=y_train)\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1469\u001b[0m skip_parameter_validation=(\n\u001b[1;32m 1470\u001b[0m \u001b[0mprefer_skip_nested_validation\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mglobal_skip_validation\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1471\u001b[0m )\n\u001b[1;32m 1472\u001b[0m ):\n\u001b[0;32m-> 1473\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfit_method\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0;31m# Validate or convert input data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 360\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 361\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"sparse multilabel-indicator for y is not supported.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 362\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 363\u001b[0;31m X, y = self._validate_data(\n\u001b[0m\u001b[1;32m 364\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 365\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 366\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 646\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"estimator\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mcheck_y_params\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 647\u001b[0m \u001b[0mcheck_y_params\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0;34m{\u001b[0m\u001b[0;34m**\u001b[0m\u001b[0mdefault_check_params\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 648\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"y\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 649\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 650\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 651\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 652\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 653\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mcheck_params\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"ensure_2d\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, 
force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 1297\u001b[0m raise ValueError(\n\u001b[1;32m 1298\u001b[0m \u001b[0;34mf\"{estimator_name} requires y to be passed, but the target y is None\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1299\u001b[0m )\n\u001b[1;32m 1300\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1301\u001b[0;31m X = check_array(\n\u001b[0m\u001b[1;32m 1302\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1303\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1304\u001b[0m \u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m )\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not supported\\n{}\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m ) from complex_warning\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[0;31m# Use NumPy API to support order\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 748\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 749\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 750\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 751\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 753\u001b[0m \u001b[0;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, dtype, copy)\u001b[0m\n\u001b[1;32m 2149\u001b[0m def __array__(\n\u001b[1;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2151\u001b[0m ) -> np.ndarray:\n\u001b[1;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2153\u001b[0;31m \u001b[0marr\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2154\u001b[0m if (\n\u001b[1;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2156\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0musing_copy_on_write\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Medium'" - ] - } - ], - "source": [ - "# Split training data into features and labels\n", - "X_train = train_df.drop(\"SalePrice\", axis=1)\n", - "y_train = train_df[\"SalePrice\"]\n", - "\n", - "# Split validation data into features and labels\n", - "X_valid = valid_df.drop(\"SalePrice\", axis=1)\n", - "y_valid = valid_df[\"SalePrice\"]\n", - "\n", - "# Create a model\n", - "model = RandomForestRegressor(n_jobs=-1)\n", - "\n", - "# Fit a model to the training data only\n", - "model.fit(X=X_train,\n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Oh no!\n", - "\n", - "We run into the error:\n", - "\n", - "> ValueError: could not convert string to float: 'Medium'\n", - "\n", - "Hmm... \n", - "\n", - "Where have we seen this error before?\n", - "\n", - "It looks like since we re-imported our training dataset (from `Train.csv`), it's no longer all numerical (hence the `ValueError` above).\n", - "\n", - "Not to worry, we can fix this!" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.2 Encoding categorical features as numbers using Scikit-Learn\n", - "\n", - "We've preprocessed our data previously with pandas.\n", - "\n", - "And while this is a viable approach, how about we practice using another method?\n", - "\n", - "This time we'll use Scikit-Learn's built-in preprocessing methods. \n", - "\n", - "Why?\n", - "\n", - "Because it's good exposure to different techniques.\n", - "\n", - "And Scikit-Learn has many helpful, well-tested built-in methods for preparing data. \n", - "\n", - "You can also string together many of these methods and create a [reusable pipeline](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html) (you can think of this pipeline as plumbing for data).\n", - "\n", - "To preprocess our data with Scikit-Learn, we'll first define the numerical and categorical features of our dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Numeric features: ['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'saleYear', 'saleMonth', 'saleDay', 'saleDayofweek', 'saleDayofyear']\n", - "[INFO] Categorical features: ['UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup']...\n" - ] - } - ], - "source": [ - "# Define numerical and categorical features (any non-numeric column is treated as categorical)\n", - "numerical_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]\n", - "categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]\n", - "\n", - "print(f\"[INFO] Numeric features: {numerical_features}\")\n", - "print(f\"[INFO] Categorical features: {categorical_features[:10]}...\")" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "We define our different feature types so we can use different preprocessing methods on each type.\n", - "\n", - "Scikit-Learn has many built-in methods for preprocessing data under the [`sklearn.preprocessing` module](https://scikit-learn.org/stable/api/sklearn.preprocessing.html#).\n", - "\n", - "And I'd encourage you to spend some time reading the [preprocessing data section of the Scikit-Learn user guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) for more details.\n", - "\n", - "For now, let's focus on turning our categorical features into numbers (from object/string datatype to numeric datatype).\n", - "\n", - "The practice of turning non-numerical features into numerical features is often referred to as [**encoding**](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).\n", - "\n", - "There are several encoders available for different use cases.\n", - "\n", - "| **Encoder** | **Description** | **Use case** | **For use on** |\n", - "|-------------|-----------------|--------------|----------------|\n", - "| [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) | Encode target labels with value between 0 and n_classes-1. | Useful for turning classification target values into numeric representations. | Target labels. |\n", - "| [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#onehotencoder) | Encode categorical features as a [one-hot numeric array](https://en.wikipedia.org/wiki/One-hot). | Creates one binary column per unique category, with a 1 where a sample belongs to that category and a 0 everywhere else. | Categorical variables/features. |\n", - "| [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#ordinalencoder) | Encode categorical features as an integer array. | Turns unique categorical values into a range of integers, for example, 0 maps to \"cat\", 1 maps to \"dog\" and so on. | Categorical variables/features. |\n", - "| [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#targetencoder) | Encode each category as a shrunk estimate of the average target value for observations belonging to that category (supports regression and classification targets). | Useful for categorical features with many unique values (high cardinality), where one-hot encoding would create too many columns. | Categorical variables/features. |\n", - "\n", - "For our case, we're going to start with `OrdinalEncoder`.\n", - "\n", - "When transforming/encoding values with Scikit-Learn, the steps are as follows:\n", - "\n", - "1. Instantiate an encoder, for example, `sklearn.preprocessing.OrdinalEncoder`.\n", - "2. Use the [`sklearn.preprocessing.OrdinalEncoder.fit`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.fit) method on the **training** data (this helps the encoder learn a mapping of categorical to numeric values).\n", - "3. Use the [`sklearn.preprocessing.OrdinalEncoder.transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.transform) method on the **training** data to apply the learned mapping from categorical to numeric values.\n", - " * **Note:** The [`sklearn.preprocessing.OrdinalEncoder.fit_transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.fit_transform) method combines steps 2 & 3 into a single method.\n", - "4. Apply the learned mapping to subsequent datasets such as **validation** and **test** splits using `sklearn.preprocessing.OrdinalEncoder.transform` only.\n", - "\n", - "Notice how the `fit` and `fit_transform` methods are reserved for the **training dataset only**.\n", - "\n", - "This is because in practice the validation and testing datasets are meant to be unseen, meaning only information from the training dataset should be used to preprocess the validation/test datasets.\n", - "\n", - "In short:\n", - "\n", - "1. Instantiate an encoder such as `sklearn.preprocessing.OrdinalEncoder`.\n", - "2. Fit the encoder to and transform the training dataset categorical variables/features with `sklearn.preprocessing.OrdinalEncoder.fit_transform`.\n", - "3. Transform categorical variables/features from subsequent datasets such as the validation and test datasets with the learned encoding from step 2 using `sklearn.preprocessing.OrdinalEncoder.transform`. \n", - " * **Note:** Notice the use of the `transform` method on validation/test datasets rather than `fit_transform`.\n", - "\n", - "Let's do it!\n", - "\n", - "We'll set up the `OrdinalEncoder` so that unknown categories (values not seen during `fit`) and missing values are encoded as `np.nan` (`NaN`).\n", - "\n", - "We'll also make sure to only use the `OrdinalEncoder` on the categorical features of our DataFrame.\n", - "\n", - "Finally, the `OrdinalEncoder` expects all input variables to be of the same type (e.g. either numeric only or string only) so we'll make sure all the input variables are strings only using [`pandas.DataFrame.astype(str)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html). A side effect of this conversion is that missing values become the string `'nan'`, which the encoder then treats as a category of its own." 
- ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn.preprocessing import OrdinalEncoder\n", - "\n", - "# 1. Create an ordinal encoder (turns category items into numeric representation)\n", - "ordinal_encoder = OrdinalEncoder(categories=\"auto\",\n", - " handle_unknown=\"use_encoded_value\",\n", - " unknown_value=np.nan,\n", - " encoded_missing_value=np.nan) # encode unknown categories and missing values as np.nan\n", - "\n", - "# 2. Fit and transform the categorical columns of X_train\n", - "X_train_preprocessed = X_train.copy() # make copies of the original DataFrames so we can keep the original values intact and view them later\n", - "X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str)) # OrdinalEncoder expects all values as the same type (e.g. string or numeric only)\n", - "\n", - "# 3. Transform the categorical columns of X_valid \n", - "X_valid_preprocessed = X_valid.copy()\n", - "X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str)) # only use `transform` on the validation data" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful! 
\n", - "\n", - "Let's see if it worked.\n", - "\n", - "First, we'll check out the original `X_train` DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "205615 1646770 1126363 8434 132 18.0 1974 \n", - "92803 1404019 1169900 7110 132 99.0 1986 \n", - "98346 1415646 1262088 3357 132 99.0 1975 \n", - "169297 1596358 1433229 8247 132 99.0 1978 \n", - "274835 1821514 1194089 10150 132 99.0 1980 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "205615 NaN NaN TD20 TD20 ... \n", - "92803 NaN NaN 416 416 ... \n", - "98346 NaN NaN 12G 12 ... \n", - "169297 NaN NaN 644 644 ... \n", - "274835 NaN NaN A66 A66 ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "205615 None or Unspecified Straight None or Unspecified NaN \n", - "92803 NaN NaN NaN NaN \n", - "98346 NaN NaN NaN NaN \n", - "169297 NaN NaN NaN Standard \n", - "274835 NaN NaN NaN Standard \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek \\\n", - "205615 NaN 1989 1 17 1 \n", - "92803 NaN 1989 1 31 1 \n", - "98346 NaN 1989 1 31 1 \n", - "169297 Conventional 1989 1 31 1 \n", - "274835 Conventional 1989 1 31 1 \n", - "\n", - " saleDayofyear \n", - "205615 17 \n", - "92803 31 \n", - "98346 31 \n", - "169297 31 \n", - "274835 31 \n", - "\n", - "[5 rows x 56 columns]" - ] - }, - "execution_count": 79, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And how about `X_train_preprocessed`?" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "205615 1646770 1126363 8434 132 18.0 1974 \n", - "92803 1404019 1169900 7110 132 99.0 1986 \n", - "98346 1415646 1262088 3357 132 99.0 1975 \n", - "169297 1596358 1433229 8247 132 99.0 1978 \n", - "274835 1821514 1194089 10150 132 99.0 1980 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "205615 NaN 3.0 4536.0 1734.0 ... \n", - "92803 NaN 3.0 734.0 242.0 ... \n", - "98346 NaN 3.0 81.0 18.0 ... \n", - "169297 NaN 3.0 1157.0 348.0 ... \n", - "274835 NaN 3.0 1799.0 556.0 ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "205615 0.0 7.0 5.0 4.0 \n", - "92803 2.0 10.0 7.0 4.0 \n", - "98346 2.0 10.0 7.0 4.0 \n", - "169297 2.0 10.0 7.0 3.0 \n", - "274835 2.0 10.0 7.0 3.0 \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek \\\n", - "205615 5.0 1989 1 17 1 \n", - "92803 5.0 1989 1 31 1 \n", - "98346 5.0 1989 1 31 1 \n", - "169297 1.0 1989 1 31 1 \n", - "274835 1.0 1989 1 31 1 \n", - "\n", - " saleDayofyear \n", - "205615 17 \n", - "92803 31 \n", - "98346 31 \n", - "169297 31 \n", - "274835 31 \n", - "\n", - "[5 rows x 56 columns]" - ] - }, - "execution_count": 80, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train_preprocessed.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "Notice all of the non-numerical values in `X_train` have been converted to numerical values in `X_train_preprocessed`.\n", - "\n", - "Now how about missing values?\n", - "\n", - "Let's see the top 10 columns with the highest number of missing values from `X_train`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Engine_Horsepower 375906\n", - "Blade_Extension 375906\n", - "Tip_Control 375906\n", - "Pushblock 375906\n", - "Enclosure_Type 375906\n", - "Blade_Width 375906\n", - "Scarifier 375895\n", - "Hydraulics_Flow 357763\n", - "Grouser_Tracks 357763\n", - "Coupler_System 357667\n", - "dtype: int64" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train[categorical_features].isna().sum().sort_values(ascending=False)[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, plenty of missing values.\n", - "\n", - "How about `X_train_preprocessed`?" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "UsageBand 0\n", - "fiModelDesc 0\n", - "Pushblock 0\n", - "Ripper 0\n", - "Scarifier 0\n", - "Tip_Control 0\n", - "Tire_Size 0\n", - "Coupler 0\n", - "Coupler_System 0\n", - "Grouser_Tracks 0\n", - "dtype: int64" - ] - }, - "execution_count": 82, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train_preprocessed[categorical_features].isna().sum().sort_values(ascending=False)[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Perfect! No missing values either!\n", - "\n", - "This is because we converted the categorical columns to strings with `astype(str)` before encoding, so each missing value became the string `'nan'` and was encoded as its own category.\n", - "\n", - "Now, what if we wanted to retrieve the original categorical values?\n", - "\n", - "We can do so using the [`OrdinalEncoder.categories_` attribute](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.fit_transform).\n", - "\n", - "This will return the categories of each feature found during `fit` (or during `fit_transform`). The categories will be in the order of the features seen (same order as the columns of the DataFrame)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 83, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[array(['High', 'Low', 'Medium', 'nan'], dtype=object),\n", - " array(['100C', '104', '1066', ..., 'ZX800LC', 'ZX80LCK', 'ZX850H'],\n", - " dtype=object),\n", - " array(['10', '100', '104', ..., 'ZX80', 'ZX800', 'ZX850'], dtype=object)]" - ] - }, - "execution_count": 83, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Let's inspect the first three categories\n", - "ordinal_encoder.categories_[:3]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Since these come in the order of the features seen, we can create a mapping of these using the categorical column names of our DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 84, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{0: 'High', 1: 'Low', 2: 'Medium', 3: 'nan'}" - ] - }, - "execution_count": 84, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create a dictionary of dictionaries mapping column names and their variables to their numerical encoding\n", - "column_to_category_mapping = {}\n", - "\n", - "for column_name, category_values in zip(categorical_features, ordinal_encoder.categories_):\n", - " int_to_category = {i: category for i, category in enumerate(category_values)}\n", - " column_to_category_mapping[column_name] = int_to_category\n", - "\n", - "# Inspect an example column name to category mapping\n", - "column_to_category_mapping[\"UsageBand\"]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also reverse our `OrdinalEncoder` values with the [`inverse_transform()`](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.inverse_transform) method.\n", - "\n", - "This is helpful for reversing a preprocessing step or viewing the original data again 
if necessary." - ] - }, - { - "cell_type": "code", - "execution_count": 85, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries \\\n", - "214315 nan 160CLC 160 C nan \n", - "96782 nan D4H D4 H nan \n", - "224604 nan 140G 140 G nan \n", - "310524 High 966E 966 E nan \n", - "156716 nan 650H 650 H nan \n", - "\n", - " fiModelDescriptor ProductSize \\\n", - "214315 LC Small \n", - "96782 nan nan \n", - "224604 nan nan \n", - "310524 nan Medium \n", - "156716 nan nan \n", - "\n", - " fiProductClassDesc state \\\n", - "214315 Hydraulic Excavator, Track - 14.0 to 16.0 Metr... Alabama \n", - "96782 Track Type Tractor, Dozer - 85.0 to 105.0 Hors... California \n", - "224604 Motorgrader - 145.0 to 170.0 Horsepower Missouri \n", - "310524 Wheel Loader - 200.0 to 225.0 Horsepower Michigan \n", - "156716 Track Type Tractor, Dozer - 85.0 to 105.0 Hors... Florida \n", - "\n", - " ProductGroup ... Undercarriage_Pad_Width Stick_Length \\\n", - "214315 TEX ... 28 inch None or Unspecified \n", - "96782 TTT ... nan nan \n", - "224604 MG ... nan nan \n", - "310524 WL ... nan nan \n", - "156716 TTT ... 
nan nan \n", - "\n", - " Thumb Pattern_Changer Grouser_Type \\\n", - "214315 None or Unspecified None or Unspecified Triple \n", - "96782 nan nan nan \n", - "224604 nan nan nan \n", - "310524 nan nan nan \n", - "156716 nan nan nan \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "214315 nan nan nan nan \n", - "96782 None or Unspecified PAT None or Unspecified nan \n", - "224604 nan nan nan nan \n", - "310524 nan nan nan Standard \n", - "156716 None or Unspecified PAT None or Unspecified nan \n", - "\n", - " Steering_Controls \n", - "214315 nan \n", - "96782 nan \n", - "224604 nan \n", - "310524 Conventional \n", - "156716 nan \n", - "\n", - "[5 rows x 44 columns]" - ] - }, - "execution_count": 85, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create a copy of the preprocessed DataFrame\n", - "X_train_unprocessed = X_train_preprocessed[categorical_features].copy()\n", - "\n", - "# This will return an array of the original untransformed data\n", - "X_train_unprocessed = ordinal_encoder.inverse_transform(X_train_unprocessed)\n", - "\n", - "# Turn back into a DataFrame for viewing pleasure\n", - "X_train_unprocessed_df = pd.DataFrame(X_train_unprocessed, columns=categorical_features)\n", - "\n", - "# Check out a sample\n", - "X_train_unprocessed_df.sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "Now how about we try fitting a model again?" 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3.3 Fitting a model to our preprocessed training data\n", - "\n", - "We've used Scikit-Learn to convert the categorical data in our training and validation sets into numbers.\n", - "\n", - "But we haven't yet done anything with missing numerical values.\n", - "\n", - "As it turns out, we can still try and fit a model.\n", - "\n", - "Why?\n", - "\n", - "Because there are several estimators/models in Scikit-Learn that can handle missing (`NaN`) values.\n", - "\n", - "And our trusty `sklearn.ensemble.RandomForestRegressor` is one of them!\n", - "\n", - "Let's try it out on our `X_train_preprocessed` DataFrame.\n", - "\n", - "> **Note:** For a list of all Scikit-Learn estimators that can handle `NaN` values, check out the [Scikit-Learn imputation of missing values user guide](https://scikit-learn.org/1.5/modules/impute.html#estimators-that-handle-nan-values). " - ] - }, - { - "cell_type": "code", - "execution_count": 86, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 8min 54s, sys: 6min 26s, total: 15min 20s\n", - "Wall time: 2min 40s\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - "RandomForestRegressor(n_jobs=-1)" - ] - }, - "execution_count": 86, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Instantiate a Random Forest Regression model\n", - "model = RandomForestRegressor(n_jobs=-1)\n", - "\n", - "# Fit the model to the preprocessed training data\n", - "model.fit(X=X_train_preprocessed,\n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It worked!\n", - "\n", - "Now you might be thinking, \"well if we could fit a model on a dataset with missing values, why did we bother filling them before?\"\n", - "\n", - "And that's a great question.\n", - "\n", - "The main reason is to *practice, practice, practice*. \n", - "\n", - "While there are some models which can handle missing values, others can't.\n", - "\n", - "So it's good to have experience with both of these scenarios.\n", - "\n", - "Let's see how our model scores on the validation set, data our model has never seen." 
- ] - }, - { - "cell_type": "code", - "execution_count": 88, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "could not convert string to float: 'Low'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;34m'Could not get source, probably due dynamically evaluated source code.'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 844\u001b[0m \"\"\"\n\u001b[1;32m 845\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 846\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mmetrics\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mr2_score\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 847\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 848\u001b[0;31m \u001b[0my_pred\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 849\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mr2_score\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_pred\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msample_weight\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msample_weight\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mpredicted\u001b[0m 
\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1060\u001b[0m \"\"\"\n\u001b[1;32m 1061\u001b[0m \u001b[0mcheck_is_fitted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;31m# Check data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1063\u001b[0;31m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_validate_X_predict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1064\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1065\u001b[0m \u001b[0;31m# Assign chunk of trees to jobs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1066\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_partition_estimators\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_estimators\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 637\u001b[0m \u001b[0mforce_all_finite\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"allow-nan\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 638\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 639\u001b[0m \u001b[0mforce_all_finite\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 640\u001b[0m 
\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 641\u001b[0;31m X = self._validate_data(\n\u001b[0m\u001b[1;32m 642\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 643\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mDTYPE\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 644\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"csr\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 630\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 631\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 632\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 633\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"X\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 634\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m 
\u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 635\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_check_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m )\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m 
\u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not supported\\n{}\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m ) from complex_warning\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[0;31m# Use NumPy API to support order\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 748\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 749\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 750\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 751\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 753\u001b[0m \u001b[0;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, dtype, copy)\u001b[0m\n\u001b[1;32m 2149\u001b[0m def __array__(\n\u001b[1;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2151\u001b[0m ) -> np.ndarray:\n\u001b[1;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2153\u001b[0;31m \u001b[0marr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2154\u001b[0m if (\n\u001b[1;32m 2155\u001b[0m 
\u001b[0mastype_is_view\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2156\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0musing_copy_on_write\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Low'" - ] - } - ], - "source": [ - "%%time\n", - "\n", - "# Check model performance on the validation set\n", - "model.score(X=X_valid,\n", - " y=y_valid)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Oops!\n", - "\n", - "Looks like we get an error:\n", - "\n", - "> `ValueError: could not convert string to float: 'Low'`\n", - "\n", - "This is because we tried to evaluate our model on the original `X_valid` dataset which still contains strings rather than `X_valid_preprocessed` which contains all numerical values.\n", - "\n", - "As we've discussed before, in machine learning problems, it's important to **evaluate your models on data in the same format as they were trained on**.\n", - "\n", - "Knowing this, let's evaluate our model on our preprocessed validation dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": 89, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 766 ms, sys: 3.54 s, total: 4.3 s\n", - "Wall time: 1.27 s\n" - ] - }, - { - "data": { - "text/plain": [ - "0.8700295442271035" - ] - }, - "execution_count": 89, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Check model performance on the validation set\n", - "model.score(X=X_valid_preprocessed,\n", - " y=y_valid)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent!\n", - "\n", - "Now you might be wondering why this score ($R^2$ or R-squared by default) is lower than the previous score of ~0.9875.\n", - "\n", - "That's because this score is based on a model that has only seen the training data and is being evaluated on an unseen dataset (training on `Train.csv`, evaluating on `Valid.csv`).\n", - "\n", - "Our previous score was from a model that had all of the evaluation samples in the training data (training and evaluating on `TrainAndValid.csv`).\n", - "\n", - "So in practice, we would consider the most recent score as a much more reliable metric of how well our model might perform on future unseen data.\n", - "\n", - "Just for fun, let's see how our model scores on the training dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": 90, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 17.6 s, sys: 19.2 s, total: 36.8 s\n", - "Wall time: 7.42 s\n" - ] - }, - { - "data": { - "text/plain": [ - "0.9872786621410867" - ] - }, - "execution_count": 90, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Check model performance on the training set\n", - "model.score(X=X_train_preprocessed,\n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As expected, our model performs better on the training set than on the validation set.\n", - "\n", - "It also scores much closer to the previous score of ~0.9875 we obtained when training and scoring on `TrainAndValid.csv` combined.\n", - "\n", - "> **Note:** It is common to see a model perform slightly worse on a validation/testing dataset than on a training set. This is because the model has seen all of the examples in the training set, whereas, if done correctly, the validation and test sets are kept separate during training. So you would expect a model to do better on problems that it has seen before versus problems it hasn't. If you find your model scoring much higher on unseen data versus seen data (e.g. higher scores on the test set compared to the training set), you might want to inspect your data to make sure there isn't any leakage from the validation/test set into the training set." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. 
Building an evaluation function\n", - "\n", - "Evaluating a machine learning model is just as important as training one.\n", - "\n", - "Because of this, let's create an evaluation function to make evaluation faster and more reproducible.\n", - "\n", - "According to Kaggle, [the evaluation function](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation) used for the Bluebook for Bulldozers competition is root mean squared log error (RMSLE).\n", - "\n", - "$$ \\text{RMSLE} = \\sqrt{\\frac{1}{n} \\sum_{i=1}^{n} \\left( \\log(1 + \\hat{y}_i) - \\log(1 + y_i) \\right)^2} $$\n", - "\n", - "Where:\n", - "\n", - "* $ \\hat{y}_i $ is the predicted value, \n", - "* $ y_i $ is the actual value, \n", - "* $ n $ is the number of observations.\n", - "\n", - "Contrast this with mean absolute error (MAE), another common regression metric.\n", - "\n", - "$$ \\text{MAE} = \\frac{1}{n} \\sum_{i=1}^{n} \\left| \\hat{y}_i - y_i \\right| $$\n", - "\n", - "With RMSLE, the relative error is more meaningful than the absolute error. You care more about ratios than absolute errors. For example, being off by $100 on a $1000 prediction (10% error) is more significant than being off by $100 on a $10,000 prediction (1% error). 
RMSLE is sensitive to large percentage errors.\n", - "\n", - "MAE, on the other hand, is about absolute differences: a $100 prediction error is weighted the same regardless of the actual value.\n", - "\n", - "In each case, a lower value (closer to 0) is better.\n", - "\n", - "For any problem, it's important to define the evaluation metric you're going to try and improve on.\n", - "\n", - "In our case, let's create a function that calculates multiple evaluation metrics.\n", - "\n", - "Namely, we'll use:\n", - "\n", - "* MAE (mean absolute error) via [`sklearn.metrics.mean_absolute_error`](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_absolute_error.html) - lower is better.\n", - "* RMSLE (root mean squared log error) via [`sklearn.metrics.root_mean_squared_log_error`](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.root_mean_squared_log_error.html) - lower is better.\n", - "* $R^2$ (R-squared or coefficient of determination) via the [`score` method](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.score) - higher is better.\n", - "\n", - "For MAE and RMSLE we'll be comparing the model's predictions to the truth labels.\n", - "\n", - "We can get an array of predicted values from our model using [`model.predict(X=features_to_predict_on)`](https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.predict). 
" - ] - }, - { - "cell_type": "code", - "execution_count": 93, - "metadata": {}, - "outputs": [], - "source": [ - "# Create evaluation function (the competition uses Root Mean Square Log Error)\n", - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "\n", - "# Create function to evaluate our model\n", - "def show_scores(model, \n", - " train_features=X_train_preprocessed,\n", - " train_labels=y_train,\n", - " valid_features=X_valid_preprocessed,\n", - " valid_labels=y_valid):\n", - " \n", - " # Make predictions on train and validation features\n", - " train_preds = model.predict(X=train_features)\n", - " val_preds = model.predict(X=valid_features)\n", - "\n", - " # Create a scores dictionary of different evaluation metrics\n", - " scores = {\"Training MAE\": mean_absolute_error(y_true=train_labels, \n", - " y_pred=train_preds),\n", - " \"Valid MAE\": mean_absolute_error(y_true=valid_labels, \n", - " y_pred=val_preds),\n", - " \"Training RMSLE\": root_mean_squared_log_error(y_true=train_labels, \n", - " y_pred=train_preds),\n", - " \"Valid RMSLE\": root_mean_squared_log_error(y_true=valid_labels, \n", - " y_pred=val_preds),\n", - " \"Training R^2\": model.score(X=train_features, \n", - " y=train_labels),\n", - " \"Valid R^2\": model.score(X=valid_features, \n", - " y=valid_labels)}\n", - " return scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that's a nice looking function!\n", - "\n", - "How about we test it out?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 94, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1596.4113176025767),\n", - " 'Valid MAE': np.float64(6172.124644142976),\n", - " 'Training RMSLE': np.float64(0.08546822305943352),\n", - " 'Valid RMSLE': np.float64(0.2576977236694938),\n", - " 'Training R^2': 0.9872786621410867,\n", - " 'Valid R^2': 0.8700295442271035}" - ] - }, - "execution_count": 94, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Try our model scoring function out\n", - "model_scores = show_scores(model=model)\n", - "model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "Now we can reuse this in the future for evaluating other models." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Tuning our model's hyperparameters\n", - "\n", - "Hyperparameters are the settings we can change on our model.\n", - "\n", - "And tuning hyperparameters on a given model can often alter its performance on a given dataset.\n", - "\n", - "Ideally, changing hyperparameters would lead to better results.\n", - "\n", - "However, it's often hard to know what hyperparameter changes would improve a model ahead of time.\n", - "\n", - "So what we can do is run several experiments across various different hyperparameter settings and record which lead to the best results." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.1 Making our modelling experiments faster (to speed up hyperparameter tuning)\n", - "\n", - "Because of the size of our dataset (~400,000 rows), retraining an entire model (about 1-1.5 minutes on my MacBook Pro M1 Pro) for each new set of hyperparameters would take far too long for us to keep experimenting as fast as we want to.\n", - "\n", - "So what we'll do is take a sample of the training set and tune the hyperparameters on that before training a larger model.\n", - "\n", - "> **Note:** If your experiments are taking longer than 10 seconds (or far longer than what you can interact with), you should be trying to speed things up. You can speed experiments up by sampling less data, using a faster computer or using a smaller model.\n", - "\n", - "We can take an artificial sample of the training set by altering the number of samples seen by each `n_estimator` (an `n_estimator` is a [decision tree](https://en.wikipedia.org/wiki/Decision_tree_learning) a random forest will create during training; more trees generally lead to better performance but cost more compute time) in [`sklearn.ensemble.RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) using the `max_samples` parameter.\n", - "\n", - "For example, setting `max_samples` to 10,000 means every `n_estimator` (default 100) in our `RandomForestRegressor` will only see 10,000 random samples from our DataFrame instead of the entire ~400,000.\n", - "\n", - "In other words, we'll be looking at 40x fewer samples, which means we should get faster computation speeds, but we should also expect our results to worsen (because the model has fewer samples to learn patterns from).\n", - "\n", - "Let's see if reducing the number of samples speeds up our modelling time." 
- ] - }, - { - "cell_type": "code", - "execution_count": 95, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 19.2 s, sys: 18.5 s, total: 37.6 s\n", - "Wall time: 7.53 s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_samples=10000, n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_samples=10000, n_jobs=-1)" - ] - }, - "execution_count": 95, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Change max samples in RandomForestRegressor\n", - "model = RandomForestRegressor(n_estimators=100, # this is the default\n", - " n_jobs=-1,\n", - " max_samples=10000) # each estimator sees max_samples (the default is to see all available samples)\n", - "\n", - "# Cutting down the max number of samples each tree can see improves training time\n", - "model.fit(X_train_preprocessed, \n", - " y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice! That worked much faster than training on the whole dataset.\n", - "\n", - "Let's evaluate our model with our `show_scores` function." - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(5605.344206319725),\n", - " 'Valid MAE': np.float64(7176.1651147786515),\n", - " 'Training RMSLE': np.float64(0.26030112528907273),\n", - " 'Valid RMSLE': np.float64(0.2935839690284876),\n", - " 'Training R^2': 0.858111849057448,\n", - " 'Valid R^2': 0.828549722372896}" - ] - }, - "execution_count": 96, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get evaluation metrics from reduced sample model\n", - "base_model_scores = show_scores(model=model)\n", - "base_model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent! Even though our new model saw far less data than the previous model, it still looks to be performing quite well.\n", - "\n", - "With this faster model, we can start to run a series of different hyperparameter experiments." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.2 Hyperparameter tuning with RandomizedSearchCV\n", - "\n", - "The goal of hyperparameter tuning is to find values for our model's settings which lead to better results.\n", - "\n", - "We could sit there and do this by hand, adjusting parameters on `sklearn.ensemble.RandomForestRegressor` such as `n_estimators`, `max_depth`, `min_samples_split` and more.\n", - "\n", - "However, this would be quite tedious.\n", - "\n", - "Instead, we can define a dictionary of hyperparameter settings in the form `{\"hyperparameter_name\": [values_to_test]}` and then use [`sklearn.model_selection.RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#randomizedsearchcv) (randomly search for the best combination of hyperparameters) or [`sklearn.model_selection.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#gridsearchcv) (exhaustively search for the best combination of hyperparameters) to go through these settings for us on a given model and dataset and then record which perform best.\n", - "\n", - "A general workflow is to start with a large number and wide range of potential settings and use `RandomizedSearchCV` to search across these randomly for a limited number of iterations (e.g. `n_iter=10`).\n", - "\n", - "And then take the best results and narrow the search space down before exhaustively searching for the best hyperparameters with `GridSearchCV`.\n", - "\n", - "Let's start trying to find better hyperparameters with the following steps:\n", - "\n", - "1. Define a dictionary of hyperparameter values for our `RandomForestRegressor` model. We'll keep `max_samples=10000` so our experiments run faster.\n", - "2. Set up an instance of `RandomizedSearchCV` to explore the parameter values defined in step 1. 
We can adjust how many sets of hyperparameters get tried using the `n_iter` parameter as well as how many cross-validation folds each set is evaluated on using the `cv` parameter. For example, setting `n_iter=20` and `cv=3` means there will be 3 cross-validation folds for each of the 20 different combinations of hyperparameters, a total of 60 (3*20) experiments.\n", - "3. Fit the instance of `RandomizedSearchCV` to the data. This will automatically go through the defined number of iterations and record the results for each. By default, the best combination is then refit on the whole training set so it's ready to use at the end.\n", - "\n", - "> **Note:** You can read more about the [tuning of hyperparameters of an estimator/model in the Scikit-Learn user guide](https://scikit-learn.org/stable/modules/grid_search.html#tuning-the-hyper-parameters-of-an-estimator). " - ] - }, - { - "cell_type": "code", - "execution_count": 97, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fitting 3 folds for each of 20 candidates, totalling 60 fits\n", - "[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.539 total time= 21.6s\n", - "[CV 2/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.720 total time= 23.0s\n", - "[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=1, min_samples_split=3, n_estimators=160;, score=0.596 total time= 22.2s\n", - "[CV 1/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.491 total time= 3.3s\n", - "[CV 2/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.655 total time= 3.4s\n", - "[CV 3/3] END max_depth=10, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=60;, score=0.614 total time= 3.3s\n", - 
"[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.520 total time= 6.7s\n", - "[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.702 total time= 6.6s\n", - "[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=8, n_estimators=130;, score=0.636 total time= 6.6s\n", - "[CV 1/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.512 total time= 3.0s\n", - "[CV 2/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.703 total time= 2.7s\n", - "[CV 3/3] END max_depth=20, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=9, n_estimators=30;, score=0.636 total time= 2.7s\n", - "[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.541 total time= 9.9s\n", - "[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.745 total time= 11.1s\n", - "[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=3, n_estimators=100;, score=0.632 total time= 10.2s\n", - "[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.529 total time= 6.1s\n", - "[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.713 total time= 5.5s\n", - "[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=3, min_samples_split=8, n_estimators=50;, score=0.625 total time= 5.2s\n", - "[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, 
min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.532 total time= 13.9s\n", - "[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.712 total time= 14.6s\n", - "[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=4, min_samples_split=6, n_estimators=170;, score=0.631 total time= 14.2s\n", - "[CV 1/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.545 total time= 6.6s\n", - "[CV 2/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.767 total time= 6.5s\n", - "[CV 3/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=5, n_estimators=40;, score=0.619 total time= 6.2s\n", - "[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.518 total time= 6.0s\n", - "[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.703 total time= 6.4s\n", - "[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=120;, score=0.637 total time= 6.2s\n", - "[CV 1/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.533 total time= 10.7s\n", - "[CV 2/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.708 total time= 13.7s\n", - "[CV 3/3] END max_depth=10, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=9, n_estimators=120;, score=0.628 total time= 10.5s\n", - "[CV 1/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, 
score=0.542 total time= 20.0s\n", - "[CV 2/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, score=0.756 total time= 19.9s\n", - "[CV 3/3] END max_depth=20, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=7, n_estimators=90;, score=0.611 total time= 17.1s\n", - "[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.524 total time= 9.1s\n", - "[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.706 total time= 8.7s\n", - "[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=8, min_samples_split=8, n_estimators=190;, score=0.637 total time= 8.6s\n", - "[CV 1/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.539 total time= 16.0s\n", - "[CV 2/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.754 total time= 14.8s\n", - "[CV 3/3] END max_depth=None, max_features=1.0, max_samples=10000, min_samples_leaf=5, min_samples_split=8, n_estimators=70;, score=0.615 total time= 14.2s\n", - "[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.548 total time= 9.3s\n", - "[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.766 total time= 8.6s\n", - "[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=1, min_samples_split=9, n_estimators=60;, score=0.623 total time= 9.0s\n", - "[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.535 total time= 23.8s\n", - "[CV 2/3] END 
max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.715 total time= 25.5s\n", - "[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=170;, score=0.595 total time= 27.2s\n", - "[CV 1/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.544 total time= 12.0s\n", - "[CV 2/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.764 total time= 11.5s\n", - "[CV 3/3] END max_depth=20, max_features=0.5, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=80;, score=0.642 total time= 10.4s\n", - "[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.538 total time= 8.2s\n", - "[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.752 total time= 8.4s\n", - "[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=6, min_samples_split=7, n_estimators=70;, score=0.640 total time= 8.5s\n", - "[CV 1/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.537 total time= 9.8s\n", - "[CV 2/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.747 total time= 9.5s\n", - "[CV 3/3] END max_depth=None, max_features=0.5, max_samples=10000, min_samples_leaf=9, min_samples_split=5, n_estimators=90;, score=0.630 total time= 9.7s\n", - "[CV 1/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=8, n_estimators=180;, score=0.536 total time= 28.3s\n", - "[CV 2/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, 
min_samples_split=8, n_estimators=180;, score=0.721 total time= 28.6s\n", - "[CV 3/3] END max_depth=10, max_features=1.0, max_samples=10000, min_samples_leaf=2, min_samples_split=8, n_estimators=180;, score=0.597 total time= 28.1s\n", - "[CV 1/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.539 total time= 9.0s\n", - "[CV 2/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.733 total time= 10.7s\n", - "[CV 3/3] END max_depth=None, max_features=sqrt, max_samples=10000, min_samples_leaf=2, min_samples_split=3, n_estimators=150;, score=0.643 total time= 8.9s\n", - "CPU times: user 8min 6s, sys: 25min 58s, total: 34min 4s\n", - "Wall time: 11min 54s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=20,\n",
-       "                   param_distributions={'max_depth': [None, 10, 20],\n",
-       "                                        'max_features': [0.5, 1.0, 'sqrt'],\n",
-       "                                        'max_samples': [10000],\n",
-       "                                        'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),\n",
-       "                                        'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]),\n",
-       "                                        'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,\n",
-       "       140, 150, 160, 170, 180, 190])},\n",
-       "                   verbose=3)
" - ], - "text/plain": [ - "RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=20,\n", - " param_distributions={'max_depth': [None, 10, 20],\n", - " 'max_features': [0.5, 1.0, 'sqrt'],\n", - " 'max_samples': [10000],\n", - " 'min_samples_leaf': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),\n", - " 'min_samples_split': array([2, 3, 4, 5, 6, 7, 8, 9]),\n", - " 'n_estimators': array([ 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130,\n", - " 140, 150, 160, 170, 180, 190])},\n", - " verbose=3)" - ] - }, - "execution_count": 97, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "from sklearn.model_selection import RandomizedSearchCV\n", - "\n", - "# 1. Define a dictionary with different values for RandomForestRegressor hyperparameters\n", - "# See documatation for potential different values - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html \n", - "rf_grid = {\"n_estimators\": np.arange(10, 200, 10),\n", - " \"max_depth\": [None, 10, 20],\n", - " \"min_samples_split\": np.arange(2, 10, 1), # min_samples_split must be an int in the range [2, inf) or a float in the range (0.0, 1.0]\n", - " \"min_samples_leaf\": np.arange(1, 10, 1),\n", - " \"max_features\": [0.5, 1.0, \"sqrt\"], # Note: \"max_features='auto'\" is equivalent to \"max_features=1.0\", as of Scikit-Learn version 1.1\n", - " \"max_samples\": [10000]}\n", - "\n", - "# 2. 
Setup instance of RandomizedSearchCV to explore different parameters \n", - "rs_model = RandomizedSearchCV(estimator=RandomForestRegressor(), # can pass new model instance directly, all settings will be taken from the rf_grid\n", - " param_distributions=rf_grid,\n", - " n_iter=20,\n", - " # scoring=\"neg_root_mean_squared_log_error\", # want to optimize for RMSLE, though sometimes optimizing for the default metric (R^2) can lead to just as good results all round\n", - " cv=3,\n", - " verbose=3) # control how much output gets produced, higher number = more output\n", - "\n", - "# 3. Fit the model using a series of different hyperparameter values\n", - "rs_model.fit(X=X_train_preprocessed, \n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Phew! That's quite a bit of testing!\n", - "\n", - "Good news for us is that we can check the best hyperparameters with the `best_params_` attribute." - ] - }, - { - "cell_type": "code", - "execution_count": 113, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'n_estimators': np.int64(80),\n", - " 'min_samples_split': np.int64(3),\n", - " 'min_samples_leaf': np.int64(2),\n", - " 'max_samples': 10000,\n", - " 'max_features': 0.5,\n", - " 'max_depth': 20}" - ] - }, - "execution_count": 113, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find the best parameters from RandomizedSearchCV\n", - "rs_model.best_params_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And we can evaluate this model with our `show_scores` function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 114, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(5804.886346446167),\n", - " 'Valid MAE': np.float64(7271.010705137403),\n", - " 'Training RMSLE': np.float64(0.2668477962708691),\n", - " 'Valid RMSLE': np.float64(0.2985683128197976),\n", - " 'Training R^2': 0.8494436266937344,\n", - " 'Valid R^2': 0.8280568050158131}" - ] - }, - "execution_count": 114, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Evaluate the RandomizedSearch model\n", - "rs_model_scores = show_scores(rs_model)\n", - "rs_model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.3 Training a model with the best hyperparameters\n", - "\n", - "Like all good machine learning cooking shows, I prepared a model earlier. \n", - "\n", - "I tried 100 different combinations of hyperparameters (setting `n_iter=100` in `RandomizedSearchCV`) and found the best results came from the settings below.\n", - "\n", - "* `n_estimators=90`\n", - "* `max_depth=None`\n", - "* `min_samples_leaf=1`\n", - "* `min_samples_split=5`\n", - "* `max_features=0.5`\n", - "* `n_jobs=-1`\n", - "* `max_samples=None`\n", - "\n", - "> **Note:** This search (`n_iter=100`) took ~2-hours on my MacBook Pro M1 Pro. So it's kind of a set and come back later experiment. That's one of the things you'll have to get used to as a machine learning engineer, figuring out what to do whilst your model trains. I like to go for long walks or to the gym (rule of thumb: while my model trains, I train).\n", - "\n", - "We'll instantiate a new model with these discovered hyperparameters and reset the `max_samples` back to its original value." 
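Rather than typing the values out by hand, one option is to start from a `best_params_`-style dictionary and override only the settings we want to change. The snippet below is a minimal sketch of that idea, using the values from the 20-iteration `best_params_` output shown earlier as an example (in the notebook you could pass `rs_model.best_params_` instead). The names `full_data_params` and `full_data_model` are made up for illustration.

```python
from sklearn.ensemble import RandomForestRegressor

# Example values copied from the 20-iteration `best_params_` output shown earlier
best_params = {
    "n_estimators": 80,
    "min_samples_split": 3,
    "min_samples_leaf": 2,
    "max_samples": 10000,
    "max_features": 0.5,
    "max_depth": 20,
}

# Keep the found settings but let every tree draw from all rows again
# (max_samples=None) and use all available CPU cores (n_jobs=-1)
full_data_params = {**best_params, "max_samples": None, "n_jobs": -1}

# Create (but don't fit) a model with the combined settings
full_data_model = RandomForestRegressor(**full_data_params)
print(full_data_model.get_params()["max_samples"])
```

Writing the hyperparameters out explicitly, as this notebook does, makes the final settings easier to read at a glance, so either approach is reasonable.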
- ] - }, - { - "cell_type": "code", - "execution_count": 115, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 4min 6s, sys: 4min 34s, total: 8min 40s\n", - "Wall time: 2min\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_features=0.5, min_samples_split=5, n_estimators=90,\n",
-       "                      n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_features=0.5, min_samples_split=5, n_estimators=90,\n", - " n_jobs=-1)" - ] - }, - "execution_count": 115, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Create a model with best found hyperparameters \n", - "# Note: There may be better values out there with longer searches but these are \n", - "# the best I found with a ~2 hour search. A good challenge would be to see if you \n", - "# can find better values.\n", - "ideal_model = RandomForestRegressor(n_estimators=90,\n", - " max_depth=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=5,\n", - " max_features=0.5,\n", - " n_jobs=-1,\n", - " max_samples=None)\n", - "\n", - "# Fit a model to the preprocessed data\n", - "ideal_model.fit(X=X_train_preprocessed, \n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And of course, we can evaluate our `ideal_model` with our `show_scores` function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 116, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 28.8 s, sys: 37.4 s, total: 1min 6s\n", - "Wall time: 14.5 s\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1955.980118634043),\n", - " 'Valid MAE': np.float64(5979.47564414195),\n", - " 'Training RMSLE': np.float64(0.10224456852444506),\n", - " 'Valid RMSLE': np.float64(0.24733387014318542),\n", - " 'Training R^2': 0.9809704227866279,\n", - " 'Valid R^2': 0.8810497144604977}" - ] - }, - "execution_count": 116, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Evaluate ideal model\n", - "ideal_model_scores = show_scores(model=ideal_model)\n", - "ideal_model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woohoo!\n", - "\n", - "With these new hyperparameters as well as using all the samples, we can see an improvement to our model's performance.\n", - "\n", - "One thing to keep in mind is that a larger model isn't always the best for a given problem even if it performs better.\n", - "\n", - "For example, you may require a model that performs inference (makes predictions) very fast with a slight tradeoff to performance.\n", - "\n", - "One way to get a faster model is to alter some of the hyperparameters to create a smaller overall model. \n", - "\n", - "In particular, we can lower `n_estimators`, since each additional estimator is another decision tree the forest has to build and query.\n", - "\n", - "Let's halve our `n_estimators` value and see how it goes." - ] - }, - { - "cell_type": "code", - "execution_count": 117, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 2min, sys: 1min 58s, total: 3min 58s\n", - "Wall time: 44.9 s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_features=0.5, min_samples_split=5, n_estimators=45,\n",
-       "                      n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_features=0.5, min_samples_split=5, n_estimators=45,\n", - " n_jobs=-1)" - ] - }, - "execution_count": 117, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Halve the number of estimators\n", - "fast_model = RandomForestRegressor(n_estimators=45,\n", - " max_depth=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=5,\n", - " max_features=0.5,\n", - " n_jobs=-1,\n", - " max_samples=None)\n", - "\n", - "# Fit the faster model to the data\n", - "fast_model.fit(X=X_train_preprocessed, \n", - " y=y_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice! The faster model fits to the training data in about half the time of the full model.\n", - "\n", - "Now how does it go on performance?\n" - ] - }, - { - "cell_type": "code", - "execution_count": 118, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 14.6 s, sys: 23.7 s, total: 38.3 s\n", - "Wall time: 9.59 s\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1989.0544948757317),\n", - " 'Valid MAE': np.float64(6029.137329100962),\n", - " 'Training RMSLE': np.float64(0.10373049008046713),\n", - " 'Valid RMSLE': np.float64(0.24897544966690316),\n", - " 'Training R^2': 0.9802744452357592,\n", - " 'Valid R^2': 0.8788749110488039}" - ] - }, - "execution_count": 118, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Get results from the fast model\n", - "fast_model_scores = show_scores(model=fast_model)\n", - "fast_model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! Looks like our faster model evaluates (performs inference/makes predictions) in about half the time too.\n", - "\n", - "And only for a small tradeoff in validation RMSLE performance." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.4 Comparing our models' scores\n", - "\n", - "We've now got scores for four models trained with varying amounts of data and hyperparameters.\n", - "\n", - "Let's compile the results into a DataFrame and then make a plot to compare them." - ] - }, - { - "cell_type": "code", - "execution_count": 119, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| | Training MAE | Valid MAE | Training RMSLE | Valid RMSLE | Training R^2 | Valid R^2 | model_name |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 1 | 5804.886346 | 7271.010705 | 0.266848 | 0.298568 | 0.849444 | 0.828057 | random_search_model |
| 0 | 5605.344206 | 7176.165115 | 0.260301 | 0.293584 | 0.858112 | 0.828550 | default_model |
| 3 | 1989.054495 | 6029.137329 | 0.103730 | 0.248975 | 0.980274 | 0.878875 | fast_model |
| 2 | 1955.980119 | 5979.475644 | 0.102245 | 0.247334 | 0.980970 | 0.881050 | ideal_model |
\n", - "
" - ], - "text/plain": [ - " Training MAE Valid MAE Training RMSLE Valid RMSLE Training R^2 \\\n", - "1 5804.886346 7271.010705 0.266848 0.298568 0.849444 \n", - "0 5605.344206 7176.165115 0.260301 0.293584 0.858112 \n", - "3 1989.054495 6029.137329 0.103730 0.248975 0.980274 \n", - "2 1955.980119 5979.475644 0.102245 0.247334 0.980970 \n", - "\n", - " Valid R^2 model_name \n", - "1 0.828057 random_search_model \n", - "0 0.828550 default_model \n", - "3 0.878875 fast_model \n", - "2 0.881050 ideal_model " - ] - }, - "execution_count": 119, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Add names of models to dictionaries\n", - "base_model_scores[\"model_name\"] = \"default_model\"\n", - "rs_model_scores[\"model_name\"] = \"random_search_model\"\n", - "ideal_model_scores[\"model_name\"] = \"ideal_model\" \n", - "fast_model_scores[\"model_name\"] = \"fast_model\" \n", - "\n", - "# Turn all model score dictionaries into a list\n", - "all_model_scores = [base_model_scores, \n", - " rs_model_scores, \n", - " ideal_model_scores,\n", - " fast_model_scores]\n", - "\n", - "# Create DataFrame and sort model scores by validation RMSLE\n", - "model_comparison_df = pd.DataFrame(all_model_scores).sort_values(by=\"Valid RMSLE\", ascending=False)\n", - "model_comparison_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we've got our model result data in a DataFrame, let's turn it into a bar plot comparing the validation RMSLE of each model." - ] - }, - { - "cell_type": "code", - "execution_count": 120, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Get mean RSMLE score of all models\n", - "mean_rsmle_score = model_comparison_df[\"Valid RMSLE\"].mean()\n", - "\n", - "# Plot validation RMSLE against each other \n", - "plt.figure(figsize=(10, 5))\n", - "plt.bar(x=model_comparison_df[\"model_name\"],\n", - " height=model_comparison_df[\"Valid RMSLE\"].values)\n", - "plt.xlabel(\"Model\")\n", - "plt.ylabel(\"Validation RMSLE (lower is better)\")\n", - "plt.xticks(rotation=0, fontsize=10);\n", - "plt.axhline(y=mean_rsmle_score, \n", - " color=\"red\", \n", - " linestyle=\"--\", \n", - " label=f\"Mean RMSLE: {mean_rsmle_score:.4f}\")\n", - "plt.legend();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "By the looks of the plot, our `ideal_model` is indeed the ideal model, slightly edging out `fast_model` in terms of validation RMSLE." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 6. Saving our best model to file\n", - "\n", - "Since we've confirmed our best model as our `ideal_model` object, we can save it to file so we can load it in later and use it without having to retrain it.\n", - "\n", - "> **Note:** For more on model saving options with Scikit-Learn, see the [documentation on model persistence](https://scikit-learn.org/stable/model_persistence.html).\n", - "\n", - "To save our model we can use the [`joblib.dump`](https://joblib.readthedocs.io/en/stable/generated/joblib.dump.html) method." 
- ] - }, - { - "cell_type": "code", - "execution_count": 121, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['randomforest_regressor_best_RMSLE.pkl']" - ] - }, - "execution_count": 121, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import joblib\n", - "\n", - "bulldozer_price_prediction_model_name = \"randomforest_regressor_best_RMSLE.pkl\"\n", - "\n", - "# Save model to file\n", - "joblib.dump(value=ideal_model, \n", - " filename=bulldozer_price_prediction_model_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And to load our model we can use the [`joblib.load`](https://joblib.readthedocs.io/en/stable/generated/joblib.load.html) method." - ] - }, - { - "cell_type": "code", - "execution_count": 122, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - "RandomForestRegressor(max_features=0.5, min_samples_split=5, n_estimators=90,\n", - " n_jobs=-1)" - ] - }, - "execution_count": 122, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Load the best model\n", - "best_model = joblib.load(filename=bulldozer_price_prediction_model_name)\n", - "best_model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can make sure our model saving and loading worked by evaluating our `best_model` with `show_scores`." - ] - }, - { - "cell_type": "code", - "execution_count": 123, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1955.9801186340424),\n", - " 'Valid MAE': np.float64(5979.47564414195),\n", - " 'Training RMSLE': np.float64(0.10224456852444506),\n", - " 'Valid RMSLE': np.float64(0.24733387014318542),\n", - " 'Training R^2': 0.9809704227866279,\n", - " 'Valid R^2': 0.8810497144604977}" - ] - }, - "execution_count": 123, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Confirm that the model works\n", - "best_model_scores = show_scores(model=best_model)\n", - "best_model_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And to confirm our `ideal_model` and `best_model` results are very close (if not the exact same), we can compare them with:\n", - "* The equality operator `==`.\n", - "* [`np.iclose`](https://numpy.org/doc/stable/reference/generated/numpy.isclose.html) and setting the absolute tolerance (`atol`) to `1e-4`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 124, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "np.True_" - ] - }, - "execution_count": 124, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# See if loaded model and pre-saved model results are the same\n", - "# Note: these values may be very slightly different depending on how precisely your computer stores values.\n", - "best_model_scores[\"Valid RMSLE\"] == ideal_model_scores[\"Valid RMSLE\"]" - ] - }, - { - "cell_type": "code", - "execution_count": 125, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model results are close!\n" - ] - } - ], - "source": [ - "# Is the loaded model as good as the non-loaded model?\n", - "if np.isclose(a=best_model_scores[\"Valid RMSLE\"],\n", - "              b=ideal_model_scores[\"Valid RMSLE\"],\n", - "              atol=1e-4): # Values must be within roughly 0.0001 of each other (plus NumPy's default relative tolerance)\n", - "    print(\"[INFO] Model results are close!\")\n", - "else:\n", - "    print(\"[INFO] Model results aren't close, did something go wrong?\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> **Note:** When saving and loading a model, the loaded model's scores can differ very slightly from the original's in the last decimal places. For example, the pre-saved model may have an RMSLE of `0.24654150224930685` whereas the loaded model may have an RMSLE of `0.24654150224930684`, a difference of `0.00000000000000001` (a very small number). This is due to the [precision of computing](https://en.wikipedia.org/wiki/Precision_(computer_science)) and the way computers store values: floating-point numbers can only be represented up to a certain amount of precision. This is why we generally compare results with many decimals using `np.isclose` rather than the `==` operator." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 7. 
Making predictions on test data\n", - "\n", - "Now that we've got a trained model saved and loaded, it's time to make predictions on the test data.\n", - "\n", - "Our model is trained on data up to 2011, whereas the test data is from May 1, 2012 to November 2012.\n", - "\n", - "So what we're doing is using the patterns our model has learned from the training data to predict the sale price of bulldozers it has never seen before, whose characteristics are assumed to be similar to those in the training data.\n", - "\n", - "Let's load in the test data from `Test.csv`, making sure to parse the dates of the `saledate` column." - ] - }, - { - "cell_type": "code", - "execution_count": 126, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1227829 1006309 3168 121 3 1999 \n", - "1 1227844 1022817 7271 121 3 1000 \n", - "2 1227847 1031560 22805 121 3 2004 \n", - "3 1227848 56204 1269 121 3 2006 \n", - "4 1227863 1053887 22312 121 3 2005 \n", - "\n", - " MachineHoursCurrentMeter UsageBand saledate fiModelDesc ... \\\n", - "0 3688.0 Low 2012-05-03 580G ... \n", - "1 28555.0 High 2012-05-10 936 ... \n", - "2 6038.0 Medium 2012-05-10 EC210BLC ... \n", - "3 8940.0 High 2012-05-10 330CL ... \n", - "4 2286.0 Low 2012-05-10 650K ... \n", - "\n", - " Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN \n", - "2 None or Unspecified 9' 6\" Manual None or Unspecified \n", - "3 None or Unspecified None or Unspecified Manual Yes \n", - "4 NaN NaN NaN NaN \n", - "\n", - " Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN \n", - "2 Double NaN NaN NaN \n", - "3 Triple NaN NaN NaN \n", - "4 NaN None or Unspecified PAT None or Unspecified \n", - "\n", - " Differential_Type Steering_Controls \n", - "0 NaN NaN \n", - "1 Standard Conventional \n", - "2 NaN NaN \n", - "3 NaN NaN \n", - "4 NaN NaN \n", - "\n", - "[5 rows x 52 columns]" - ] - }, - "execution_count": 126, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Load the test data\n", - "test_df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/Test.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "test_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You might notice that the `test_df` is missing the `SalePrice` column.\n", - "\n", - "That's because that's the variable we're trying to predict based on all of the other variables.\n", - "\n", - "We can make predictions with our `best_model` using the [`predict` 
method](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.predict)." - ] - }, - { - "cell_type": "code", - "execution_count": 127, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "The feature names should match those that were passed during fit.\nFeature names unseen at fit time:\n- saledate\nFeature names seen at fit time, yet now missing:\n- saleDay\n- saleDayofweek\n- saleDayofyear\n- saleMonth\n- saleYear\n", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[127], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m# Let's see how the model goes predicting on the test data\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m test_preds \u001b[38;5;241m=\u001b[39m \u001b[43mbest_model\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mpredict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mtest_df\u001b[49m\u001b[43m)\u001b[49m\n", - "File \u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:1063\u001b[0m, in \u001b[0;36mForestRegressor.predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 1061\u001b[0m check_is_fitted(\u001b[38;5;28mself\u001b[39m)\n\u001b[1;32m 1062\u001b[0m \u001b[38;5;66;03m# Check data\u001b[39;00m\n\u001b[0;32m-> 1063\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_X_predict\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 1065\u001b[0m \u001b[38;5;66;03m# Assign chunk of trees to jobs\u001b[39;00m\n\u001b[1;32m 1066\u001b[0m n_jobs, _, _ \u001b[38;5;241m=\u001b[39m 
_partition_estimators(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mn_estimators, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mn_jobs)\n", - "File \u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py:641\u001b[0m, in \u001b[0;36mBaseForest._validate_X_predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 638\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 639\u001b[0m force_all_finite \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[0;32m--> 641\u001b[0m X \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_validate_data\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 642\u001b[0m \u001b[43m \u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 643\u001b[0m \u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mDTYPE\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 644\u001b[0m \u001b[43m \u001b[49m\u001b[43maccept_sparse\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mcsr\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[1;32m 645\u001b[0m \u001b[43m \u001b[49m\u001b[43mreset\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[1;32m 646\u001b[0m \u001b[43m \u001b[49m\u001b[43mforce_all_finite\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mforce_all_finite\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 647\u001b[0m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 648\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m issparse(X) \u001b[38;5;129;01mand\u001b[39;00m (X\u001b[38;5;241m.\u001b[39mindices\u001b[38;5;241m.\u001b[39mdtype \u001b[38;5;241m!=\u001b[39m np\u001b[38;5;241m.\u001b[39mintc \u001b[38;5;129;01mor\u001b[39;00m X\u001b[38;5;241m.\u001b[39mindptr\u001b[38;5;241m.\u001b[39mdtype \u001b[38;5;241m!=\u001b[39m 
np\u001b[38;5;241m.\u001b[39mintc):\n\u001b[1;32m 649\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mNo support for np.int64 index based sparse matrices\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n", - "File \u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py:608\u001b[0m, in \u001b[0;36mBaseEstimator._validate_data\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 537\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m \u001b[38;5;21m_validate_data\u001b[39m(\n\u001b[1;32m 538\u001b[0m \u001b[38;5;28mself\u001b[39m,\n\u001b[1;32m 539\u001b[0m X\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mno_validation\u001b[39m\u001b[38;5;124m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 544\u001b[0m \u001b[38;5;241m*\u001b[39m\u001b[38;5;241m*\u001b[39mcheck_params,\n\u001b[1;32m 545\u001b[0m ):\n\u001b[1;32m 546\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"Validate input data and set or check the `n_features_in_` attribute.\u001b[39;00m\n\u001b[1;32m 547\u001b[0m \n\u001b[1;32m 548\u001b[0m \u001b[38;5;124;03m Parameters\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 606\u001b[0m \u001b[38;5;124;03m validated.\u001b[39;00m\n\u001b[1;32m 607\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 608\u001b[0m \u001b[38;5;28;43mself\u001b[39;49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m_check_feature_names\u001b[49m\u001b[43m(\u001b[49m\u001b[43mX\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mreset\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mreset\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 610\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m y \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m 
\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_get_tags()[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mrequires_y\u001b[39m\u001b[38;5;124m\"\u001b[39m]:\n\u001b[1;32m 611\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 612\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mThis \u001b[39m\u001b[38;5;132;01m{\u001b[39;00m\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__class__\u001b[39m\u001b[38;5;241m.\u001b[39m\u001b[38;5;18m__name__\u001b[39m\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m estimator \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 613\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mrequires y to be passed, but the target y is None.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 614\u001b[0m )\n", - "File \u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py:535\u001b[0m, in \u001b[0;36mBaseEstimator._check_feature_names\u001b[0;34m(self, X, reset)\u001b[0m\n\u001b[1;32m 530\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m missing_names \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m unexpected_names:\n\u001b[1;32m 531\u001b[0m message \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m (\n\u001b[1;32m 532\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFeature names must be in the same order as they were in fit.\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 533\u001b[0m )\n\u001b[0;32m--> 535\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(message)\n", - "\u001b[0;31mValueError\u001b[0m: The feature names should match those that were passed during fit.\nFeature names unseen at fit time:\n- saledate\nFeature names seen at fit time, yet now missing:\n- saleDay\n- saleDayofweek\n- saleDayofyear\n- saleMonth\n- saleYear\n" - ] - } - ], - "source": [ - "# 
Let's see how the model goes predicting on the test data\n", - "test_preds = best_model.predict(X=test_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Oh no!\n", - "\n", - "We get an error:\n", - "\n", - "> ValueError: The feature names should match those that were passed during fit.\n", - "> Feature names unseen at fit time:\n", - "> - saledate\n", - "> Feature names seen at fit time, yet now missing:\n", - "> - saleDay\n", - "> - saleDayofweek\n", - "> - saleDayofyear\n", - "> - saleMonth\n", - "> - saleYear\n", - "\n", - "Ahhh... the test data isn't in the same format as our training data, so we have to fix it. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 7.1 Preprocessing the test data (to be in the same format as the training data)\n", - "\n", - "Our model has been trained on data preprocessed in a certain way. \n", - "\n", - "This means that to make predictions on the test data, we need to preprocess it with the same steps we used on the training data.\n", - "\n", - "Remember, whatever you do to preprocess the training data, you have to do to the test data.\n", - "\n", - "Let's recreate the steps we used for preprocessing the training data, except this time we'll apply them to the test data. \n", - "\n", - "First, we'll add the extra date features to break down the `saledate` column." - ] - }, - { - "cell_type": "code", - "execution_count": 128, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1227829 1006309 3168 121 3 1999 \n", - "1 1227844 1022817 7271 121 3 1000 \n", - "2 1227847 1031560 22805 121 3 2004 \n", - "3 1227848 56204 1269 121 3 2006 \n", - "4 1227863 1053887 22312 121 3 2005 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 3688.0 Low 580G 580 ... \n", - "1 28555.0 High 936 936 ... \n", - "2 6038.0 Medium EC210BLC EC210 ... \n", - "3 8940.0 High 330CL 330 ... \n", - "4 2286.0 Low 650K 650 ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN Standard \n", - "2 NaN NaN NaN NaN \n", - "3 NaN NaN NaN NaN \n", - "4 None or Unspecified PAT None or Unspecified NaN \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek saleDayofyear \n", - "0 NaN 2012 5 3 3 124 \n", - "1 Conventional 2012 5 10 3 131 \n", - "2 NaN 2012 5 10 3 131 \n", - "3 NaN 2012 5 10 3 131 \n", - "4 NaN 2012 5 10 3 131 \n", - "\n", - "[5 rows x 56 columns]" - ] - }, - "execution_count": 128, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Make a function to add date columns\n", - "def add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - " # Add datetime parameters for saledate\n", - " df[\"saleYear\"] = df[date_column].dt.year\n", - " df[\"saleMonth\"] = df[date_column].dt.month\n", - " df[\"saleDay\"] = df[date_column].dt.day\n", - " df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - " df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - "\n", - " # Drop original saledate column\n", - " df.drop(\"saledate\", axis=1, inplace=True)\n", - "\n", - " return df\n", - "\n", - "# Preprocess test_df to have same columns as train_df (add the datetime features)\n", - "test_df = add_datetime_features_to_df(df=test_df)\n", - "test_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": 
{}, - "source": [ - "Date features added!\n", - "\n", - "Now can we make predictions with our model on the test data? " - ] - }, - { - "cell_type": "code", - "execution_count": 129, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "could not convert string to float: 'Low'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_20423/2042912174.py\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# Try to predict with model\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mtest_preds\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mbest_model\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpredict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtest_df\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 1059\u001b[0m \u001b[0mThe\u001b[0m \u001b[0mpredicted\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1060\u001b[0m \"\"\"\n\u001b[1;32m 1061\u001b[0m \u001b[0mcheck_is_fitted\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1062\u001b[0m \u001b[0;31m# Check data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1063\u001b[0;31m \u001b[0mX\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_validate_X_predict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1064\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1065\u001b[0m \u001b[0;31m# Assign chunk of trees to jobs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1066\u001b[0m \u001b[0mn_jobs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0m_\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_partition_estimators\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_estimators\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 637\u001b[0m \u001b[0mforce_all_finite\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m\"allow-nan\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 638\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 639\u001b[0m \u001b[0mforce_all_finite\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 640\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 641\u001b[0;31m X = self._validate_data(\n\u001b[0m\u001b[1;32m 642\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 643\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mDTYPE\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 644\u001b[0m 
\u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"csr\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 629\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 630\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 631\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 632\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 633\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"X\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 634\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_y\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 635\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_check_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 636\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m )\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not 
supported\\n{}\\n\".format(array)\n 1016 ) from complex_warning\n", - "~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py in ?(array, dtype, order, copy, xp, device)\n 747 # Use NumPy API to support order\n 748 if copy is True:\n 749 array = numpy.array(array, order=order, dtype=dtype)\n 750 else:\n--> 751 array = numpy.asarray(array, order=order, dtype=dtype)\n 752 \n 753 # At this point array is a NumPy ndarray. 
We convert it to an array\n 754 # container that is consistent with the input's namespace.\n", - "~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, dtype, copy)\n 2149 def __array__(\n 2150 self, dtype: npt.DTypeLike | None = None, copy: bool_t | None = None\n 2151 ) -> np.ndarray:\n 2152 values = self._values\n-> 2153 arr = np.asarray(values, dtype=dtype)\n 2154 if (\n 2155 astype_is_view(values.dtype, arr.dtype)\n 2156 and 
using_copy_on_write()\n", - "ValueError: could not convert string to float: 'Low'" - ] - } - ], - "source": [ - "# Try to predict with model\n", - "test_preds = best_model.predict(test_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another error...\n", - "\n", - "> ValueError: could not convert string to float: 'Low'\n", - "\n", - "We can fix this by running our `ordinal_encoder` (that we used to preprocess the training data) on the categorical features in our test DataFrame. " - ] - }, - { - "cell_type": "code", - "execution_count": 130, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 12457 entries, 0 to 12456\n", - "Data columns (total 56 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 12457 non-null int64 \n", - " 1 MachineID 12457 non-null int64 \n", - " 2 ModelID 12457 non-null int64 \n", - " 3 datasource 12457 non-null int64 \n", - " 4 auctioneerID 12457 non-null int64 \n", - " 5 YearMade 12457 non-null int64 \n", - " 6 MachineHoursCurrentMeter 2129 non-null float64\n", - " 7 UsageBand 12457 non-null float64\n", - " 8 fiModelDesc 12349 non-null float64\n", - " 9 fiBaseModel 12431 non-null float64\n", - " 10 fiSecondaryDesc 12449 non-null float64\n", - " 11 fiModelSeries 12456 non-null float64\n", - " 12 fiModelDescriptor 12452 non-null float64\n", - " 13 ProductSize 12457 non-null float64\n", - " 14 fiProductClassDesc 12457 non-null float64\n", - " 15 state 12457 non-null float64\n", - " 16 ProductGroup 12457 non-null float64\n", - " 17 ProductGroupDesc 12457 non-null float64\n", - " 18 Drive_System 12457 non-null float64\n", - " 19 Enclosure 12457 non-null float64\n", - " 20 Forks 12457 non-null float64\n", - " 21 Pad_Type 12457 non-null float64\n", - " 22 Ride_Control 
12457 non-null float64\n", - " 23 Stick 12457 non-null float64\n", - " 24 Transmission 12457 non-null float64\n", - " 25 Turbocharged 12457 non-null float64\n", - " 26 Blade_Extension 12457 non-null float64\n", - " 27 Blade_Width 12457 non-null float64\n", - " 28 Enclosure_Type 12457 non-null float64\n", - " 29 Engine_Horsepower 12457 non-null float64\n", - " 30 Hydraulics 12457 non-null float64\n", - " 31 Pushblock 12457 non-null float64\n", - " 32 Ripper 12457 non-null float64\n", - " 33 Scarifier 12457 non-null float64\n", - " 34 Tip_Control 12457 non-null float64\n", - " 35 Tire_Size 12457 non-null float64\n", - " 36 Coupler 12457 non-null float64\n", - " 37 Coupler_System 12457 non-null float64\n", - " 38 Grouser_Tracks 12457 non-null float64\n", - " 39 Hydraulics_Flow 12457 non-null float64\n", - " 40 Track_Type 12457 non-null float64\n", - " 41 Undercarriage_Pad_Width 12457 non-null float64\n", - " 42 Stick_Length 12457 non-null float64\n", - " 43 Thumb 12457 non-null float64\n", - " 44 Pattern_Changer 12457 non-null float64\n", - " 45 Grouser_Type 12457 non-null float64\n", - " 46 Backhoe_Mounting 12457 non-null float64\n", - " 47 Blade_Type 12457 non-null float64\n", - " 48 Travel_Controls 12457 non-null float64\n", - " 49 Differential_Type 12457 non-null float64\n", - " 50 Steering_Controls 12457 non-null float64\n", - " 51 saleYear 12457 non-null int32 \n", - " 52 saleMonth 12457 non-null int32 \n", - " 53 saleDay 12457 non-null int32 \n", - " 54 saleDayofweek 12457 non-null int32 \n", - " 55 saleDayofyear 12457 non-null int32 \n", - "dtypes: float64(45), int32(5), int64(6)\n", - "memory usage: 5.1 MB\n" - ] - } - ], - "source": [ - "# Create a copy of the test DataFrame to keep the original intact\n", - "test_df_preprocessed = test_df.copy()\n", - "\n", - "# Transform the categorical features of the test DataFrame into numbers\n", - "test_df_preprocessed[categorical_features] = 
ordinal_encoder.transform(test_df_preprocessed[categorical_features].astype(str))\n", - "test_df_preprocessed.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, date features created and categorical features turned into numbers, can we make predictions on the test data now?" - ] - }, - { - "cell_type": "code", - "execution_count": 131, - "metadata": {}, - "outputs": [], - "source": [ - "# Make predictions on the preprocessed test data\n", - "test_preds = best_model.predict(test_df_preprocessed)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Holy smokes! It worked!\n", - "\n", - "Let's check out our `test_preds`." - ] - }, - { - "cell_type": "code", - "execution_count": 132, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([14384.79497354, 31377.65862841, 48589.23540965, 95857.57194966,\n", - " 26910.53992304, 29401.41534392, 27061.53819945, 20377.23364598,\n", - " 17325.67857143, 33646.67768959])" - ] - }, - "execution_count": 132, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check the first 10 test predictions\n", - "test_preds[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful, looks like we're getting the price predictions of a given bulldozer.\n", - "\n", - "How many predictions are there?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 133, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "((12457,), (12457, 56))" - ] - }, - "execution_count": 133, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check number of test predictions\n", - "test_preds.shape, test_df.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Perfect, looks like there's one prediction per sample in the test DataFrame.\n", - "\n", - "Now how would we submit our predictions to Kaggle?\n", - "\n", - "Looking at the [Kaggle submission requirements](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation), we see that a submission is required to be in a certain format. \n", - "\n", - "Namely, a DataFrame containing the `SalesID` and the predicted `SalePrice` of the bulldozer.\n", - "\n", - "Let's make it." - ] - }, - { - "cell_type": "code", - "execution_count": 134, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " SalesID SalePrice\n", - "6517 6304619 78522.550705\n", - "922 1231204 13500.628307\n", - "6859 6311050 10891.180556\n", - "6634 6307731 28503.776455\n", - "8882 6447290 50641.411817" - ] - }, - "execution_count": 134, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create DataFrame compatible with Kaggle submission requirements\n", - "pred_df = pd.DataFrame()\n", - "pred_df[\"SalesID\"] = test_df[\"SalesID\"]\n", - "pred_df[\"SalePrice\"] = test_preds\n", - "pred_df.sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent! We've got a `SalePrice` prediction for every `SalesID` in the test DataFrame.\n", - "\n", - "Let's save this to CSV so we could upload it or share it with someone else if we had to." - ] - }, - { - "cell_type": "code", - "execution_count": 135, - "metadata": {}, - "outputs": [], - "source": [ - "# Export test dataset predictions to CSV\n", - "pred_df.to_csv(\"../data/bluebook-for-bulldozers/predictions.csv\",\n", - " index=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 8. Making a prediction on a custom sample\n", - "\n", - "We've made predictions on the test dataset, which contains sale data from May to November 2012.\n", - "\n", - "But how does our model perform on a more recent bulldozer sale?\n", - "\n", - "If we were to find an advertisement for a bulldozer sale, could we use our model on the information in the advertisement to predict the sale price?\n", - "\n", - "In other words, how could we use our model on a single custom sample?\n", - "\n", - "It's one thing to predict on data that has already been formatted but it's another thing to be able to predict on a completely new and unseen sample.\n", - "\n", - "> **Note:** For predicting on a custom sample, the same rules apply as making predictions on the test dataset. 
The data you make predictions on should be in the same format as the data your model was trained on. For example, it should have all the same features, and categorical values should be turned into numbers the same way (e.g. preprocessed by the `ordinal_encoder` we fit to the training set). Samples you collect from the wild are likely to be less well formatted than samples in a pre-existing dataset. So it's the job of the machine learning engineer to format/preprocess new samples in the same way as the data the model was trained on.\n", - "\n", - "If we're going to make a prediction on a custom sample, it'll need to be in the same format as our other datasets.\n", - "\n", - "So let's remind ourselves of the columns/features in our test dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 136, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "[{'SalesID': 1229148,\n", - " 'MachineID': 1042578,\n", - " 'ModelID': 9579,\n", - " 'datasource': 121,\n", - " 'auctioneerID': 3,\n", - " 'YearMade': 2004,\n", - " 'MachineHoursCurrentMeter': 3290.0,\n", - " 'UsageBand': 'Medium',\n", - " 'fiModelDesc': 'S250',\n", - " 'fiBaseModel': 'S250',\n", - " 'fiSecondaryDesc': 'nan',\n", - " 'fiModelSeries': 'nan',\n", - " 'fiModelDescriptor': 'nan',\n", - " 'ProductSize': 'nan',\n", - " 'fiProductClassDesc': 'Skid Steer Loader - 2201.0 to 2701.0 Lb Operating Capacity',\n", - " 'state': 'Missouri',\n", - " 'ProductGroup': 'SSL',\n", - " 'ProductGroupDesc': 'Skid Steer Loaders',\n", - " 'Drive_System': 'nan',\n", - " 'Enclosure': 'EROPS',\n", - " 'Forks': 'None or Unspecified',\n", - " 'Pad_Type': 'nan',\n", - " 'Ride_Control': 'nan',\n", - " 'Stick': 'nan',\n", - " 'Transmission': 'nan',\n", - " 'Turbocharged': 'nan',\n", - " 'Blade_Extension': 'nan',\n", - " 'Blade_Width': 'nan',\n", - " 'Enclosure_Type': 'nan',\n", - " 'Engine_Horsepower': 'nan',\n", - " 'Hydraulics': 'Auxiliary',\n", - " 'Pushblock': 'nan',\n", - " 'Ripper': 'nan',\n", - " 'Scarifier': 
'nan',\n", - " 'Tip_Control': 'nan',\n", - " 'Tire_Size': 'nan',\n", - " 'Coupler': 'Hydraulic',\n", - " 'Coupler_System': 'Yes',\n", - " 'Grouser_Tracks': 'None or Unspecified',\n", - " 'Hydraulics_Flow': 'Standard',\n", - " 'Track_Type': 'nan',\n", - " 'Undercarriage_Pad_Width': 'nan',\n", - " 'Stick_Length': 'nan',\n", - " 'Thumb': 'nan',\n", - " 'Pattern_Changer': 'nan',\n", - " 'Grouser_Type': 'nan',\n", - " 'Backhoe_Mounting': 'nan',\n", - " 'Blade_Type': 'nan',\n", - " 'Travel_Controls': 'nan',\n", - " 'Differential_Type': 'nan',\n", - " 'Steering_Controls': 'nan',\n", - " 'saleYear': 2012,\n", - " 'saleMonth': 6,\n", - " 'saleDay': 15,\n", - " 'saleDayofweek': 4,\n", - " 'saleDayofyear': 167}]" - ] - }, - "execution_count": 136, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get example from test_df\n", - "test_df_preprocessed_sample = test_df_preprocessed.sample(n=1, random_state=42)\n", - "\n", - "# Turn back into original format\n", - "test_df_unpreprocessed_sample = test_df_preprocessed_sample.copy() \n", - "test_df_unpreprocessed_sample[categorical_features] = ordinal_encoder.inverse_transform(test_df_unpreprocessed_sample[categorical_features])\n", - "test_df_unpreprocessed_sample.to_dict(orient=\"records\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful, so if we're going to make a prediction on a custom sample, we'll need to fill out these details as much as we can.\n", - "\n", - "Let's try and make a prediction on the example test sample." 
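One group of fields we never have to guess is the date features at the bottom of the dictionary above (`saleYear` through `saleDayofyear`), since they all come from a single sale date. Here's a minimal sketch of that breakdown using only Python's standard library. The helper name `make_date_features` is hypothetical (it isn't part of the notebook), and the sketch assumes `saleDayofweek` counts from Monday = 0, which agrees with the sample above: 15 June 2012 was a Friday (4) and the 167th day of the year.

```python
from datetime import date


def make_date_features(sale_date: date) -> dict:
    """Break a sale date into the five date features seen in the sample above.

    Hypothetical helper for illustration only (not part of the notebook).
    """
    return {
        "saleYear": sale_date.year,
        "saleMonth": sale_date.month,
        "saleDay": sale_date.day,
        "saleDayofweek": sale_date.weekday(),  # Monday = 0 ... Sunday = 6
        "saleDayofyear": sale_date.timetuple().tm_yday,  # 1 January = 1
    }


# Sale date of the example test sample shown above
sample_features = make_date_features(date(2012, 6, 15))
print(sample_features)
```

Deriving these fields from one date, rather than typing each one by hand, keeps them consistent with each other.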
- ] - }, - { - "cell_type": "code", - "execution_count": 137, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([13519.31657848])" - ] - }, - "execution_count": 137, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Make a prediction on the preprocessed test sample\n", - "best_model.predict(test_df_preprocessed_sample)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "We get an output array containing a predicted `SalePrice`.\n", - "\n", - "Let's now try it on a custom sample.\n", - "\n", - "Again, like all good machine learning cooking shows, I've searched the internet for \"bulldozer sales in America\" and [found a sale from 6th July 2024](https://www.purplewave.com/auction/240606/item/EK8504/2004-Caterpillar-D6R_XL-Crawlers-Crawler_Dozer-Missouri) (I'm writing these materials in mid 2024 so if it's many years in the future and the link doesn't work, check out the screenshot below). \n", - "\n", - "| [Image: screenshot of the bulldozer sale advertisement] | \n", - "|:--:| \n", - "| Screenshot of a bulldozer sale advertisement. I took information from this advertisement to create our own custom sample for testing our machine learning model on data from the wild. [Source](https://www.purplewave.com/auction/240606/item/EK8504/2004-Caterpillar-D6R_XL-Crawlers-Crawler_Dozer-Missouri). |\n", - "\n", - "I went through the advertisement online, collected as much detail as I could and filled out the dictionary below with all of the related fields.\n", - "\n", - "It may not be perfect but data in the real world is rarely perfect.\n", - "\n", - "For values I couldn't find or that were unclear, I filled them with `np.nan` (or `NaN`). 
\n", - "\n", - "Some values such as `SalesID` were unobtainable because they were part of the original collected dataset, for these I've also used `np.nan`.\n", - "\n", - "Also notice how I've already created the extra date features `saleYear`, `saleMonth`, `saleDay` and more by manually breaking down the listed sale date of 6 July 2024." - ] - }, - { - "cell_type": "code", - "execution_count": 138, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a dictionary of features and values from an internet-based bulldozer advertisement\n", - "# See link: https://www.purplewave.com/auction/240606/item/EK8504/2004-Caterpillar-D6R_XL-Crawlers-Crawler_Dozer-Missouri (note: this link is/was valid as of October 2024 but may be invalid in the future)\n", - "custom_sample = {\n", - " \"SalesID\": np.nan,\n", - " \"MachineID\": 8504,\n", - " \"ModelID\": np.nan,\n", - " \"datasource\": np.nan,\n", - " \"auctioneerID\": np.nan,\n", - " \"YearMade\": 2004,\n", - " \"MachineHoursCurrentMeter\": 11770.0,\n", - " \"UsageBand\": \"High\",\n", - " \"fiModelDesc\": \"D6RXL\",\n", - " \"fiBaseModel\": \"D6\",\n", - " \"fiSecondaryDesc\": \"XL\",\n", - " \"fiModelSeries\": np.nan,\n", - " \"fiModelDescriptor\": np.nan,\n", - " \"ProductSize\": \"Medium\",\n", - " \"fiProductClassDesc\": \"Track Type Tractor, Dozer - 130.0 to 160.0 Horsepower\",\n", - " \"state\": \"Missouri\",\n", - " \"ProductGroup\": \"TTT\",\n", - " \"ProductGroupDesc\": \"Track Type Tractors\",\n", - " \"Drive_System\": \"No\",\n", - " \"Enclosure\": \"EROPS\",\n", - " \"Forks\": \"None or Unspecified\",\n", - " \"Pad_Type\": \"Grouser\",\n", - " \"Ride_Control\": \"None or Unspecified\",\n", - " \"Stick\": \"nan\",\n", - " \"Transmission\": \"Powershift\",\n", - " \"Turbocharged\": \"None or Unspecified\",\n", - " \"Blade_Extension\": \"None or Unspecified\",\n", - " \"Blade_Width\": np.nan,\n", - " \"Enclosure_Type\": np.nan,\n", - " \"Engine_Horsepower\": np.nan,\n", - " \"Hydraulics\": np.nan,\n", - " 
\"Pushblock\": \"None or Unspecified\",\n", - " \"Ripper\": \"None or Unspecified\",\n", - " \"Scarifier\": \"None or Unspecified\",\n", - " \"Tip_Control\": \"Tip\",\n", - " \"Tire_Size\": np.nan,\n", - " \"Coupler\": np.nan,\n", - " \"Coupler_System\": np.nan,\n", - " \"Grouser_Tracks\": \"Yes\",\n", - " \"Hydraulics_Flow\": np.nan,\n", - " \"Track_Type\": \"Steel\",\n", - " \"Undercarriage_Pad_Width\": \"22 inch\",\n", - " \"Stick_Length\": np.nan,\n", - " \"Thumb\": np.nan,\n", - " \"Pattern_Changer\": np.nan,\n", - " \"Grouser_Type\": \"Single\",\n", - " \"Backhoe_Mounting\": \"None or Unspecified\",\n", - " \"Blade_Type\": \"Semi U\",\n", - " \"Travel_Controls\": np.nan,\n", - " \"Differential_Type\": np.nan,\n", - " \"Steering_Controls\": \"Command Control\",\n", - " \"saleYear\": 2024,\n", - " \"saleMonth\": 6,\n", - " \"saleDay\": 7,\n", - " \"saleDayofweek\": 5,\n", - " \"saleDayofyear\": 159\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we've got a single custom sample in the form of a dictionary, we can turn it into a DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 139, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 NaN 8504 NaN NaN NaN 2004 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 11770.0 High D6RXL D6 ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "0 None or Unspecified Semi U NaN NaN \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek saleDayofyear \n", - "0 Command Control 2024 6 7 5 159 \n", - "\n", - "[1 rows x 56 columns]" - ] - }, - "execution_count": 139, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Turn single sample into a DataFrame\n", - "custom_sample_df = pd.DataFrame(custom_sample, index=[0])\n", - "custom_sample_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And of course, we can preprocess the categorical features using our `ordinal_encoder` (we use the same instance of `OrdinalEncoder` that we fit on the training dataset)." - ] - }, - { - "cell_type": "code", - "execution_count": 140, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 NaN 8504 NaN NaN NaN 2004 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 11770.0 0.0 2308.0 703.0 ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "0 0.0 6.0 7.0 4.0 \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek \\\n", - "0 0.0 2024 6 7 5 \n", - "\n", - " saleDayofyear \n", - "0 159 \n", - "\n", - "[1 rows x 56 columns]" - ] - }, - "execution_count": 140, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Transform the categorical features of the custom sample\n", - "custom_sample_df[categorical_features] = ordinal_encoder.transform(custom_sample_df[categorical_features].astype(str))\n", - "custom_sample_df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Custom sample preprocessed, let's make a prediction!" - ] - }, - { - "cell_type": "code", - "execution_count": 141, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Predicted sale price of custom sample: $51474.96\n" - ] - } - ], - "source": [ - "# Make a prediction on the preprocessed custom sample\n", - "custom_sample_pred = best_model.predict(custom_sample_df)\n", - "print(f\"[INFO] Predicted sale price of custom sample: ${round(custom_sample_pred[0], 2)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now how close was this to the actual sale price (listed on the advertisement) of $72,600?" 
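Before measuring, it helps to see what the two metrics reduce to when there's only one sample: MAE is just the absolute difference between the two prices, and RMSLE is the absolute difference between `log(1 + price)` for each. Here's a minimal sketch with the standard library, using the listed price and the (rounded) prediction from above. The variable names are my own and this is only a by-hand check, not a replacement for Scikit-Learn's metric functions.

```python
import math

actual_price = 72600.0      # sale price listed on the advertisement
predicted_price = 51474.96  # model prediction printed above (rounded to cents)

# With one sample, the mean of the absolute errors is just the absolute error
mae = abs(actual_price - predicted_price)

# With one sample, the root of the mean squared log error is the absolute
# difference between log(1 + actual) and log(1 + predicted)
rmsle = abs(math.log1p(actual_price) - math.log1p(predicted_price))

print(f"MAE on one sample: {mae:.2f}")
print(f"RMSLE on one sample: {rmsle:.4f}")
```

Because RMSLE compares prices on a log scale, it reflects the ratio between the prediction and the actual price rather than the dollar gap, which is why a prediction can be thousands of dollars off and still have a small RMSLE.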
- ] - }, - { - "cell_type": "code", - "execution_count": 142, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model MAE on custom sample: 21125.040564373892\n", - "[INFO] Model RMSLE on custom sample: 0.3438638042344433\n" - ] - } - ], - "source": [ - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "\n", - "# Evaluate our model versus the actual sale price\n", - "custom_sample_actual_sale_price = [72600] # this is the sale price listed on the advertisement\n", - "\n", - "print(f\"[INFO] Model MAE on custom sample: {mean_absolute_error(y_pred=custom_sample_pred, y_true=custom_sample_actual_sale_price)}\")\n", - "print(f\"[INFO] Model RMSLE on custom sample: {root_mean_squared_log_error(y_pred=custom_sample_pred, y_true=custom_sample_actual_sale_price)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah!\n", - "\n", - "We get quite a high MAE value (the prediction is about $21,000 below the listed price). However, it looks like our model's RMSLE performance on the custom sample was even better than the `best_model` on the validation dataset.\n", - "\n", - "Not too bad for a model trained on sales data over 12 years older than our custom sample's sale date.\n", - "\n", - "> **Note:** In practice, to make this process easier, rather than typing out all of the feature values by hand, you might want to create an application capable of ingesting these values in a nice user interface. To create such machine learning applications, I'd practice by checking out [Streamlit](https://streamlit.io/) or [Gradio](https://www.gradio.app/)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 9. 
Finding the most important predictive features\n", - "\n", - "Since we've built a model which is able to make predictions, the people you share these predictions with (or yourself) might be curious of what parts of the data led to these predictions.\n", - "\n", - "This is where **feature importance** comes in. \n", - "\n", - "Feature importance seeks to figure out which different attributes of the data were most important when it comes to predicting the **target variable**.\n", - "\n", - "In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?\n", - "\n", - "We can do this for our `sklearn.ensemble.RandomForestRegressor` instance using the [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_) attribute.\n", - "\n", - "Let's check it out." - ] - }, - { - "cell_type": "code", - "execution_count": 143, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([3.78948522e-02, 2.70954102e-02, 5.85804002e-02, 1.79438322e-03,\n", - " 5.25621132e-03, 1.92040831e-01, 6.71461619e-03, 1.42137572e-03,\n", - " 4.79324438e-02, 4.73967258e-02, 4.12235661e-02, 4.75379381e-03,\n", - " 2.55283197e-02, 1.60578799e-01, 5.08919397e-02, 8.34245434e-03,\n", - " 3.43077232e-03, 4.16871935e-03, 1.45645185e-03, 6.32089976e-02,\n", - " 1.93106853e-03, 7.91189110e-04, 2.16468186e-03, 2.42755109e-04,\n", - " 1.44729959e-03, 1.10292279e-04, 4.69525167e-03, 4.70046399e-03,\n", - " 2.18877572e-03, 4.03668217e-03, 4.46781002e-03, 2.86947732e-03,\n", - " 5.20668987e-03, 3.50894384e-03, 1.75215277e-03, 1.16769900e-02,\n", - " 1.84682779e-03, 2.08450645e-02, 1.17370327e-02, 5.26785421e-03,\n", - " 2.07101299e-03, 1.36424627e-03, 1.60680297e-03, 9.71604299e-04,\n", - " 7.85735364e-04, 7.29302663e-04, 6.74283032e-04, 3.28828690e-03,\n", - " 2.83781098e-03, 
3.92432694e-04, 4.73081800e-04, 7.26042221e-02,\n", - " 5.42512552e-03, 8.54840059e-03, 4.42773246e-03, 1.26015531e-02])" - ] - }, - "execution_count": 143, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find feature importance of our best model\n", - "best_model_feature_importances = best_model.feature_importances_\n", - "best_model_feature_importances" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah, looks like we get one value per feature in our dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 144, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Number of feature importance values: 56\n", - "[INFO] Number of features in training dataset: 56\n" - ] - } - ], - "source": [ - "print(f\"[INFO] Number of feature importance values: {best_model_feature_importances.shape[0]}\") \n", - "print(f\"[INFO] Number of features in training dataset: {X_train_preprocessed.shape[1]}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can inspect these further by turning them into a DataFrame.\n", - "\n", - "We'll sort it in descending order so we can see which features our model assigns the highest importance." - ] - }, - { - "cell_type": "code", - "execution_count": 145, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "
" - ], - "text/plain": [ - " feature_names feature_importance\n", - "5 YearMade 0.192041\n", - "13 ProductSize 0.160579\n", - "51 saleYear 0.072604\n", - "19 Enclosure 0.063209\n", - "2 ModelID 0.058580" - ] - }, - "execution_count": 145, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create feature importance DataFrame\n", - "column_names = test_df.columns\n", - "feature_importance_df = pd.DataFrame({\"feature_names\": column_names,\n", - " \"feature_importance\": best_model_feature_importances}).sort_values(by=\"feature_importance\",\n", - " ascending=False)\n", - "feature_importance_df.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmmm... looks like `YearMade` may be contributing the most value in the model's eyes.\n", - "\n", - "How about we turn our DataFrame into a plot to compare values?" - ] - }, - { - "cell_type": "code", - "execution_count": 146, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Plot the top feature importance values\n", - "top_n = 20\n", - "plt.figure(figsize=(10, 5))\n", - "plt.barh(y=feature_importance_df[\"feature_names\"][:top_n], # Plot the top_n feature importance values\n", - " width=feature_importance_df[\"feature_importance\"][:top_n])\n", - "plt.title(f\"Top {top_n} Feature Importance Values for Best RandomForestRegressor Model\")\n", - "plt.xlabel(\"Feature importance value\")\n", - "plt.ylabel(\"Feature name\")\n", - "plt.gca().invert_yaxis();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, looks like the top 4 features contributing to our model's predictions are `YearMade`, `ProductSize`, `saleYear` and `Enclosure`.\n", - "\n", - "Referring to the original [data dictionary](https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing), does it make sense that these features contribute the most to the model?\n", - "\n", - "* `YearMade` - Year of manufacture of the machine.\n", - "* `ProductSize` - Size of the bulldozer.\n", - "* `saleYear` - The year the bulldozer was sold (this is one of our engineered features from `saledate`).\n", - "* `Enclosure` - Type of bulldozer enclosure (e.g. OROPS = Open Rollover Protective Structures, EROPS = Enclosed Rollover Protective Structures).\n", - "\n", - "Now I've never sold a bulldozer, but after reading about each of these features, it makes sense that they would contribute significantly to the sale price.\n", - "\n", - "I know when I've bought cars in the past, the year it was made was an important part of my decision. 
\n", - "\n", - "And it also makes sense that `ProductSize` would be an important feature when deciding on the price of a bulldozer.\n", - "\n", - "Let's check out the unique values for `ProductSize` and `Enclosure`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 147, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Unique ProductSize values: ['Medium' nan 'Compact' 'Small' 'Large' 'Large / Medium' 'Mini']\n", - "[INFO] Unique Enclosure values: ['OROPS' 'EROPS' 'EROPS w AC' nan 'EROPS AC' 'NO ROPS'\n", - " 'None or Unspecified']\n" - ] - } - ], - "source": [ - "print(f\"[INFO] Unique ProductSize values: {train_df['ProductSize'].unique()}\")\n", - "print(f\"[INFO] Unique Enclosure values: {train_df['Enclosure'].unique()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "My guess is that a bulldozer with a `ProductSize` of `'Mini'` would sell for less than a bulldozer with a size of `'Large'`.\n", - "\n", - "We could investigate this further as an extension of model-driven exploratory data analysis or we could take this information to a colleague or client to discuss further.\n", - "\n", - "Either way, we've now got a machine learning model capable of predicting the sale price of bulldozers given their features/attributes!\n", - "\n", - "That's a huuuuuuge effort!\n", - "\n", - "And you should be very proud of yourself for making it this far." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "We've covered a lot of ground.\n", - "\n", - "But there are some main takeaways to go over.\n", - "\n", - "* **Every machine learning problem is different** - Since machine learning is such a widespread technology, it can be used for a multitude of different problems. In saying this, there will often be many different ways to approach a problem. In this example, we've focused on predicting a number, which is a regression problem. 
And since our data had a time component, it could also be considered a time series problem.\n", - "* **The machine learner's motto: Experiment, experiment, experiment!** - Since there are many different ways to approach machine learning problems, one of the best habits you can develop is an experimental mindset. That means not being afraid to try new things over and over. Because the more things you try, the quicker you can figure out what doesn't work and the quicker you can start to move towards what does.\n", - "* **Always keep the test set separate** - If you can't evaluate your model on unseen data, how would you know how it will perform in the real world on future unseen data? Of course, using a test set isn't a perfect replica of the real world but if it's done right, it can give you a good idea. Because evaluating a model is just as important as training a model.\n", - "* **If you've trained a model on data in a certain format, you'll have to make predictions in the same format** - Any preprocessing you do to the training dataset, you'll have to do to the validation, test and custom data. Any computed values (e.g. a median used to fill missing values) should be computed on the training set only and then applied to any subsequent datasets." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Exercises\n", - "\n", - "1. Fill the missing values in the numeric columns with the median using Scikit-Learn and see if that helps our best model's performance (hint: see [`sklearn.impute.SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) for more).\n", - "2. Try putting multiple steps together (e.g. preprocessing -> modelling) with Scikit-Learn's [`sklearn.pipeline.Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) features. \n", - "3. Try using another regression model/estimator on our preprocessed dataset and see how it goes. 
See the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/machine_learning_map.html) for potential model options.\n", - "4. Try replacing the `sklearn.preprocessing.OrdinalEncoder` we used for the categorical variables with `sklearn.preprocessing.OneHotEncoder` (you may even want to do this within a pipeline) with the `sklearn.ensemble.RandomForestRegressor` model and see how it performs. Which is better for our specific dataset? " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Extra-curriculum\n", - "\n", - "The following resources are suggested extra reading and activities to add backing to the materials we've covered in this project.\n", - "\n", - "Reading documentation and knowing where to find information is one of the best skills you can develop as an engineer.\n", - "\n", - "* Read the pandas [IO tools documentation page](https://pandas.pydata.org/docs/user_guide/io.html#) for an idea of all the possible ways to get data in and out of pandas.\n", - "* See all of the [available datatypes in the pandas user guide](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes) (knowing what type your data is in can help prevent a lot of future errors).\n", - "* Read the Scikit-Learn [dataset transformations](https://scikit-learn.org/stable/data_transforms.html) and [data preprocessing guide](https://scikit-learn.org/stable/modules/preprocessing.html) for an overview of all the different ways you can preprocess and transform data. 
\n", - "* For more on saving and loading model objects with Scikit-Learn, see the documentation on [model persistence](https://scikit-learn.org/stable/model_persistence.html).\n", - "* For more on the importance of creating good validation and test sets, I'd recommend reading [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) by Rachel Thomas as well as [The importance of a test set](https://www.learnml.io/posts/the-importance-of-a-test-set/) by Daniel Bourke.\n", - "* We've covered a handful of models in the Scikit-Learn library, however, there are some other ML models which are worth exploring such as [CatBoost](https://catboost.ai/) and [XGBoost](https://xgboost.ai/). Both of these models can handle missing values and are often touted as some of the most performant ML models on the market. A good extension would be to try get one of them working on our bulldozer data.\n", - " * Bonus: You can also see a [list of models in Scikit-Learn which can handle missing/NaN values](https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values). " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Example Exercise Solutions\n", - "\n", - "The following are examples of how to solve the above exercises." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. 
Fill the missing values in the numeric columns with the median using Scikit-Learn and see if that helps our best model's performance" - ] - }, - { - "cell_type": "code", - "execution_count": 244, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Number of samples in training DataFrame: 401125\n", - "[INFO] Number of samples in validation DataFrame: 11573\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1951.2971558280735),\n", - " 'Valid MAE': np.float64(5964.025764507629),\n", - " 'Training RMSLE': np.float64(0.101909965049995),\n", - " 'Valid RMSLE': np.float64(0.24697812443315573),\n", - " 'Training R^2': 0.9810825663665007,\n", - " 'Valid R^2': 0.8809697755766817}" - ] - }, - "execution_count": 244, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.impute import SimpleImputer\n", - "from sklearn.preprocessing import OrdinalEncoder\n", - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "\n", - "# Import train samples (making sure to parse dates and then sort by them)\n", - "train_df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/Train.csv\",\n", - " parse_dates=[\"saledate\"],\n", - " low_memory=False).sort_values(by=\"saledate\", ascending=True)\n", - "\n", - "# Import validation samples (making sure to parse dates and then sort by them)\n", - "valid_df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/Valid.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "\n", - "# The ValidSolution.csv contains the SalePrice values for the samples in Valid.csv\n", - "valid_solution = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/ValidSolution.csv\")\n", - "\n", - "# Map valid_solution to valid_df\n", - "valid_df[\"SalePrice\"] = 
valid_df[\"SalesID\"].map(valid_solution.set_index(\"SalesID\")[\"SalePrice\"])\n", - "\n", - "# Make sure valid_df is sorted by saledate still\n", - "valid_df = valid_df.sort_values(\"saledate\", ascending=True).reset_index(drop=True)\n", - "\n", - "# How many samples are in each DataFrame?\n", - "print(f\"[INFO] Number of samples in training DataFrame: {len(train_df)}\")\n", - "print(f\"[INFO] Number of samples in validation DataFrame: {len(valid_df)}\")\n", - "\n", - "# Make a function to add date columns\n", - "def add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - " # Add datetime parameters for saledate\n", - " df[\"saleYear\"] = df[date_column].dt.year\n", - " df[\"saleMonth\"] = df[date_column].dt.month\n", - " df[\"saleDay\"] = df[date_column].dt.day\n", - " df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - " df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - "\n", - " # Drop original saledate column\n", - " df.drop(\"saledate\", axis=1, inplace=True)\n", - "\n", - " return df\n", - "\n", - "# Add datetime features to DataFrames\n", - "train_df = add_datetime_features_to_df(df=train_df)\n", - "valid_df = add_datetime_features_to_df(df=valid_df)\n", - "\n", - "# Split training data into features and labels\n", - "X_train = train_df.drop(\"SalePrice\", axis=1)\n", - "y_train = train_df[\"SalePrice\"]\n", - "\n", - "# Split validation data into features and labels\n", - "X_valid = valid_df.drop(\"SalePrice\", axis=1)\n", - "y_valid = valid_df[\"SalePrice\"]\n", - "\n", - "# Define numerical and categorical features\n", - "numeric_features = [label for label, content in X_train.items() if pd.api.types.is_numeric_dtype(content)]\n", - "categorical_features = [label for label, content in X_train.items() if not pd.api.types.is_numeric_dtype(content)]\n", - "\n", - "### Filling missing values ### \n", - "\n", - "# Create an ordinal encoder (turns category items into numeric representation)\n", - "ordinal_encoder = 
OrdinalEncoder(categories=\"auto\",\n", - " handle_unknown=\"use_encoded_value\",\n", - " unknown_value=np.nan,\n", - " encoded_missing_value=np.nan) # treat unknown categories as np.nan (or None)\n", - "\n", - "# Create a simple imputer to fill missing values with median\n", - "simple_imputer_median = SimpleImputer(missing_values=np.nan,\n", - " strategy=\"median\")\n", - "\n", - "# Fit and transform the categorical and numeric columns of X_train\n", - "X_train_preprocessed = X_train.copy() # make copies of the original DataFrames so we can keep the original values intact and view them later\n", - "X_train_preprocessed[categorical_features] = ordinal_encoder.fit_transform(X_train_preprocessed[categorical_features].astype(str)) # OrdinalEncoder expects all values as the same type (e.g. string or numeric only)\n", - "X_train_preprocessed[numeric_features] = simple_imputer_median.fit_transform(X_train_preprocessed[numeric_features])\n", - "\n", - "# Transform the categorical and numeric columns of X_valid \n", - "X_valid_preprocessed = X_valid.copy()\n", - "X_valid_preprocessed[categorical_features] = ordinal_encoder.transform(X_valid_preprocessed[categorical_features].astype(str)) # only use `transform` on the validation data\n", - "X_valid_preprocessed[numeric_features] = simple_imputer_median.transform(X_valid_preprocessed[numeric_features])\n", - "\n", - "# Create function to evaluate our model\n", - "def show_scores(model, \n", - " train_features=X_train_preprocessed,\n", - " train_labels=y_train,\n", - " valid_features=X_valid_preprocessed,\n", - " valid_labels=y_valid):\n", - " \n", - " # Make predictions on train and validation features\n", - " train_preds = model.predict(X=train_features)\n", - " val_preds = model.predict(X=valid_features)\n", - "\n", - " # Create a scores dictionary of different evaluation metrics\n", - " scores = {\"Training MAE\": mean_absolute_error(y_true=train_labels, \n", - " y_pred=train_preds),\n", - " \"Valid MAE\": 
mean_absolute_error(y_true=valid_labels, \n", - " y_pred=val_preds),\n", - " \"Training RMSLE\": root_mean_squared_log_error(y_true=train_labels, \n", - " y_pred=train_preds),\n", - " \"Valid RMSLE\": root_mean_squared_log_error(y_true=valid_labels, \n", - " y_pred=val_preds),\n", - " \"Training R^2\": model.score(X=train_features, \n", - " y=train_labels),\n", - " \"Valid R^2\": model.score(X=valid_features, \n", - " y=valid_labels)}\n", - " return scores\n", - "\n", - "# Instantiate a model with best hyperparameters \n", - "ideal_model_2 = RandomForestRegressor(n_estimators=90,\n", - " max_depth=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=5,\n", - " max_features=0.5,\n", - " n_jobs=-1,\n", - " max_samples=None)\n", - "\n", - "# Fit a model to the preprocessed data\n", - "ideal_model_2.fit(X=X_train_preprocessed, \n", - " y=y_train)\n", - "\n", - "# Evaluate the model\n", - "ideal_model_2_scores = show_scores(model=ideal_model_2)\n", - "ideal_model_2_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Looks like filling the missing numeric values made our `ideal_model_2` perform slightly worse than our original `ideal_model`.\n", - "\n", - "`ideal_model_2` had a validation RMSLE of `0.24697812443315573` whereas `ideal_model` had a validation RMSLE of `0.24654150224930685` (a difference of roughly 0.0004)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2. Try putting multiple steps together (e.g. 
preprocessing -> modelling) with Scikit-Learn's `sklearn.pipeline.Pipeline`" - ] - }, - { - "cell_type": "code", - "execution_count": 247, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Pipeline Scores:\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(1951.4776197781914),\n", - " 'Valid MAE': np.float64(5974.931566226864),\n", - " 'Training RMSLE': np.float64(0.10196097739473307),\n", - " 'Valid RMSLE': np.float64(0.24760612684722114),\n", - " 'Training R^2': 0.9811027965058758,\n", - " 'Valid R^2': 0.8807288353268701}" - ] - }, - "execution_count": 247, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer\n", - "from sklearn.pipeline import Pipeline\n", - "\n", - "\n", - "# Import and prepare data\n", - "train_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Train.csv\",\n", - " parse_dates=[\"saledate\"],\n", - " low_memory=False).sort_values(by=\"saledate\", ascending=True)\n", - "\n", - "valid_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Valid.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "\n", - "valid_solution = pd.read_csv(\"../data/bluebook-for-bulldozers/ValidSolution.csv\")\n", - "valid_df[\"SalePrice\"] = valid_df[\"SalesID\"].map(valid_solution.set_index(\"SalesID\")[\"SalePrice\"])\n", - "valid_df = valid_df.sort_values(\"saledate\", ascending=True).reset_index(drop=True)\n", - "\n", - "# Add datetime features\n", - "def add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - " df = df.copy()\n", - " df[\"saleYear\"] = df[date_column].dt.year\n", - " 
df[\"saleMonth\"] = df[date_column].dt.month\n", - " df[\"saleDay\"] = df[date_column].dt.day\n", - " df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - " df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - " return df.drop(date_column, axis=1)\n", - "\n", - "# Apply datetime features\n", - "train_df = add_datetime_features_to_df(train_df)\n", - "valid_df = add_datetime_features_to_df(valid_df)\n", - "\n", - "# Split data into features and labels\n", - "X_train = train_df.drop(\"SalePrice\", axis=1)\n", - "y_train = train_df[\"SalePrice\"]\n", - "X_valid = valid_df.drop(\"SalePrice\", axis=1)\n", - "y_valid = valid_df[\"SalePrice\"]\n", - "\n", - "# Define feature types\n", - "numeric_features = [label for label, content in X_train.items() \n", - " if pd.api.types.is_numeric_dtype(content)]\n", - "categorical_features = [label for label, content in X_train.items() \n", - " if not pd.api.types.is_numeric_dtype(content)]\n", - "\n", - "# Create preprocessing steps\n", - "numeric_transformer = Pipeline(steps=[\n", - " ('imputer', SimpleImputer(strategy='median'))\n", - "])\n", - "\n", - "categorical_transformer = Pipeline(steps=[\n", - " ('string_converter', FunctionTransformer(lambda x: x.astype(str))), # convert values to string\n", - " ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',\n", - " unknown_value=np.nan,\n", - " encoded_missing_value=np.nan)),\n", - "])\n", - "\n", - "# Create preprocessor using ColumnTransformer\n", - "preprocessor = ColumnTransformer(\n", - " transformers=[\n", - " ('numerical_transforms', numeric_transformer, numeric_features),\n", - " ('categorical_transforms', categorical_transformer, categorical_features)\n", - " ])\n", - "\n", - "# Create full pipeline\n", - "model_pipeline = Pipeline([\n", - " ('preprocessor', preprocessor),\n", - " ('regressor', RandomForestRegressor(\n", - " n_estimators=90,\n", - " max_depth=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=5,\n", - " max_features=0.5,\n", 
- " n_jobs=-1,\n", - " max_samples=None\n", - " ))\n", - "])\n", - "\n", - "# Function to evaluate the pipeline\n", - "def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):\n", - " # Make predictions\n", - " train_preds = pipeline.predict(X_train)\n", - " valid_preds = pipeline.predict(X_valid)\n", - " \n", - " # Calculate scores\n", - " scores = {\n", - " \"Training MAE\": mean_absolute_error(y_train, train_preds),\n", - " \"Valid MAE\": mean_absolute_error(y_valid, valid_preds),\n", - " \"Training RMSLE\": root_mean_squared_log_error(y_train, train_preds),\n", - " \"Valid RMSLE\": root_mean_squared_log_error(y_valid, valid_preds),\n", - " \"Training R^2\": pipeline.score(X_train, y_train),\n", - " \"Valid R^2\": pipeline.score(X_valid, y_valid)\n", - " }\n", - " return scores\n", - "\n", - "# Fit and evaluate pipeline\n", - "model_pipeline.fit(X_train, y_train)\n", - "pipeline_scores = evaluate_pipeline(model_pipeline, X_train, y_train, X_valid, y_valid)\n", - "print(\"\\nPipeline Scores:\")\n", - "pipeline_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Try using another regression model/estimator on our preprocessed dataset and see how it goes\n", - "\n", - "We'll use [`sklearn.ensemble.HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#histgradientboostingregressor)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 148, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Fitting HistGradientBoostingRegressor model with pipeline...\n", - "[INFO] Evaluating HistGradientBoostingRegressor model with pipeline...\n", - "\n", - "Pipeline HistGradientBoostingRegressor Scores:\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(5638.6121797753785),\n", - " 'Valid MAE': np.float64(7264.258786098576),\n", - " 'Training RMSLE': np.float64(0.2691456681483351),\n", - " 'Valid RMSLE': np.float64(0.30482586120872424),\n", - " 'Training R^2': 0.8646511348082063,\n", - " 'Valid R^2': 0.8319021596407035}" - ] - }, - "execution_count": 148, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "import numpy as np\n", - "\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.ensemble import HistGradientBoostingRegressor\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "from sklearn.preprocessing import OrdinalEncoder, FunctionTransformer, StandardScaler\n", - "from sklearn.pipeline import Pipeline\n", - "\n", - "# Import and prepare data\n", - "train_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Train.csv\",\n", - " parse_dates=[\"saledate\"],\n", - " low_memory=False).sort_values(by=\"saledate\", ascending=True)\n", - "\n", - "valid_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Valid.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "\n", - "valid_solution = pd.read_csv(\"../data/bluebook-for-bulldozers/ValidSolution.csv\")\n", - "valid_df[\"SalePrice\"] = valid_df[\"SalesID\"].map(valid_solution.set_index(\"SalesID\")[\"SalePrice\"])\n", - "valid_df = valid_df.sort_values(\"saledate\", ascending=True).reset_index(drop=True)\n", - "\n", - "# Add datetime features\n", - "def 
add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - " df = df.copy()\n", - " df[\"saleYear\"] = df[date_column].dt.year\n", - " df[\"saleMonth\"] = df[date_column].dt.month\n", - " df[\"saleDay\"] = df[date_column].dt.day\n", - " df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - " df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - " return df.drop(date_column, axis=1)\n", - "\n", - "# Apply datetime features\n", - "train_df = add_datetime_features_to_df(train_df)\n", - "valid_df = add_datetime_features_to_df(valid_df)\n", - "\n", - "# Split data into features and labels\n", - "X_train = train_df.drop(\"SalePrice\", axis=1)\n", - "y_train = train_df[\"SalePrice\"]\n", - "X_valid = valid_df.drop(\"SalePrice\", axis=1)\n", - "y_valid = valid_df[\"SalePrice\"]\n", - "\n", - "# Define feature types\n", - "numeric_features = [label for label, content in X_train.items() \n", - " if pd.api.types.is_numeric_dtype(content)]\n", - "categorical_features = [label for label, content in X_train.items() \n", - " if not pd.api.types.is_numeric_dtype(content)]\n", - "\n", - "# Create preprocessing steps for different types of values\n", - "numeric_transformer = Pipeline(steps=[\n", - " ('imputer', SimpleImputer(strategy='median')),\n", - "])\n", - "\n", - "categorical_transformer = Pipeline(steps=[\n", - " ('string_converter', FunctionTransformer(lambda x: x.astype(str))), # convert values to string\n", - " ('ordinal', OrdinalEncoder(categories='auto',\n", - " handle_unknown='use_encoded_value',\n", - " unknown_value=np.nan,\n", - " encoded_missing_value=np.nan)), \n", - "])\n", - "\n", - "# Create preprocessor using ColumnTransformer\n", - "preprocessor = ColumnTransformer(\n", - " transformers=[\n", - " ('numerical_transforms', numeric_transformer, numeric_features),\n", - " ('categorical_transforms', categorical_transformer, categorical_features)\n", - " ])\n", - "\n", - "# Create full pipeline\n", - "model_pipeline_hist_gradient_boosting_regressor = 
Pipeline([\n", - " ('preprocessor', preprocessor),\n", - " ('regressor', HistGradientBoostingRegressor()) # Change model to HistGradientBoostingRegressor\n", - "])\n", - "\n", - "# Function to evaluate the pipeline\n", - "def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):\n", - " # Make predictions\n", - " train_preds = pipeline.predict(X_train)\n", - " valid_preds = pipeline.predict(X_valid)\n", - " \n", - " # Calculate scores\n", - " scores = {\n", - " \"Training MAE\": mean_absolute_error(y_train, train_preds),\n", - " \"Valid MAE\": mean_absolute_error(y_valid, valid_preds),\n", - " \"Training RMSLE\": root_mean_squared_log_error(y_train, train_preds),\n", - " \"Valid RMSLE\": root_mean_squared_log_error(y_valid, valid_preds),\n", - " \"Training R^2\": pipeline.score(X_train, y_train),\n", - " \"Valid R^2\": pipeline.score(X_valid, y_valid)\n", - " }\n", - " return scores\n", - "\n", - "# Fit and evaluate pipeline\n", - "print(f\"[INFO] Fitting HistGradientBoostingRegressor model with pipeline...\")\n", - "model_pipeline_hist_gradient_boosting_regressor.fit(X_train, y_train)\n", - "print(f\"[INFO] Evaluating HistGradientBoostingRegressor model with pipeline...\")\n", - "pipeline_hist_scores = evaluate_pipeline(model_pipeline_hist_gradient_boosting_regressor, X_train, y_train, X_valid, y_valid)\n", - "print(\"\\nPipeline HistGradientBoostingRegressor Scores:\")\n", - "pipeline_hist_scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4. Try replacing the `sklearn.preprocessing.OrdinalEncoder` we used for the categorical variables with `sklearn.preprocessing.OneHotEncoder`\n", - "\n", - "> **Note:** This may take quite a long time depending on your machine. For example, on my MacBook Pro M1 Pro it took ~10 minutes with `n_estimators=10` (9x lower than what we used for our `best_model`). 
This is because using `sklearn.preprocessing.OneHotEncoder` adds many more features to our dataset (each feature gets turned into an array of 0's and 1's for each unique value). And the more features, the longer it takes to compute and find patterns between them." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Fitting model with one hot encoded values...\n", - "[INFO] Evaluating model with one hot encoded values...\n", - "[INFO] Pipeline with one hot encoding scores:\n", - "CPU times: user 29min, sys: 23min 12s, total: 52min 13s\n", - "Wall time: 9min 14s\n" - ] - }, - { - "data": { - "text/plain": [ - "{'Training MAE': np.float64(2133.748251811842),\n", - " 'Valid MAE': np.float64(6176.810802667383),\n", - " 'Training RMSLE': np.float64(0.11021214524792695),\n", - " 'Valid RMSLE': np.float64(0.2539881442090813),\n", - " 'Training R^2': 0.9759312990258391,\n", - " 'Valid R^2': 0.870741470996933}" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "import pandas as pd\n", - "import numpy as np\n", - "from sklearn.impute import SimpleImputer\n", - "from sklearn.preprocessing import OneHotEncoder\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "from sklearn.metrics import mean_absolute_error, root_mean_squared_log_error\n", - "from sklearn.pipeline import Pipeline\n", - "from sklearn.compose import ColumnTransformer\n", - "from sklearn.preprocessing import FunctionTransformer\n", - "\n", - "# Import and prepare data\n", - "train_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Train.csv\",\n", - " parse_dates=[\"saledate\"],\n", - " low_memory=False).sort_values(by=\"saledate\", ascending=True)\n", - "\n", - "valid_df = pd.read_csv(\"../data/bluebook-for-bulldozers/Valid.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "\n", - "valid_solution = 
pd.read_csv(\"../data/bluebook-for-bulldozers/ValidSolution.csv\")\n", - "valid_df[\"SalePrice\"] = valid_df[\"SalesID\"].map(valid_solution.set_index(\"SalesID\")[\"SalePrice\"])\n", - "valid_df = valid_df.sort_values(\"saledate\", ascending=True).reset_index(drop=True)\n", - "\n", - "# Add datetime features\n", - "def add_datetime_features_to_df(df, date_column=\"saledate\"):\n", - " df = df.copy()\n", - " df[\"saleYear\"] = df[date_column].dt.year\n", - " df[\"saleMonth\"] = df[date_column].dt.month\n", - " df[\"saleDay\"] = df[date_column].dt.day\n", - " df[\"saleDayofweek\"] = df[date_column].dt.dayofweek\n", - " df[\"saleDayofyear\"] = df[date_column].dt.dayofyear\n", - " return df.drop(date_column, axis=1)\n", - "\n", - "# Apply datetime features\n", - "train_df = add_datetime_features_to_df(train_df)\n", - "valid_df = add_datetime_features_to_df(valid_df)\n", - "\n", - "# Split data\n", - "X_train = train_df.drop(\"SalePrice\", axis=1)\n", - "y_train = train_df[\"SalePrice\"]\n", - "X_valid = valid_df.drop(\"SalePrice\", axis=1)\n", - "y_valid = valid_df[\"SalePrice\"]\n", - "\n", - "# Define feature types\n", - "numeric_features = [label for label, content in X_train.items() \n", - " if pd.api.types.is_numeric_dtype(content)]\n", - "categorical_features = [label for label, content in X_train.items() \n", - " if not pd.api.types.is_numeric_dtype(content)]\n", - "\n", - "# Create preprocessing steps\n", - "numeric_transformer = Pipeline(steps=[\n", - " ('imputer', SimpleImputer(strategy='median'))\n", - "])\n", - "\n", - "categorical_transformer = Pipeline(steps=[\n", - " ('string_converter', FunctionTransformer(lambda x: x.astype(str))),\n", - " ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), # fill missing values with the term \"missing\"\n", - " ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True)) # use OneHotEncoder instead of OrdinalEncoder\n", - "])\n", - "\n", - "# Create preprocessor using 
ColumnTransformer\n", - "preprocessor = ColumnTransformer(\n", - " transformers=[\n", - " ('num', numeric_transformer, numeric_features),\n", - " ('cat', categorical_transformer, categorical_features)\n", - " ],\n", - " verbose_feature_names_out=False # Simplify feature names\n", - ")\n", - "\n", - "# Create full pipeline\n", - "model_one_hot_pipeline = Pipeline([\n", - " ('preprocessor', preprocessor),\n", - " ('regressor', RandomForestRegressor(\n", - " n_estimators=10,\n", - " max_depth=None,\n", - " min_samples_leaf=1,\n", - " min_samples_split=5,\n", - " max_features=0.5,\n", - " n_jobs=-1,\n", - " max_samples=None\n", - " ))\n", - "])\n", - "\n", - "# Function to evaluate the pipeline\n", - "def evaluate_pipeline(pipeline, X_train, y_train, X_valid, y_valid):\n", - " # Make predictions\n", - " train_preds = pipeline.predict(X_train)\n", - " valid_preds = pipeline.predict(X_valid)\n", - " \n", - " # Calculate scores\n", - " scores = {\n", - " \"Training MAE\": mean_absolute_error(y_train, train_preds),\n", - " \"Valid MAE\": mean_absolute_error(y_valid, valid_preds),\n", - " \"Training RMSLE\": root_mean_squared_log_error(y_train, train_preds),\n", - " \"Valid RMSLE\": root_mean_squared_log_error(y_valid, valid_preds),\n", - " \"Training R^2\": pipeline.score(X_train, y_train),\n", - " \"Valid R^2\": pipeline.score(X_valid, y_valid)\n", - " }\n", - " return scores\n", - "\n", - "# Fit and evaluate pipeline\n", - "print(f\"[INFO] Fitting model with one hot encoded values...\")\n", - "model_one_hot_pipeline.fit(X_train, y_train)\n", - "print(f\"[INFO] Evaluating model with one hot encoded values...\")\n", - "pipeline_one_hot_scores = evaluate_pipeline(model_one_hot_pipeline, X_train, y_train, X_valid, y_valid)\n", - "print(\"[INFO] Pipeline with one hot encoding scores:\")\n", - "pipeline_one_hot_scores" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Next:\n", - "# Go through TK's" - ] - } - 
], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression.ipynb b/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression.ipynb deleted file mode 100644 index 11f287777..000000000 --- a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression.ipynb +++ /dev/null @@ -1,7331 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "TK - add Google Colab link as well as reference notebook etc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning \n", - "\n", - "In this notebook, we're going to go through an example machine learning project to use the characteristics of bulldozers and their past sales prices to predict the sale price of future bulldozers based on their characteristics.\n", - "\n", - "* **Inputs:** Bulldozer characteristics such as make year, base model, model series, state of sale (e.g. 
which US state was it sold in), drive system and more.\n", - "* **Outputs:** Bulldozer sale price (in USD).\n", - "\n", - "Since we're trying to predict a number, this kind of problem is known as a **regression problem**.\n", - "\n", - "And since we're going to be predicting results with a time component (predicting future sales based on past sales), this is also known as a **time series** or **forecasting** problem.\n", - "\n", - "The data and evaluation metric we'll be using (root mean squared log error or RMSLE) are from the [Kaggle Bluebook for Bulldozers competition](https://www.kaggle.com/c/bluebook-for-bulldozers/overview).\n", - "\n", - "The techniques used here have been inspired by and adapted from [the fast.ai machine learning course](https://course18.fast.ai/ml)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Overview\n", - "\n", - "Since we already have a dataset, we'll approach the problem with the following machine learning modelling framework.\n", - "\n", - "| | \n", - "|:--:| \n", - "| 6 Step Machine Learning Modelling Framework ([read more](https://whimsical.com/9g65jgoRYTxMXxDosndYTB)) |\n", - "\n", - "To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.\n", - "\n", - "| | \n", - "|:--:| \n", - "| Tools that can be used for each step of the machine learning modelling process. |\n", - "\n", - "We'll work through each step and by the end of the notebook, we'll have a trained machine learning model which predicts the sale price of a bulldozer given different characteristics about it." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 6 Step Machine Learning Framework\n", - "\n", - "#### 1. 
Problem Definition\n", - "\n", - "For this dataset, the problem we're trying to solve, or better, the question we're trying to answer is,\n", - "\n", - "> How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have been sold for?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 2. Data\n", - "\n", - "Looking at the [dataset from Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data), you can see it's a time series problem. This means there's a time attribute to the dataset.\n", - "\n", - "In this case, it's historical sales data of bulldozers, including things like model type, size, sale date and more.\n", - "\n", - "There are 3 datasets:\n", - "\n", - "1. **Train.csv** - Historical bulldozer sales examples up to 2011 (close to 400,000 examples with 50+ different attributes, including `SalePrice` which is the **target variable**).\n", - "2. **Valid.csv** - Historical bulldozer sales examples from January 1 2012 to April 30 2012 (close to 12,000 examples with the same attributes as **Train.csv**).\n", - "3. **Test.csv** - Historical bulldozer sales examples from May 1 2012 to November 2012 (close to 12,000 examples but missing the `SalePrice` attribute, as this is what we'll be trying to predict).\n", - "\n", - "> **Note:** You can download the `bluebook-for-bulldozers` dataset directly from Kaggle. Alternatively, you can [download it directly from the course GitHub](https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 3. Evaluation\n", - "\n", - "For this problem, [Kaggle has set the evaluation metric to be root mean squared log error (RMSLE)](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation). 
As with many regression evaluations, the goal will be to get this value as low as possible.\n", - "\n", - "To see how well our model is doing, we'll calculate the RMSLE and then compare our results to others on the [Kaggle leaderboard](https://www.kaggle.com/c/bluebook-for-bulldozers/leaderboard)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### 4. Features\n", - "\n", - "Features are different parts of the data. During this step, you'll want to start finding out what you can about the data.\n", - "\n", - "One of the most common ways to do this is to create a **data dictionary**.\n", - "\n", - "For this dataset, Kaggle provides a data dictionary which contains information about what each attribute of the dataset means. \n", - "\n", - "For example: \n", - "\n", - "| Variable Name | Description | Variable Type |\n", - "|------|-----|-----|\n", - "| SalesID | unique identifier of a particular sale of a machine at auction | Independent variable |\n", - "| MachineID | identifier for a particular machine; machines may have multiple sales | Independent variable |\n", - "| ModelID | identifier for a unique machine model (i.e. fiModelDesc) | Independent variable |\n", - "| datasource | source of the sale record; some sources are more diligent about reporting attributes of the machine than others. Note that a particular datasource may report on multiple auctioneerIDs. | Independent variable |\n", - "| auctioneerID | identifier of a particular auctioneer, i.e. company that sold the machine at auction. Not the same as datasource. 
| Independent variable |\n", - "| YearMade | year of manufacturer of the Machine | Independent variable |\n", - "| MachineHoursCurrentMeter | current usage of the machine in hours at time of sale (saledate); null or 0 means no hours have been reported for that sale | Independent variable |\n", - "| UsageBand | value (low, medium, high) calculated comparing this particular Machine-Sale hours to average usage for the fiBaseModel; e.g. 'Low' means this machine has fewer hours given its lifespan relative to the average of fiBaseModel. | Independent variable |\n", - "| Saledate | time of sale | Independent variable |\n", - "| fiModelDesc | Description of a unique machine model (see ModelID); concatenation of fiBaseModel & fiSecondaryDesc & fiModelSeries & fiModelDescriptor | Independent variable |\n", - "| State | US State in which sale occurred | Independent variable |\n", - "| Drive_System | machine configuration; typically describes whether 2 or 4 wheel drive | Independent variable |\n", - "| Enclosure | machine configuration - does the machine have an enclosed cab or not | Independent variable |\n", - "| Forks | machine configuration - attachment used for lifting | Independent variable |\n", - "| Pad_Type | machine configuration - type of treads a crawler machine uses | Independent variable |\n", - "| Ride_Control | machine configuration - optional feature on loaders to make the ride smoother | Independent variable |\n", - "| Transmission | machine configuration - describes type of transmission; typically automatic or manual | Independent variable |\n", - "| ... | ... | ... 
|\n", - "| SalePrice | cost of sale in USD | Target/dependent variable | \n", - "\n", - "\n", - "You can download the full version of this file directly from the [Kaggle competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/download/Bnl6RAHA0enbg0UfAvGA%2Fversions%2FwBG4f35Q8mAbfkzwCeZn%2Ffiles%2FData%20Dictionary.xlsx) (account required) or view it [on Google Sheets](https://docs.google.com/spreadsheets/d/18ly-bLR8sbDJLITkWG7ozKm8l3RyieQ2Fpgix-beSYI/edit?usp=sharing).\n", - "\n", - "With all of this being known, let's get started! \n", - "\n", - "First, we'll import the dataset and start exploring. Since we know the evaluation metric we're trying to minimise, our first goal will be building a baseline model and seeing how it stacks up against the competition." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Notebook last run (end-to-end): 2024-10-21 14:31:32.217669\n" - ] - } - ], - "source": [ - "# Timestamp\n", - "import datetime\n", - "\n", - "print(f\"Notebook last run (end-to-end): {datetime.datetime.now()}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Importing the data and preparing it for modelling\n", - "\n", - "First things first, let's get the libraries we need imported and the data we'll need for the project.\n", - "\n", - "We'll start by importing pandas, NumPy and matplotlib." 
- ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "pandas version: 2.2.2\n", - "NumPy version: 2.1.1\n", - "matplotlib version: 3.9.2\n" - ] - } - ], - "source": [ - "# Import data analysis tools \n", - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib\n", - "import matplotlib.pyplot as plt\n", - "\n", - "# Print the versions we're using (as long as your versions are equal or higher than these, the code should work)\n", - "print(f\"pandas version: {pd.__version__}\")\n", - "print(f\"NumPy version: {np.__version__}\")\n", - "print(f\"matplotlib version: {matplotlib.__version__}\") " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we've got our tools for data analysis ready, we can import the data and start to explore it.\n", - "\n", - "For this project, I've [downloaded the data from Kaggle](https://www.kaggle.com/c/bluebook-for-bulldozers/data) and stored it on the [course GitHub](https://github.com/mrdbourke/zero-to-mastery-ml/) under the file path [`../data/bluebook-for-bulldozers`](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/data/bluebook-for-bulldozers.zip).\n", - "\n", - "We can write some code to check if the files are available locally (on our computer) and if not, we can download them.\n", - "\n", - "> **Note:** If you're running this notebook on Google Colab, the code below will enable you to download the dataset programmatically. Just beware that each time Google Colab shuts down, the data will have to be redownloaded. There's also an [example Google Colab notebook](https://colab.research.google.com/drive/1hf1rTcCAQP1EN8pZ0ZIqjjEy47dwzbiv?usp=sharing) showing how to download the data programmatically." 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!\n", - "[INFO] Current dataset dir: ../data/bluebook-for-bulldozers\n" - ] - } - ], - "source": [ - "from pathlib import Path\n", - "\n", - "# Check if 'bluebook-for-bulldozers' exists in the current or parent directory\n", - "# Link to data (see the file \"bluebook-for-bulldozers\"): https://github.com/mrdbourke/zero-to-mastery-ml/tree/master/data\n", - "dataset_dir = Path(\"../data/bluebook-for-bulldozers\")\n", - "if not (dataset_dir.is_dir()):\n", - " print(f\"[INFO] Can't find existing 'bluebook-for-bulldozers' dataset in current directory or parent directory, downloading...\")\n", - "\n", - " # Download and unzip the bluebook for bulldozers dataset\n", - " !wget https://github.com/mrdbourke/zero-to-mastery-ml/raw/refs/heads/master/data/bluebook-for-bulldozers.zip\n", - " !unzip bluebook-for-bulldozers.zip\n", - "\n", - " # Ensure a data directory exists and move the downloaded dataset there\n", - " !mkdir ../data/\n", - " !mv bluebook-for-bulldozers ../data/\n", - " print(f\"[INFO] Current dataset dir: {dataset_dir}\")\n", - "\n", - " # Remove .zip file from notebook directory\n", - " !rm -rf bluebook-for-bulldozers.zip\n", - "else:\n", - " # If the target dataset directory exists, we don't need to download it\n", - " print(f\"[INFO] 'bluebook-for-bulldozers' dataset exists, feel free to proceed!\")\n", - " print(f\"[INFO] Current dataset dir: {dataset_dir}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Dataset downloaded!\n", - "\n", - "Let's check what files are available." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Files/folders available in ../data/bluebook-for-bulldozers:\n" - ] - }, - { - "data": { - "text/plain": [ - "['random_forest_benchmark_test.csv',\n", - " 'Valid.csv',\n", - " 'median_benchmark.csv',\n", - " 'Valid.zip',\n", - " 'TrainAndValid.7z',\n", - " 'Test.csv',\n", - " 'Train.7z',\n", - " 'test_predictions.csv',\n", - " 'ValidSolution.csv',\n", - " 'train_tmp.csv',\n", - " 'Machine_Appendix.csv',\n", - " 'Train.csv',\n", - " 'Valid.7z',\n", - " 'Data Dictionary.xlsx',\n", - " 'TrainAndValid.csv',\n", - " 'Train.zip',\n", - " 'TrainAndValid.zip']" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import os\n", - "\n", - "print(f\"[INFO] Files/folders available in {dataset_dir}:\")\n", - "os.listdir(dataset_dir)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can explore each of these files individually or read about them on the [Kaggle Competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/data).\n", - "\n", - "For now, the main file we're interested in is `TrainAndValid.csv`, a combination of the training (`Train.csv`) and validation (`Valid.csv`) datasets.\n", - "\n", - "* The training data (`Train.csv`) contains sale data from 1989 up to the end of 2011.\n", - "* The validation data (`Valid.csv`) contains sale data from January 1, 2012 - April 30, 2012.\n", - "* The test data (`Test.csv`) contains sale data from May 1, 2012 - November 2012.\n", - "\n", - "We'll use the training data to train our model to predict the sale price of bulldozers, then validate its performance on the validation data to see if our model can be improved in any way. 
Finally, we'll evaluate our best model on the test dataset.\n", - "\n", - "But more on this later on.\n", - "\n", - "Let's import the `TrainAndValid.csv` file and turn it into a pandas DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_21543/1127193594.py:2: DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False.\n", - " df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")\n" - ] - } - ], - "source": [ - "# Import the training and validation set\n", - "df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful! We've got our DataFrame ready to explore.\n", - "\n", - "You might see a warning appear in the form:\n", - "\n", - "`DtypeWarning: Columns (13,39,40,41) have mixed types. Specify dtype option on import or set low_memory=False. df = pd.read_csv(\"../data/bluebook-for-bulldozers/TrainAndValid.csv\")`\n", - "\n", - "This is just saying that some of our columns have multiple/mixed data types. For example, a column may contain strings but also contain integers. 
This is okay for now and can be addressed later on if necessary.\n", - "\n", - "How about we get some information about our DataFrame?\n" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 53 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 saledate 412698 non-null object \n", - " 10 fiModelDesc 412698 non-null object \n", - " 11 fiBaseModel 412698 non-null object \n", - " 12 fiSecondaryDesc 271971 non-null object \n", - " 13 fiModelSeries 58667 non-null object \n", - " 14 fiModelDescriptor 74816 non-null object \n", - " 15 ProductSize 196093 non-null object \n", - " 16 fiProductClassDesc 412698 non-null object \n", - " 17 state 412698 non-null object \n", - " 18 ProductGroup 412698 non-null object \n", - " 19 ProductGroupDesc 412698 non-null object \n", - " 20 Drive_System 107087 non-null object \n", - " 21 Enclosure 412364 non-null object \n", - " 22 Forks 197715 non-null object \n", - " 23 Pad_Type 81096 non-null object \n", - " 24 Ride_Control 152728 non-null object \n", - " 25 Stick 81096 non-null object \n", - " 26 Transmission 188007 non-null object \n", - " 27 Turbocharged 81096 non-null object \n", - " 28 Blade_Extension 25983 non-null object \n", - " 29 Blade_Width 25983 non-null object \n", - " 30 Enclosure_Type 25983 non-null object \n", - " 31 Engine_Horsepower 25983 non-null object \n", - " 32 Hydraulics 330133 non-null object 
\n", - " 33 Pushblock 25983 non-null object \n", - " 34 Ripper 106945 non-null object \n", - " 35 Scarifier 25994 non-null object \n", - " 36 Tip_Control 25983 non-null object \n", - " 37 Tire_Size 97638 non-null object \n", - " 38 Coupler 220679 non-null object \n", - " 39 Coupler_System 44974 non-null object \n", - " 40 Grouser_Tracks 44875 non-null object \n", - " 41 Hydraulics_Flow 44875 non-null object \n", - " 42 Track_Type 102193 non-null object \n", - " 43 Undercarriage_Pad_Width 102916 non-null object \n", - " 44 Stick_Length 102261 non-null object \n", - " 45 Thumb 102332 non-null object \n", - " 46 Pattern_Changer 102261 non-null object \n", - " 47 Grouser_Type 102193 non-null object \n", - " 48 Backhoe_Mounting 80712 non-null object \n", - " 49 Blade_Type 81875 non-null object \n", - " 50 Travel_Controls 81877 non-null object \n", - " 51 Differential_Type 71564 non-null object \n", - " 52 Steering_Controls 71522 non-null object \n", - "dtypes: float64(3), int64(5), object(45)\n", - "memory usage: 166.9+ MB\n" - ] - } - ], - "source": [ - "# Get info about DataFrame\n", - "df.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! 
Over 400,000 entries!\n", - "\n", - "That's a much larger dataset than what we've worked with before.\n", - "\n", - "One thing you might have noticed is that the `saledate` column value is being treated as a Python object (it's okay if you didn't notice, these things take practice).\n", - "\n", - "When the `Dtype` is `object`, pandas is storing the values as generic Python objects, most often strings.\n", - "\n", - "However, when we look at it...\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 11/16/2006 0:00\n", - "1 3/26/2004 0:00\n", - "2 2/26/2004 0:00\n", - "3 5/19/2011 0:00\n", - "4 7/23/2009 0:00\n", - "5 12/18/2008 0:00\n", - "6 8/26/2004 0:00\n", - "7 11/17/2005 0:00\n", - "8 8/27/2009 0:00\n", - "9 8/9/2007 0:00\n", - "Name: saledate, dtype: object" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df[\"saledate\"][:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can see that these `object` values are in the form of dates.\n", - "\n", - "Since we're working on a **time series** problem (a machine learning problem with a time component), it's probably worth it to turn these strings into Python `datetime` objects.\n", - "\n", - "Before we do, let's try to visualize our `saledate` column against our `SalePrice` column.\n", - "\n", - "To do so, we can create a scatter plot.\n", - "\n", - "And to prevent our plot from being too big, how about we visualize the first 1000 values?" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df[\"saledate\"][:1000], # visualize the first 1000 values\n", - " y=df[\"SalePrice\"][:1000])\n", - "ax.set_xlabel(\"Sale Date\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... looks like the x-axis is quite crowded.\n", - "\n", - "Maybe we can fix this by turning the `saledate` column into `datetime` format.\n", - "\n", - "The good news is that it looks like our `SalePrice` column is already in `float64` format so we can view its distribution directly from the DataFrame using a histogram plot." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# View SalePrice distribution \n", - "df.SalePrice.plot.hist(xlabel=\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.1 Parsing dates\n", - "\n", - "When working with time series data, it's a good idea to make sure any date data is in the format of a [datetime object](https://docs.python.org/3/library/datetime.html) (a Python data type which encodes specific information about dates).\n", - "\n", - "We can tell pandas which columns to read in as dates by setting the `parse_dates` parameter in [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).\n", - "\n", - "Once we've imported our CSV with the `saledate` column parsed, we can view information about our DataFrame again with `df.info()`. " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 53 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64 \n", - " 8 UsageBand 73670 non-null object \n", - " 9 saledate 412698 non-null datetime64[ns]\n", - " 10 fiModelDesc 412698 non-null object \n", - " 11 fiBaseModel 412698 non-null object \n", - " 12 fiSecondaryDesc 271971 non-null object \n", - " 13 fiModelSeries 58667 non-null object \n", - " 14 fiModelDescriptor 74816 non-null object \n", - " 15 ProductSize 196093 non-null object \n", - " 16 fiProductClassDesc 412698 non-null object \n", 
- " 17 state 412698 non-null object \n", - " 18 ProductGroup 412698 non-null object \n", - " 19 ProductGroupDesc 412698 non-null object \n", - " 20 Drive_System 107087 non-null object \n", - " 21 Enclosure 412364 non-null object \n", - " 22 Forks 197715 non-null object \n", - " 23 Pad_Type 81096 non-null object \n", - " 24 Ride_Control 152728 non-null object \n", - " 25 Stick 81096 non-null object \n", - " 26 Transmission 188007 non-null object \n", - " 27 Turbocharged 81096 non-null object \n", - " 28 Blade_Extension 25983 non-null object \n", - " 29 Blade_Width 25983 non-null object \n", - " 30 Enclosure_Type 25983 non-null object \n", - " 31 Engine_Horsepower 25983 non-null object \n", - " 32 Hydraulics 330133 non-null object \n", - " 33 Pushblock 25983 non-null object \n", - " 34 Ripper 106945 non-null object \n", - " 35 Scarifier 25994 non-null object \n", - " 36 Tip_Control 25983 non-null object \n", - " 37 Tire_Size 97638 non-null object \n", - " 38 Coupler 220679 non-null object \n", - " 39 Coupler_System 44974 non-null object \n", - " 40 Grouser_Tracks 44875 non-null object \n", - " 41 Hydraulics_Flow 44875 non-null object \n", - " 42 Track_Type 102193 non-null object \n", - " 43 Undercarriage_Pad_Width 102916 non-null object \n", - " 44 Stick_Length 102261 non-null object \n", - " 45 Thumb 102332 non-null object \n", - " 46 Pattern_Changer 102261 non-null object \n", - " 47 Grouser_Type 102193 non-null object \n", - " 48 Backhoe_Mounting 80712 non-null object \n", - " 49 Blade_Type 81875 non-null object \n", - " 50 Travel_Controls 81877 non-null object \n", - " 51 Differential_Type 71564 non-null object \n", - " 52 Steering_Controls 71522 non-null object \n", - "dtypes: datetime64[ns](1), float64(3), int64(5), object(44)\n", - "memory usage: 166.9+ MB\n" - ] - } - ], - "source": [ - "df = pd.read_csv(filepath_or_buffer=\"../data/bluebook-for-bulldozers/TrainAndValid.csv\",\n", - " low_memory=False, # set low_memory=False to prevent mixed data types 
warning \n", - " parse_dates=[\"saledate\"]) # can use the parse_dates parameter and specify which column to treat as a date column\n", - "\n", - "# With parse_dates... check dtype of \"saledate\"\n", - "df.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "Looks like our `saledate` column is now of type [`datetime64[ns]`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.datetime64), a NumPy-specific datetime format with nanosecond precision.\n", - "\n", - "Since pandas works well with NumPy, we can keep it in this format.\n", - "\n", - "How about we view a few samples from our `saledate` column again?" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "0 2006-11-16\n", - "1 2004-03-26\n", - "2 2004-02-26\n", - "3 2011-05-19\n", - "4 2009-07-23\n", - "5 2008-12-18\n", - "6 2004-08-26\n", - "7 2005-11-17\n", - "8 2009-08-27\n", - "9 2007-08-09\n", - "Name: saledate, dtype: datetime64[ns]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df[\"saledate\"][:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! That's looking much better already. \n", - "\n", - "We'll see how having our dates in this format is really helpful later on.\n", - "\n", - "For now, how about we visualize our `saledate` column against our `SalePrice` column again?" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df[\"saledate\"][:1000], # visualize the first 1000 values\n", - " y=df[\"SalePrice\"][:1000])\n", - "ax.set_xlabel(\"Sale Date\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.2 Sorting our DataFrame by saledate\n", - "\n", - "Now we've formatted our `saledate` column to be NumPy `datetime64[ns]` objects, we can use built-in pandas methods such as `sort_values` to sort our DataFrame by date.\n", - "\n", - "And considering this is a time series problem, sorting our DataFrame by date has the added benefit of making sure our data is sequential.\n", - "\n", - "In other words, we want to use examples from the past (example sale prices from previous dates) to try and predict future bulldozer sale prices. \n", - "\n", - "Let's use the [`pandas.DataFrame.sort_values`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) method to sort our DataFrame by `saledate` in ascending order." 
- ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(205615 1989-01-17\n", - " 274835 1989-01-31\n", - " 141296 1989-01-31\n", - " 212552 1989-01-31\n", - " 62755 1989-01-31\n", - " 54653 1989-01-31\n", - " 81383 1989-01-31\n", - " 204924 1989-01-31\n", - " 135376 1989-01-31\n", - " 113390 1989-01-31\n", - " Name: saledate, dtype: datetime64[ns],\n", - " 409202 2012-04-28\n", - " 408976 2012-04-28\n", - " 411695 2012-04-28\n", - " 411319 2012-04-28\n", - " 408889 2012-04-28\n", - " 410879 2012-04-28\n", - " 412476 2012-04-28\n", - " 411927 2012-04-28\n", - " 407124 2012-04-28\n", - " 409203 2012-04-28\n", - " Name: saledate, dtype: datetime64[ns])" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Sort DataFrame in date order\n", - "df.sort_values(by=[\"saledate\"], inplace=True, ascending=True)\n", - "df.saledate.head(10), df.saledate.tail(10)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "Looks like our older samples are now coming first and the newer samples are towards the end of the DataFrame." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1.3 Adding extra features to our DataFrame\n", - "\n", - "One way to potentially increase the predictive power of our data is to enhance it with more features.\n", - "\n", - "This practice is known as [**feature engineering**](https://en.wikipedia.org/wiki/Feature_engineering), taking existing features and using them to create more/different features. 
\n", - "\n", - "There is no set-in-stone way to do feature engineering and often it takes quite a bit of practice/exploration/experimentation to figure out what might work and what won't.\n", - "\n", - "For now, we'll use our `saledate` column to add extra features such as:\n", - "\n", - "* Year of sale\n", - "* Month of sale\n", - "* Day of sale\n", - "* Day of week of sale (in pandas, Monday = 0, Tuesday = 1)\n", - "* Day of year of sale (e.g. January 1st = 1, January 2nd = 2)\n", - "\n", - "Since we're going to be manipulating the data, we'll make a copy of the original DataFrame and perform our changes there.\n", - "\n", - "This will keep the original DataFrame intact if we need it again." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "# Make a copy of the original DataFrame to perform edits on\n", - "df_tmp = df.copy()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because we imported the data using `read_csv()` and we asked pandas to parse the dates using `parse_dates=[\"saledate\"]`, we can now access the [different datetime attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) of the `saledate` column.\n", - "\n", - "Let's use these attributes to add a series of different feature columns to our dataset. \n", - "\n", - "After we've added these extra columns, we can remove the original `saledate` column as its information will be dispersed across these new columns." 
- ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "# Add datetime parameters for saledate\n", - "df_tmp[\"saleYear\"] = df_tmp.saledate.dt.year\n", - "df_tmp[\"saleMonth\"] = df_tmp.saledate.dt.month\n", - "df_tmp[\"saleDay\"] = df_tmp.saledate.dt.day\n", - "df_tmp[\"saleDayofweek\"] = df_tmp.saledate.dt.dayofweek\n", - "df_tmp[\"saleDayofyear\"] = df_tmp.saledate.dt.dayofyear\n", - "\n", - "# Drop original saledate column\n", - "df_tmp.drop(\"saledate\", axis=1, inplace=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We could add more columns in this style, such as whether it was the start or end of a quarter (a sale at the end of a quarter may be influenced by things such as quarterly budgets), but these will do for now.\n", - "\n", - "> **Challenge:** See what other [datetime attributes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) you can add to `df_tmp` using a similar technique to what we've used above. Hint: check the bottom of the pandas.DatetimeIndex docs.\n", - "\n", - "How about we view some of our newly created columns?" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
        SalePrice  saleYear  saleMonth  saleDay  saleDayofweek  saleDayofyear
205615     9500.0      1989          1       17              1             17
274835    14000.0      1989          1       31              1             31
141296    50000.0      1989          1       31              1             31
212552    16000.0      1989          1       31              1             31
62755     22000.0      1989          1       31              1             31
\n", - "
" - ], - "text/plain": [ - " SalePrice saleYear saleMonth saleDay saleDayofweek saleDayofyear\n", - "205615 9500.0 1989 1 17 1 17\n", - "274835 14000.0 1989 1 31 1 31\n", - "141296 50000.0 1989 1 31 1 31\n", - "212552 16000.0 1989 1 31 1 31\n", - "62755 22000.0 1989 1 31 1 31" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# View newly created columns\n", - "df_tmp[[\"SalePrice\", \"saleYear\", \"saleMonth\", \"saleDay\", \"saleDayofweek\", \"saleDayofyear\"]].head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Cool!\n", - "\n", - "Now we've broken our `saledate` column into columns/features, we can perform further exploratory analysis such as visualizing the `SalePrice` against the `saleMonth`.\n", - "\n", - "How about we view the first 10,000 samples (we could also randomly select 10,000 samples) to see if it reveals anything about which month has the highest sales?" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# View 10,000 samples SalePrice against saleMonth\n", - "fig, ax = plt.subplots()\n", - "ax.scatter(x=df_tmp[\"saleMonth\"][:10000], # visualize the first 10000 values\n", - " y=df_tmp[\"SalePrice\"][:10000])\n", - "ax.set_xlabel(\"Sale Month\")\n", - "ax.set_ylabel(\"Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... doesn't look like there's too much conclusive evidence here about which month has the highest sales value.\n", - "\n", - "How about we plot the median sale price of each month?\n", - "\n", - "We can do so by grouping on the `saleMonth` column with [`pandas.DataFrame.groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) and then getting the median of the `SalePrice` column." - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Group DataFrame by saleMonth and then find the median SalePrice\n", - "df_tmp.groupby([\"saleMonth\"])[\"SalePrice\"].median().plot()\n", - "plt.xlabel(\"Month\")\n", - "plt.ylabel(\"Median Sale Price ($)\");" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ohhh it looks like the median sale prices of January and February (months 1 and 2) are quite a bit higher than the other months of the year.\n", - "\n", - "Could this be because of New Year budget spending?\n", - "\n", - "Perhaps... but this would take a bit more investigation.\n", - "\n", - "In the meantime, there are many other values we could look further into." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TK - 1.4 Inspect values of other columns\n", - "\n", - "When first exploring a new problem, it's often a good idea to become as familiar with the data as you can.\n", - "\n", - "Of course, with a dataset that has over 400,000 samples, it's unlikely you'll ever get through every sample.\n", - "\n", - "But that's where the power of data analysis and machine learning can help.\n", - "\n", - "We can use pandas to aggregate thousands of samples into smaller more managable pieces.\n", - "\n", - "And as we'll see later on, we can use machine learning models to model the data and then later inspect which features the model thought were most important.\n", - "\n", - "How about we see which states sell the most bulldozers?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "state\n", - "Florida 67320\n", - "Texas 53110\n", - "California 29761\n", - "Washington 16222\n", - "Georgia 14633\n", - "Maryland 13322\n", - "Mississippi 13240\n", - "Ohio 12369\n", - "Illinois 11540\n", - "Colorado 11529\n", - "Name: count, dtype: int64" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Count the 10 most common values in the state column\n", - "df_tmp.state.value_counts()[:10]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! Looks like Florida sells a fair few bulldozers.\n", - "\n", - "How about we go even further and group our samples by `state` and then find the median `SalePrice` per state?\n", - "\n", - "We'll also compare this to the median `SalePrice` for all samples." - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "# Group DataFrame by state and then find the median SalePrice per state as well as across the whole dataset\n", - "median_prices_by_state = df_tmp.groupby([\"state\"])[\"SalePrice\"].median() # this will return a pandas Series rather than a DataFrame\n", - "median_sale_price = df_tmp[\"SalePrice\"].median()\n", - "\n", - "# Create a plot comparing median sale price per state to median sale price overall\n", - "plt.figure(figsize=(10, 7))\n", - "plt.bar(x=median_prices_by_state.index, # Because we're working with a Series, we can use the index (state names) as the x values\n", - " height=median_prices_by_state.values)\n", - "plt.xlabel(\"State\")\n", - "plt.ylabel(\"Median Sale Price ($)\")\n", - "plt.xticks(rotation=90, fontsize=7);\n", - "plt.axhline(y=median_sale_price, \n", - " color=\"red\", \n", - " linestyle=\"--\", \n", - " label=f\"Median Sale Price: ${median_sale_price:,.0f}\")\n", - "plt.legend();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that's a nice looking figure!\n", - "\n", - "Interestingly, Florida has the most sales and its median sale price is above the overall median across all states.\n", - "\n", - "And if you had a bulldozer and were chasing the highest sale price, the data suggests that selling in South Dakota might be your best bet.\n", - "\n", - "Perhaps bulldozers are in higher demand in South Dakota because of a building or mining boom?\n", - "\n", - "Answering this would require a bit more research.\n", - "\n", - "But what we're doing here is slowly building up a mental model of our data. \n", - "\n", - "So that if we saw an example in the future, we could compare its values to the ones we've already seen." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Model driven exploration\n", - "\n", - "We've performed a small Exploratory Data Analysis (EDA) and enriched our data with some `datetime` attributes. Now let's try to model it.\n", - "\n", - "Why model so early?\n", - "\n", - "Well, we know the evaluation metric (root mean squared log error or RMSLE) we're heading towards. \n", - "\n", - "We could spend more time doing EDA, finding more out about the data ourselves, but what we'll do instead is use a machine learning model to help us do EDA whilst simultaneously working towards the best evaluation metric score we can get. \n", - "\n", - "Remember, one of the biggest goals of starting any new machine learning project is reducing the time between experiments.\n", - "\n", - "Following the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/machine_learning_map.html) and taking into account the fact we've got over 100,000 examples, we find a [`sklearn.linear_model.SGDRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) or a [`sklearn.ensemble.RandomForestRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor) model might be a good candidate.\n", - "\n", - "Since we've worked with the Random Forest algorithm before (on the [heart disease classification problem](https://dev.mrdbourke.com/zero-to-mastery-ml/end-to-end-heart-disease-classification/)), let's try it out on our regression problem.\n", - "\n", - "> **Note:** We're trying just one model here for now. But you can try many other kinds of models from the Scikit-Learn library, they mostly work with a similar API. There are even libraries such as [`LazyPredict`](https://github.com/shankarpandala/lazypredict) which will try many models simultaneously and return a table with the results." 
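Since RMSLE is the metric we're heading towards, it can help to see what it measures before fitting anything. The following is a minimal sketch, assuming the usual definition of RMSLE (the square root of the mean squared difference between `log(1 + prediction)` and `log(1 + actual)`). The function name `rmsle` and the toy prices are made up for illustration and aren't part of the bulldozer dataset.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error, assuming the usual log(1 + x) definition."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Hypothetical prices: predicting double the true price on a cheap machine...
cheap_error = rmsle([10_000.0], [20_000.0])

# ...and predicting double the true price on an expensive machine
expensive_error = rmsle([100_000.0], [200_000.0])

print(f"Cheap machine RMSLE: {cheap_error:.4f}")
print(f"Expensive machine RMSLE: {expensive_error:.4f}")
```

Both errors come out almost identical even though the dollar differences are 10x apart. That's because RMSLE cares about the ratio between the prediction and the true value rather than the absolute gap, which suits a dataset where sale prices span a wide range.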
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "could not convert string to float: 'Low'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[0;32m/var/folders/c4/qj4gdk190td18bqvjjh0p3p00000gn/T/ipykernel_21543/2824176890.py\u001b[0m in \u001b[0;36m?\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m# This won't work since we've got missing numbers and categories\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mensemble\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0mmodel\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mRandomForestRegressor\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn_jobs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m model.fit(X=df_tmp.drop(\"SalePrice\", axis=1), # use all columns except SalePrice as X input\n\u001b[0m\u001b[1;32m 6\u001b[0m y=df_tmp.SalePrice) # use SalePrice column as y input\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(estimator, *args, **kwargs)\u001b[0m\n\u001b[1;32m 1469\u001b[0m skip_parameter_validation=(\n\u001b[1;32m 1470\u001b[0m \u001b[0mprefer_skip_nested_validation\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0mglobal_skip_validation\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1471\u001b[0m )\n\u001b[1;32m 1472\u001b[0m ):\n\u001b[0;32m-> 
1473\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfit_method\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mestimator\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/ensemble/_forest.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0;31m# Validate or convert input data\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 360\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0missparse\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 361\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"sparse multilabel-indicator for y is not supported.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 362\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 363\u001b[0;31m X, y = self._validate_data(\n\u001b[0m\u001b[1;32m 364\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 365\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 366\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 646\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;34m\"estimator\"\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 647\u001b[0m \u001b[0mcheck_y_params\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m**\u001b[0m\u001b[0mdefault_check_params\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 648\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"y\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 649\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 650\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 651\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 652\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 653\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mno_val_X\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mcheck_params\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"ensure_2d\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - 
"\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 1297\u001b[0m raise ValueError(\n\u001b[1;32m 1298\u001b[0m \u001b[0;34mf\"{estimator_name} requires y to be passed, but the target y is None\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1299\u001b[0m )\n\u001b[1;32m 1300\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1301\u001b[0;31m X = check_array(\n\u001b[0m\u001b[1;32m 1302\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1303\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1304\u001b[0m \u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_large_sparse\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_writeable, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)\u001b[0m\n\u001b[1;32m 1009\u001b[0m )\n\u001b[1;32m 1010\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mastype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1011\u001b[0m 
\u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1012\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_asarray_with_order\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mxp\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mxp\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1013\u001b[0;31m \u001b[0;32mexcept\u001b[0m \u001b[0mComplexWarning\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mcomplex_warning\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1014\u001b[0m raise ValueError(\n\u001b[1;32m 1015\u001b[0m \u001b[0;34m\"Complex data not supported\\n{}\\n\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1016\u001b[0m ) from complex_warning\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/utils/_array_api.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(array, dtype, order, copy, xp, device)\u001b[0m\n\u001b[1;32m 747\u001b[0m \u001b[0;31m# Use NumPy API to support order\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 748\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcopy\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mTrue\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 749\u001b[0m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 750\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 751\u001b[0;31m \u001b[0marray\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnumpy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0morder\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0morder\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 752\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 753\u001b[0m \u001b[0;31m# At this point array is a NumPy ndarray. We convert it to an array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 754\u001b[0m \u001b[0;31m# container that is consistent with the input's namespace.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;32m~/miniforge3/envs/ai/lib/python3.11/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m?\u001b[0;34m(self, dtype, copy)\u001b[0m\n\u001b[1;32m 2149\u001b[0m def __array__(\n\u001b[1;32m 2150\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mnpt\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mDTypeLike\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mbool_t\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2151\u001b[0m ) -> np.ndarray:\n\u001b[1;32m 2152\u001b[0m \u001b[0mvalues\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2153\u001b[0;31m \u001b[0marr\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0masarray\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2154\u001b[0m if (\n\u001b[1;32m 2155\u001b[0m \u001b[0mastype_is_view\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2156\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0musing_copy_on_write\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", - "\u001b[0;31mValueError\u001b[0m: could not convert string to float: 'Low'" - ] - } - ], - "source": [ - "# This won't work since we've got missing numbers and categories\n", - "from sklearn.ensemble import RandomForestRegressor\n", - "\n", - "model = RandomForestRegressor(n_jobs=-1)\n", - "model.fit(X=df_tmp.drop(\"SalePrice\", axis=1), # use all columns except SalePrice as X input\n", - " y=df_tmp.SalePrice) # use SalePrice column as y input" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Oh no!\n", - "\n", - "When we try to fit our model to the data, we get a value error similar to:\n", - "\n", - "> `ValueError: could not convert string to float: 'Low'`\n", - "\n", - "The problem here is that some of the features of our data are in string format and machine learning models love numbers.\n", - "\n", - "Not to mention some of our samples have missing values.\n", - "\n", - "And typically, machine learning models require all data to 
be in numerical format and all missing values to be filled.\n", - "\n", - "Let's start to fix this by inspecting the different datatypes in our DataFrame.\n", - "\n", - "We can do so using the [`pandas.DataFrame.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method. This will give us the different datatypes as well as how many non-null values (a null value is generally a missing value) each column of our `df_tmp` DataFrame has.\n", - "\n", - "> **Note:** There are some ML models such as [`sklearn.ensemble.HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html), [CatBoost](https://catboost.ai/) and [XGBoost](https://xgboost.ai/) which can handle missing values. However, I'll leave exploring each of these as extra-curriculum/extensions." - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "Index: 412698 entries, 205615 to 409203\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 fiModelDesc 412698 non-null object \n", - " 10 fiBaseModel 412698 non-null object \n", - " 11 fiSecondaryDesc 271971 non-null object \n", - " 12 fiModelSeries 58667 non-null object \n", - " 13 fiModelDescriptor 74816 non-null object \n", - " 14 ProductSize 196093 non-null object \n", - " 15 fiProductClassDesc 412698 non-null object \n", - " 16 state 412698 non-null object \n", - " 17 ProductGroup 412698 
non-null object \n", - " 18 ProductGroupDesc 412698 non-null object \n", - " 19 Drive_System 107087 non-null object \n", - " 20 Enclosure 412364 non-null object \n", - " 21 Forks 197715 non-null object \n", - " 22 Pad_Type 81096 non-null object \n", - " 23 Ride_Control 152728 non-null object \n", - " 24 Stick 81096 non-null object \n", - " 25 Transmission 188007 non-null object \n", - " 26 Turbocharged 81096 non-null object \n", - " 27 Blade_Extension 25983 non-null object \n", - " 28 Blade_Width 25983 non-null object \n", - " 29 Enclosure_Type 25983 non-null object \n", - " 30 Engine_Horsepower 25983 non-null object \n", - " 31 Hydraulics 330133 non-null object \n", - " 32 Pushblock 25983 non-null object \n", - " 33 Ripper 106945 non-null object \n", - " 34 Scarifier 25994 non-null object \n", - " 35 Tip_Control 25983 non-null object \n", - " 36 Tire_Size 97638 non-null object \n", - " 37 Coupler 220679 non-null object \n", - " 38 Coupler_System 44974 non-null object \n", - " 39 Grouser_Tracks 44875 non-null object \n", - " 40 Hydraulics_Flow 44875 non-null object \n", - " 41 Track_Type 102193 non-null object \n", - " 42 Undercarriage_Pad_Width 102916 non-null object \n", - " 43 Stick_Length 102261 non-null object \n", - " 44 Thumb 102332 non-null object \n", - " 45 Pattern_Changer 102261 non-null object \n", - " 46 Grouser_Type 102193 non-null object \n", - " 47 Backhoe_Mounting 80712 non-null object \n", - " 48 Blade_Type 81875 non-null object \n", - " 49 Travel_Controls 81877 non-null object \n", - " 50 Differential_Type 71564 non-null object \n", - " 51 Steering_Controls 71522 non-null object \n", - " 52 saleYear 412698 non-null int32 \n", - " 53 saleMonth 412698 non-null int32 \n", - " 54 saleDay 412698 non-null int32 \n", - " 55 saleDayofweek 412698 non-null int32 \n", - " 56 saleDayofyear 412698 non-null int32 \n", - "dtypes: float64(3), int32(5), int64(5), object(44)\n", - "memory usage: 174.7+ MB\n" - ] - } - ], - "source": [ - "# Check for missing values 
and different datatypes \n", - "df_tmp.info();" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, it seems as though we've got a fair few different datatypes. \n", - "\n", - "There are `int64` types such as `MachineID`.\n", - "\n", - "There are `float64` types such as `SalePrice`.\n", - "\n", - "And there are `object` (the `object` dtype can hold any Python object, including strings) types such as `UsageBand`.\n", - "\n", - "> **Resource:** You can see a list of all the [pandas dtypes in the pandas user guide](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes).\n", - "\n", - "How about we find out how many missing values are in each column?\n", - "\n", - "We can do so using [`pandas.DataFrame.isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html) (`isna` stands for 'is null or NaN'), which returns a boolean for each value (`True` if missing, `False` if not). \n", - "\n", - "Let's start by checking the missing values in the head of our DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "205615 False False False False False False \n", - "274835 False False False False False False \n", - "141296 False False False False False False \n", - "212552 False False False False False False \n", - "62755 False False False False False False \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "205615 False True True False ... \n", - "274835 False True True False ... \n", - "141296 False True True False ... \n", - "212552 False True True False ... \n", - "62755 False True True False ... \n", - "\n", - " Backhoe_Mounting Blade_Type Travel_Controls Differential_Type \\\n", - "205615 False False False True \n", - "274835 True True True False \n", - "141296 False False False True \n", - "212552 True True True False \n", - "62755 False False False True \n", - "\n", - " Steering_Controls saleYear saleMonth saleDay saleDayofweek \\\n", - "205615 True False False False False \n", - "274835 False False False False False \n", - "141296 True False False False False \n", - "212552 False False False False False \n", - "62755 True False False False False \n", - "\n", - " saleDayofyear \n", - "205615 False \n", - "274835 False \n", - "141296 False \n", - "212552 False \n", - "62755 False \n", - "\n", - "[5 rows x 57 columns]" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find missing values in the head of our DataFrame \n", - "df_tmp.head().isna()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Alright it seems as though we've got some missing values in the `MachineHoursCurrentMeter` as well as the `UsageBand` and a few other columns.\n", - "\n", - "But so far we've only viewed the first few rows.\n", - "\n", - "It'll be very time consuming to go through each row one by one so how about we get the total missing values per column?\n", - "\n", - "We can 
do so by calling `.isna()` on the whole DataFrame and then chaining it together with `.sum()`.\n", - "\n", - "Doing so will give us the total `True`/`False` values in a given column (when summing, `True` = 1, `False` = 0)." - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "SalesID 0\n", - "SalePrice 0\n", - "MachineID 0\n", - "ModelID 0\n", - "datasource 0\n", - "auctioneerID 20136\n", - "YearMade 0\n", - "MachineHoursCurrentMeter 265194\n", - "UsageBand 339028\n", - "fiModelDesc 0\n", - "fiBaseModel 0\n", - "fiSecondaryDesc 140727\n", - "fiModelSeries 354031\n", - "fiModelDescriptor 337882\n", - "ProductSize 216605\n", - "fiProductClassDesc 0\n", - "state 0\n", - "ProductGroup 0\n", - "ProductGroupDesc 0\n", - "Drive_System 305611\n", - "Enclosure 334\n", - "Forks 214983\n", - "Pad_Type 331602\n", - "Ride_Control 259970\n", - "Stick 331602\n", - "Transmission 224691\n", - "Turbocharged 331602\n", - "Blade_Extension 386715\n", - "Blade_Width 386715\n", - "Enclosure_Type 386715\n", - "Engine_Horsepower 386715\n", - "Hydraulics 82565\n", - "Pushblock 386715\n", - "Ripper 305753\n", - "Scarifier 386704\n", - "Tip_Control 386715\n", - "Tire_Size 315060\n", - "Coupler 192019\n", - "Coupler_System 367724\n", - "Grouser_Tracks 367823\n", - "Hydraulics_Flow 367823\n", - "Track_Type 310505\n", - "Undercarriage_Pad_Width 309782\n", - "Stick_Length 310437\n", - "Thumb 310366\n", - "Pattern_Changer 310437\n", - "Grouser_Type 310505\n", - "Backhoe_Mounting 331986\n", - "Blade_Type 330823\n", - "Travel_Controls 330821\n", - "Differential_Type 341134\n", - "Steering_Controls 341176\n", - "saleYear 0\n", - "saleMonth 0\n", - "saleDay 0\n", - "saleDayofweek 0\n", - "saleDayofyear 0\n", - "dtype: int64" - ] - }, - "execution_count": 24, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check for total missing values per column\n", - "df_tmp.isna().sum()" - ] - }, - { - 
"cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! It looks like our DataFrame has quite a few missing values.\n", - "\n", - "Not to worry, we can work on fixing this later on.\n", - "\n", - "How about we start by trying to turn all of our data into numbers? " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Inspecting the datatypes in our DataFrame \n", - "\n", - "One way to help turn all of our data into numbers is to convert the columns with the `object` datatype into a `category` datatype using [`pandas.CategoricalDtype`](https://pandas.pydata.org/docs/reference/api/pandas.CategoricalDtype.html).\n", - "\n", - "> **Note:** There are many different ways to convert values into numbers, and often the best way will be specific to the value you're trying to convert. The method we're going to use, converting all objects (that are mostly strings) to categories, is one of the faster methods as it makes the quick assumption that each unique value can be given its own number. \n", - "\n", - "We can check the datatype of an individual column using the `.dtype` attribute and we can get its full name using `.dtype.name`."
- ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(dtype('O'), 'object')" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the dtype of a given column\n", - "df_tmp[\"UsageBand\"].dtype, df_tmp[\"UsageBand\"].dtype.name" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "Now we've got a way to check a column's datatype individually.\n", - "\n", - "There's also another group of methods to check a column's datatype directly.\n", - "\n", - "For example, using [`pd.api.types.is_object_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_object_dtype.html) we can get a boolean response as to whether the input is an object or not.\n", - "\n", - "> **Note:** There are many more of these checks you can perform for other datatypes such as strings under a similar name space `pd.api.types.is_XYZ_dtype`. See the [pandas documentation](https://pandas.pydata.org/docs/reference/arrays.html) for more.\n", - "\n", - "Let's see how it works on our `df_tmp[\"UsageBand\"]` column." - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check whether a column is an object\n", - "pd.api.types.is_object_dtype(df_tmp[\"UsageBand\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can also check whether a column is a string with [`pd.api.types.is_string_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_string_dtype.html). 
" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "True" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check whether a column is a string\n", - "pd.api.types.is_string_dtype(df_tmp[\"state\"])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "We can even loop through the items (columns and their labels) in our DataFrame using [`pandas.DataFrame.items()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html) (in Python dictionary terms, calling `.items()` on a DataFrame will treat the column names as the keys and the column values as the values) and print out samples of columns which have the `string` datatype.\n", - "\n", - "As an extra check, passing the sample to [`pd.api.types.infer_dtype()`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.infer_dtype.html) will return the datatype of the sample.\n", - "\n", - "This will be a good way to keep exploring our data." 
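To get a feel for what `pd.api.types.infer_dtype()` reports before we run it on our real columns, here's a minimal sketch on three tiny made-up Series. The names `all_strings`, `with_missing` and `only_missing` are hypothetical and only exist for this illustration; they are not part of our dataset. By default `infer_dtype()` skips missing values (`skipna=True`), which is why a sample containing nothing but `NaN` comes back as `empty`.

```python
import numpy as np
import pandas as pd

# Three toy object-dtype Series (made up for illustration)
all_strings = pd.Series(["Texas", "Ohio"], dtype="object")
with_missing = pd.Series(["Texas", np.nan], dtype="object")
only_missing = pd.Series([np.nan, np.nan], dtype="object")

# Collect what infer_dtype reports for each case
inferred = {
    "all_strings": pd.api.types.infer_dtype(all_strings),
    "with_missing (skipna=True)": pd.api.types.infer_dtype(with_missing, skipna=True),
    "with_missing (skipna=False)": pd.api.types.infer_dtype(with_missing, skipna=False),
    "only_missing": pd.api.types.infer_dtype(only_missing),
}

for name, result in inferred.items():
    print(f"{name}: {result}")
```

So if a random sample from one of our columns shows up as `empty`, it most likely means we happened to sample a missing value rather than the column being empty.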
- ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This is a key: key1\n", - "This is a value: hello\n", - "This is a key: key2\n", - "This is a value: world!\n" - ] - } - ], - "source": [ - "# Quick example of calling .items() on a dictionary\n", - "random_dict = {\"key1\": \"hello\",\n", - "               \"key2\": \"world!\"}\n", - "\n", - "for key, value in random_dict.items():\n", - "    print(f\"This is a key: {key}\")\n", - "    print(f\"This is a value: {value}\")" - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: fiModelDesc | Column dtype: object | Example value: ['330BL'] | Example value dtype: string\n", - "Column name: fiBaseModel | Column dtype: object | Example value: ['906'] | Example value dtype: string\n", - "Column name: fiProductClassDesc | Column dtype: object | Example value: ['Wheel Loader - 100.0 to 110.0 Horsepower'] | Example value dtype: string\n", - "Column name: state | Column dtype: object | Example value: ['Washington'] | Example value dtype: string\n", - "Column name: ProductGroup | Column dtype: object | Example value: ['TTT'] | Example value dtype: string\n", - "Column name: ProductGroupDesc | Column dtype: object | Example value: ['Track Type Tractors'] | Example value dtype: string\n" - ] - } - ], - "source": [ - "# Print column names and example content of columns which contain strings\n", - "for label, content in df_tmp.items():\n", - "    if pd.api.types.is_string_dtype(content):\n", - "        # Check datatype of target column\n", - "        column_datatype = df_tmp[label].dtype.name\n", - "\n", - "        # Get random sample from column values\n", - "        example_value = content.sample(1).values\n", - "\n", - "        # Infer random sample datatype\n", - "        example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - "        print(f\"Column name: 
{label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Hmm... it seems that there are many more columns in the `df_tmp` with the `object` type that didn't display when checking for the string datatype (we know there are many `object` datatype columns in our DataFrame from using `df_tmp.info()`).\n", - "\n", - "How about we try the same as above, except this time instead of `pd.api.types.is_string_dtype`, we use `pd.api.types.is_object_dtype`?\n", - "\n", - "Let's try it." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: UsageBand | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiModelDesc | Column dtype: object | Example value: ['560B'] | Example value dtype: string\n", - "Column name: fiBaseModel | Column dtype: object | Example value: ['310'] | Example value dtype: string\n", - "Column name: fiSecondaryDesc | Column dtype: object | Example value: ['LC'] | Example value dtype: string\n", - "Column name: fiModelSeries | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiModelDescriptor | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: ProductSize | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: fiProductClassDesc | Column dtype: object | Example value: ['Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower'] | Example value dtype: string\n", - "Column name: state | Column dtype: object | Example value: ['Texas'] | Example value dtype: string\n", - "Column name: ProductGroup | Column dtype: object | Example value: ['TTT'] | Example value dtype: string\n", - "Column name: ProductGroupDesc | Column dtype: object | Example 
value: ['Wheel Loader'] | Example value dtype: string\n", - "Column name: Drive_System | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Enclosure | Column dtype: object | Example value: ['OROPS'] | Example value dtype: string\n", - "Column name: Forks | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Pad_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Ride_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Stick | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Transmission | Column dtype: object | Example value: ['Standard'] | Example value dtype: string\n", - "Column name: Turbocharged | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Blade_Extension | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Blade_Width | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Enclosure_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Engine_Horsepower | Column dtype: object | Example value: ['No'] | Example value dtype: string\n", - "Column name: Hydraulics | Column dtype: object | Example value: ['2 Valve'] | Example value dtype: string\n", - "Column name: Pushblock | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Ripper | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Scarifier | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Tip_Control | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Tire_Size | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - 
"Column name: Coupler | Column dtype: object | Example value: ['Manual'] | Example value dtype: string\n", - "Column name: Coupler_System | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Grouser_Tracks | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Hydraulics_Flow | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Track_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Undercarriage_Pad_Width | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Stick_Length | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Thumb | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Pattern_Changer | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Grouser_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Backhoe_Mounting | Column dtype: object | Example value: ['None or Unspecified'] | Example value dtype: string\n", - "Column name: Blade_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Travel_Controls | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Differential_Type | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "Column name: Steering_Controls | Column dtype: object | Example value: [nan] | Example value dtype: empty\n", - "\n", - "[INFO] Total number of object type columns: 44\n" - ] - } - ], - "source": [ - "# Start a count of how many object type columns there are\n", - "number_of_object_type_columns = 0\n", - "\n", - "for label, content in df_tmp.items():\n", - " # Check to see if column is of object type (this will include 
the string columns)\n", - "    if pd.api.types.is_object_dtype(content): \n", - "        # Check datatype of target column\n", - "        column_datatype = df_tmp[label].dtype.name\n", - "\n", - "        # Get random sample from column values\n", - "        example_value = content.sample(1).values\n", - "\n", - "        # Infer random sample datatype\n", - "        example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - "        print(f\"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")\n", - "\n", - "        number_of_object_type_columns += 1\n", - "\n", - "print(f\"\\n[INFO] Total number of object type columns: {number_of_object_type_columns}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful, looks like we've got sample outputs from all of the columns with the `object` datatype.\n", - "\n", - "It also looks like many of the random samples are missing values." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Converting strings to categories \n", - "\n", - "In pandas, one way to convert object/string values to numerical values is to convert them to categories or more specifically, the `pd.CategoricalDtype` datatype.\n", - "\n", - "This datatype keeps the underlying data the same (e.g. 
doesn't change the string) but enables easy conversion to a numeric code using [`.cat.codes`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.codes.html).\n", - "\n", - "For example, the column `state` might have the values `'Alabama', 'Alaska', 'Arizona'...` and these could be mapped to numeric values `0, 1, 2...` respectively.\n", - "\n", - "To see this in action, let's first convert the object datatype columns to `\"category\"` datatype.\n", - "\n", - "We can do so by looping through the `.items()` of our DataFrame and reassigning each object datatype column using [`pandas.Series.astype(dtype=\"category\")`](https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html)." - ] - }, - { - "cell_type": "code", - "execution_count": 55, - "metadata": {}, - "outputs": [], - "source": [ - "# This will turn all of the object columns into category values\n", - "for label, content in df_tmp.items(): \n", - "    if pd.api.types.is_object_dtype(content):\n", - "        df_tmp[label] = df_tmp[label].astype(\"category\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful!\n", - "\n", - "Now let's check if it worked by calling `.info()` on our DataFrame."
- ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64 \n", - " 8 UsageBand 73670 non-null category\n", - " 9 fiModelDesc 412698 non-null category\n", - " 10 fiBaseModel 412698 non-null category\n", - " 11 fiSecondaryDesc 271971 non-null category\n", - " 12 fiModelSeries 58667 non-null category\n", - " 13 fiModelDescriptor 74816 non-null category\n", - " 14 ProductSize 196093 non-null category\n", - " 15 fiProductClassDesc 412698 non-null category\n", - " 16 state 412698 non-null category\n", - " 17 ProductGroup 412698 non-null category\n", - " 18 ProductGroupDesc 412698 non-null category\n", - " 19 Drive_System 107087 non-null category\n", - " 20 Enclosure 412364 non-null category\n", - " 21 Forks 197715 non-null category\n", - " 22 Pad_Type 81096 non-null category\n", - " 23 Ride_Control 152728 non-null category\n", - " 24 Stick 81096 non-null category\n", - " 25 Transmission 188007 non-null category\n", - " 26 Turbocharged 81096 non-null category\n", - " 27 Blade_Extension 25983 non-null category\n", - " 28 Blade_Width 25983 non-null category\n", - " 29 Enclosure_Type 25983 non-null category\n", - " 30 Engine_Horsepower 25983 non-null category\n", - " 31 Hydraulics 330133 non-null category\n", - " 32 Pushblock 25983 non-null category\n", - " 33 Ripper 106945 non-null category\n", - " 34 Scarifier 25994 non-null category\n", - " 35 
Tip_Control 25983 non-null category\n", - " 36 Tire_Size 97638 non-null category\n", - " 37 Coupler 220679 non-null category\n", - " 38 Coupler_System 44974 non-null category\n", - " 39 Grouser_Tracks 44875 non-null category\n", - " 40 Hydraulics_Flow 44875 non-null category\n", - " 41 Track_Type 102193 non-null category\n", - " 42 Undercarriage_Pad_Width 102916 non-null category\n", - " 43 Stick_Length 102261 non-null category\n", - " 44 Thumb 102332 non-null category\n", - " 45 Pattern_Changer 102261 non-null category\n", - " 46 Grouser_Type 102193 non-null category\n", - " 47 Backhoe_Mounting 80712 non-null category\n", - " 48 Blade_Type 81875 non-null category\n", - " 49 Travel_Controls 81877 non-null category\n", - " 50 Differential_Type 71564 non-null category\n", - " 51 Steering_Controls 71522 non-null category\n", - " 52 saleYear 412698 non-null int64 \n", - " 53 saleMonth 412698 non-null int64 \n", - " 54 saleDay 412698 non-null int64 \n", - " 55 saleDayofweek 412698 non-null int64 \n", - " 56 saleDayofyear 412698 non-null int64 \n", - "dtypes: category(44), float64(3), int64(10)\n", - "memory usage: 60.1 MB\n" - ] - } - ], - "source": [ - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It looks like it worked!\n", - "\n", - "All of the object datatype columns now have the category datatype.\n", - "\n", - "We can inspect this on a single column using `pandas.Series.dtype`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "CategoricalDtype(categories=['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',\n", - " 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia',\n", - " 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas',\n", - " 'Kentucky', 'Louisiana', 'Maine', 'Maryland',\n", - " 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',\n", - " 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',\n", - " 'New Jersey', 'New Mexico', 'New York', 'North Carolina',\n", - " 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',\n", - " 'Puerto Rico', 'Rhode Island', 'South Carolina',\n", - " 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah',\n", - " 'Vermont', 'Virginia', 'Washington', 'Washington DC',\n", - " 'West Virginia', 'Wisconsin', 'Wyoming'],\n", - ", ordered=False, categories_dtype=object)" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check the datatype of a single column\n", - "df_tmp.state.dtype" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent, notice how the column is now of type `pd.CategoricalDtype`.\n", - "\n", - "We can also access these categories using [`pandas.Series.cat.categories`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.categories.html)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado',\n", - " 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho',\n", - " 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana',\n", - " 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',\n", - " 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada',\n", - " 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',\n", - " 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon',\n", - " 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',\n", - " 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah', 'Vermont',\n", - " 'Virginia', 'Washington', 'Washington DC', 'West Virginia', 'Wisconsin',\n", - " 'Wyoming'],\n", - " dtype='object')" - ] - }, - "execution_count": 61, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Get the category names of a given column\n", - "df_tmp.state.cat.categories" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we can get the category codes (the numeric values representing the category) using [`pandas.Series.cat.codes`](https://pandas.pydata.org/docs/reference/api/pandas.Series.cat.codes.html)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "205615 43\n", - "274835 8\n", - "141296 8\n", - "212552 8\n", - "62755 8\n", - " ..\n", - "410879 4\n", - "412476 4\n", - "411927 4\n", - "407124 4\n", - "409203 4\n", - "Length: 412698, dtype: int8" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Inspect the category codes\n", - "df_tmp.state.cat.codes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This gives us a numeric representation of our object/string datatype columns." - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Target state category number 43 maps to: Texas\n" - ] - } - ], - "source": [ - "# Get example string using category number\n", - "target_state_cat_number = 43\n", - "target_state_cat_value = df_tmp.state.cat.categories[target_state_cat_number] \n", - "print(f\"[INFO] Target state category number {target_state_cat_number} maps to: {target_state_cat_value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "All of our `object` columns are now categorical, so we can turn the categories into numbers. However, our data is still missing values..."
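One thing worth knowing before we deal with those missing values is how they show up in the category codes. Here's a minimal sketch on a tiny made-up Series (the `states` values and the `code_to_state` helper dictionary are hypothetical, purely for illustration). pandas sorts the unique values to build the categories, numbers them starting from 0 and gives any missing value the code `-1`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made up for illustration)
states = pd.Series(["Texas", "Alabama", np.nan, "Alaska", "Texas"], dtype="object")

# Convert the object column to the category datatype
states_cat = states.astype("category")

# Categories are the sorted unique (non-missing) values
print(states_cat.cat.categories.tolist())  # ['Alabama', 'Alaska', 'Texas']

# Codes index into the categories, missing values get -1
print(states_cat.cat.codes.tolist())  # [2, 0, -1, 1, 2]

# Build a lookup from code back to the original string
code_to_state = dict(enumerate(states_cat.cat.categories))
print(code_to_state[2])  # Texas
```

So even after converting to codes, the `-1` values remind us that the missing data hasn't gone anywhere and still needs to be handled.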
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Saving our preprocessed data (part 1)\n", - "\n", - "We've updated our dataset to turn object datatypes into categories.\n", - "\n", - "However, it still contains missing values.\n", - "\n", - "Before we get to those, how about we save our current DataFrame to file so we can import it again later if necessary?\n", - "\n", - "Saving and updating your dataset as you go is common practice in machine learning problems. As your problem changes and evolves, the dataset you're working with will likely change too.\n", - "\n", - "Making checkpoints of your dataset is similar to making checkpoints of your code." - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [], - "source": [ - "# Save preprocessed data to file\n", - "df_tmp.to_csv(\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv\",\n", - " index=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now we've saved our preprocessed data to file, we can re-import it and make sure it's in the same format." - ] - }, - { - "cell_type": "code", - "execution_count": 137, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1646770 9500.0 1126363 8434 132 18.0 1974 \n", - "1 1821514 14000.0 1194089 10150 132 99.0 1980 \n", - "2 1505138 50000.0 1473654 4139 132 99.0 1978 \n", - "3 1671174 16000.0 1327630 8591 132 99.0 1980 \n", - "4 1329056 22000.0 1336053 4089 132 99.0 1984 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc ... Backhoe_Mounting \\\n", - "0 NaN NaN TD20 ... None or Unspecified \n", - "1 NaN NaN A66 ... NaN \n", - "2 NaN NaN D7G ... None or Unspecified \n", - "3 NaN NaN A62 ... NaN \n", - "4 NaN NaN D3B ... None or Unspecified \n", - "\n", - " Blade_Type Travel_Controls Differential_Type Steering_Controls \\\n", - "0 Straight None or Unspecified NaN NaN \n", - "1 NaN NaN Standard Conventional \n", - "2 Straight None or Unspecified NaN NaN \n", - "3 NaN NaN Standard Conventional \n", - "4 PAT Lever NaN NaN \n", - "\n", - " saleYear saleMonth saleDay saleDayofweek saleDayofyear \n", - "0 1989 1 17 1 17 \n", - "1 1989 1 31 1 31 \n", - "2 1989 1 31 1 31 \n", - "3 1989 1 31 1 31 \n", - "4 1989 1 31 1 31 \n", - "\n", - "[5 rows x 57 columns]" - ] - }, - "execution_count": 137, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Import preprocessed data from file\n", - "df_tmp = pd.read_csv(\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.csv\",\n", - " low_memory=False)\n", - "\n", - "df_tmp.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Excellent, looking at the tail end (the far right side), our processed DataFrame has the columns we added to it (the extra date features) but it's still missing values.\n", - "\n", - "But if we check `df_tmp.info()`..."
- ] - }, - { - "cell_type": "code", - "execution_count": 138, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 57 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64\n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 392562 non-null float64\n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 147504 non-null float64\n", - " 8 UsageBand 73670 non-null object \n", - " 9 fiModelDesc 412698 non-null object \n", - " 10 fiBaseModel 412698 non-null object \n", - " 11 fiSecondaryDesc 271971 non-null object \n", - " 12 fiModelSeries 58667 non-null object \n", - " 13 fiModelDescriptor 74816 non-null object \n", - " 14 ProductSize 196093 non-null object \n", - " 15 fiProductClassDesc 412698 non-null object \n", - " 16 state 412698 non-null object \n", - " 17 ProductGroup 412698 non-null object \n", - " 18 ProductGroupDesc 412698 non-null object \n", - " 19 Drive_System 107087 non-null object \n", - " 20 Enclosure 412364 non-null object \n", - " 21 Forks 197715 non-null object \n", - " 22 Pad_Type 81096 non-null object \n", - " 23 Ride_Control 152728 non-null object \n", - " 24 Stick 81096 non-null object \n", - " 25 Transmission 188007 non-null object \n", - " 26 Turbocharged 81096 non-null object \n", - " 27 Blade_Extension 25983 non-null object \n", - " 28 Blade_Width 25983 non-null object \n", - " 29 Enclosure_Type 25983 non-null object \n", - " 30 Engine_Horsepower 25983 non-null object \n", - " 31 Hydraulics 330133 non-null object \n", - " 32 Pushblock 25983 non-null object \n", - " 33 Ripper 106945 non-null object \n", - " 34 Scarifier 25994 non-null object \n", - " 35 Tip_Control 25983 non-null object 
\n", - " 36 Tire_Size 97638 non-null object \n", - " 37 Coupler 220679 non-null object \n", - " 38 Coupler_System 44974 non-null object \n", - " 39 Grouser_Tracks 44875 non-null object \n", - " 40 Hydraulics_Flow 44875 non-null object \n", - " 41 Track_Type 102193 non-null object \n", - " 42 Undercarriage_Pad_Width 102916 non-null object \n", - " 43 Stick_Length 102261 non-null object \n", - " 44 Thumb 102332 non-null object \n", - " 45 Pattern_Changer 102261 non-null object \n", - " 46 Grouser_Type 102193 non-null object \n", - " 47 Backhoe_Mounting 80712 non-null object \n", - " 48 Blade_Type 81875 non-null object \n", - " 49 Travel_Controls 81877 non-null object \n", - " 50 Differential_Type 71564 non-null object \n", - " 51 Steering_Controls 71522 non-null object \n", - " 52 saleYear 412698 non-null int64 \n", - " 53 saleMonth 412698 non-null int64 \n", - " 54 saleDay 412698 non-null int64 \n", - " 55 saleDayofweek 412698 non-null int64 \n", - " 56 saleDayofyear 412698 non-null int64 \n", - "dtypes: float64(3), int64(10), object(44)\n", - "memory usage: 179.5+ MB\n" - ] - } - ], - "source": [ - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We notice that all of the `category` datatype columns are back to the `object` datatype.\n", - "\n", - "This is strange since we already converted the `object` datatype columns to `category`.\n", - "\n", - "Well then why did they change back?\n", - "\n", - "This happens because of a limitation of the CSV (`.csv`) file format: it doesn't preserve datatypes. Instead, it stores all of the values as plain text.\n", - "\n", - "So when we read in a CSV, pandas defaults to interpreting strings as `object` datatypes.\n", - "\n", - "Not to worry though, we can easily convert them to the `category` datatype as we did before.\n", - "\n", - "> **Note:** If you'd like to retain the datatypes when saving your data, you can use file formats such as 
[`parquet`](https://pandas.pydata.org/docs/user_guide/io.html#parquet) (Apache Parquet) and [`feather`](https://pandas.pydata.org/docs/user_guide/io.html#feather). These filetypes have several advantages over CSV in terms of processing speed and storage size. However, data stored in these formats is not human-readable, so you won't be able to open the files and inspect them without specific tools. For more on different file formats in pandas, see the [IO tools documentation page](https://pandas.pydata.org/docs/user_guide/io.html#)." - ] - }, - { - "cell_type": "code", - "execution_count": 139, - "metadata": {}, - "outputs": [], - "source": [ - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_object_dtype(content):\n", - " # Turn object columns into category datatype\n", - " df_tmp[label] = df_tmp[label].astype(\"category\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If we want to preserve the datatypes of our data, we can save to the `parquet` or `feather` format.\n", - "\n", - "Let's try using the `parquet` format.\n", - "\n", - "To do so, we can use the [`pandas.DataFrame.to_parquet()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html) method.\n", - "\n", - "Files in the `parquet` format typically have the file extension `.parquet`." - ] - }, - { - "cell_type": "code", - "execution_count": 155, - "metadata": {}, - "outputs": [], - "source": [ - "# Saving to parquet format requires pyarrow or fastparquet (or both)\n", - "# Can install via `pip install pyarrow fastparquet`\n", - "df_tmp.to_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet\", \n", - " engine=\"auto\") # \"auto\" will automatically use pyarrow or fastparquet, defaulting to pyarrow first" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wonderful! Now let's try importing our DataFrame from the `parquet` format and check it using `df_tmp.info()`."
- ] - }, - { - "cell_type": "code", - "execution_count": 157, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\n", - "RangeIndex: 412698 entries, 0 to 412697\n", - "Data columns (total 59 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 SalesID 412698 non-null int64 \n", - " 1 SalePrice 412698 non-null float64 \n", - " 2 MachineID 412698 non-null int64 \n", - " 3 ModelID 412698 non-null int64 \n", - " 4 datasource 412698 non-null int64 \n", - " 5 auctioneerID 412698 non-null float64 \n", - " 6 YearMade 412698 non-null int64 \n", - " 7 MachineHoursCurrentMeter 412698 non-null float64 \n", - " 8 UsageBand 73670 non-null category\n", - " 9 fiModelDesc 412698 non-null category\n", - " 10 fiBaseModel 412698 non-null category\n", - " 11 fiSecondaryDesc 271971 non-null category\n", - " 12 fiModelSeries 58667 non-null category\n", - " 13 fiModelDescriptor 74816 non-null category\n", - " 14 ProductSize 196093 non-null category\n", - " 15 fiProductClassDesc 412698 non-null category\n", - " 16 state 412698 non-null category\n", - " 17 ProductGroup 412698 non-null category\n", - " 18 ProductGroupDesc 412698 non-null category\n", - " 19 Drive_System 107087 non-null category\n", - " 20 Enclosure 412364 non-null category\n", - " 21 Forks 197715 non-null category\n", - " 22 Pad_Type 81096 non-null category\n", - " 23 Ride_Control 152728 non-null category\n", - " 24 Stick 81096 non-null category\n", - " 25 Transmission 188007 non-null category\n", - " 26 Turbocharged 81096 non-null category\n", - " 27 Blade_Extension 25983 non-null category\n", - " 28 Blade_Width 25983 non-null category\n", - " 29 Enclosure_Type 25983 non-null category\n", - " 30 Engine_Horsepower 25983 non-null category\n", - " 31 Hydraulics 330133 non-null category\n", - " 32 Pushblock 25983 non-null category\n", - " 33 Ripper 106945 non-null category\n", - " 34 Scarifier 25994 non-null category\n", - " 35 
Tip_Control 25983 non-null category\n", - " 36 Tire_Size 97638 non-null category\n", - " 37 Coupler 220679 non-null category\n", - " 38 Coupler_System 44974 non-null category\n", - " 39 Grouser_Tracks 44875 non-null category\n", - " 40 Hydraulics_Flow 44875 non-null category\n", - " 41 Track_Type 102193 non-null category\n", - " 42 Undercarriage_Pad_Width 102916 non-null category\n", - " 43 Stick_Length 102261 non-null category\n", - " 44 Thumb 102332 non-null category\n", - " 45 Pattern_Changer 102261 non-null category\n", - " 46 Grouser_Type 102193 non-null category\n", - " 47 Backhoe_Mounting 80712 non-null category\n", - " 48 Blade_Type 81875 non-null category\n", - " 49 Travel_Controls 81877 non-null category\n", - " 50 Differential_Type 71564 non-null category\n", - " 51 Steering_Controls 71522 non-null category\n", - " 52 saleYear 412698 non-null int64 \n", - " 53 saleMonth 412698 non-null int64 \n", - " 54 saleDay 412698 non-null int64 \n", - " 55 saleDayofweek 412698 non-null int64 \n", - " 56 saleDayofyear 412698 non-null int64 \n", - " 57 auctioneerID_is_missing 412698 non-null int64 \n", - " 58 MachineHoursCurrentMeter_is_missing 412698 non-null int64 \n", - "dtypes: category(44), float64(3), int64(12)\n", - "memory usage: 66.4 MB\n" - ] - } - ], - "source": [ - "# Read in df_tmp from parquet format\n", - "df_tmp = pd.read_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories.parquet\",\n", - " engine=\"auto\")\n", - "\n", - "# Using parquet format, datatypes are preserved\n", - "df_tmp.info()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice! Looks like using the `parquet` format preserved all of our datatypes.\n", - "\n", - "For more on the `parquet` and `feather` formats, be sure to check out the [pandas IO (input/output) documentation](https://pandas.pydata.org/docs/user_guide/io.html#parquet)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Finding and filling missing values\n", - "\n", - "Let's remind ourselves of the missing values by getting the top 20 columns with the most missing values." - ] - }, - { - "cell_type": "code", - "execution_count": 146, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Blade_Width 386715\n", - "Engine_Horsepower 386715\n", - "Tip_Control 386715\n", - "Pushblock 386715\n", - "Blade_Extension 386715\n", - "Enclosure_Type 386715\n", - "Scarifier 386704\n", - "Hydraulics_Flow 367823\n", - "Grouser_Tracks 367823\n", - "Coupler_System 367724\n", - "fiModelSeries 354031\n", - "Steering_Controls 341176\n", - "Differential_Type 341134\n", - "UsageBand 339028\n", - "fiModelDescriptor 337882\n", - "Backhoe_Mounting 331986\n", - "Stick 331602\n", - "Turbocharged 331602\n", - "Pad_Type 331602\n", - "Blade_Type 330823\n", - "dtype: int64" - ] - }, - "execution_count": 146, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check missing values\n", - "df_tmp.isna().sum().sort_values(ascending=False)[:20]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, it seems like there are a fair few columns with missing values and there are several datatypes across these columns (numerical, categorical).\n", - "\n", - "How about we break the problem down and work on filling each datatype separately?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Filling missing numerical values\n", - "\n", - "There's no set way to fill missing values in your dataset.\n", - "\n", - "And unless you're filling the missing samples with newly discovered actual data, every way you fill your dataset's missing values will introduce some sort of noise or bias.
\n", - "\n", - "We'll start by filling the missing numerical values in our dataset.\n", - "\n", - "To do this, we'll first find the numeric datatype columns.\n", - "\n", - "We can do so by looping through the columns in our DataFrame and calling [`pd.api.types.is_numeric_dtype(arr_or_dtype)`](https://pandas.pydata.org/docs/reference/api/pandas.api.types.is_numeric_dtype.html) on them." - ] - }, - { - "cell_type": "code", - "execution_count": 147, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Column dtype: int64 | Example value: [1146304] | Example value dtype: integer\n", - "Column name: SalePrice | Column dtype: float64 | Example value: [13000.] | Example value dtype: floating\n", - "Column name: MachineID | Column dtype: int64 | Example value: [1408211] | Example value dtype: integer\n", - "Column name: ModelID | Column dtype: int64 | Example value: [3856] | Example value dtype: integer\n", - "Column name: datasource | Column dtype: int64 | Example value: [136] | Example value dtype: integer\n", - "Column name: auctioneerID | Column dtype: float64 | Example value: [1.]
| Example value dtype: floating\n", - "Column name: YearMade | Column dtype: int64 | Example value: [2003] | Example value dtype: integer\n", - "Column name: MachineHoursCurrentMeter | Column dtype: float64 | Example value: [nan] | Example value dtype: floating\n", - "Column name: saleYear | Column dtype: int64 | Example value: [2010] | Example value dtype: integer\n", - "Column name: saleMonth | Column dtype: int64 | Example value: [11] | Example value dtype: integer\n", - "Column name: saleDay | Column dtype: int64 | Example value: [3] | Example value dtype: integer\n", - "Column name: saleDayofweek | Column dtype: int64 | Example value: [4] | Example value dtype: integer\n", - "Column name: saleDayofyear | Column dtype: int64 | Example value: [330] | Example value dtype: integer\n" - ] - } - ], - "source": [ - "# Find numeric columns \n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " # Check datatype of target column\n", - " column_datatype = df_tmp[label].dtype.name\n", - "\n", - " # Get random sample from column values\n", - " example_value = content.sample(1).values\n", - "\n", - " # Infer random sample datatype\n", - " example_value_dtype = pd.api.types.infer_dtype(example_value)\n", - " print(f\"Column name: {label} | Column dtype: {column_datatype} | Example value: {example_value} | Example value dtype: {example_value_dtype}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! Looks like we've got a mixture of `int64` and `float64` numerical datatypes.\n", - "\n", - "Now how about we find out which numeric columns are missing values?\n", - "\n", - "We can do so by using `pandas.isnull(obj).sum()` to detect and sum the missing values in a given array-like object (in our case, the data in a target column).\n", - "\n", - "Let's loop through our DataFrame columns, find the numeric datatypes and check if they have any missing values." 
- ] - }, - { - "cell_type": "code", - "execution_count": 148, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Has missing values: False\n", - "Column name: SalePrice | Has missing values: False\n", - "Column name: MachineID | Has missing values: False\n", - "Column name: ModelID | Has missing values: False\n", - "Column name: datasource | Has missing values: False\n", - "Column name: auctioneerID | Has missing values: True\n", - "Column name: YearMade | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter | Has missing values: True\n", - "Column name: saleYear | Has missing values: False\n", - "Column name: saleMonth | Has missing values: False\n", - "Column name: saleDay | Has missing values: False\n", - "Column name: saleDayofweek | Has missing values: False\n", - "Column name: saleDayofyear | Has missing values: False\n" - ] - } - ], - "source": [ - "# Check which numeric columns have null values\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " print(f\"Column name: {label} | Has missing values: {True}\")\n", - " else:\n", - " print(f\"Column name: {label} | Has missing values: {False}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay, it looks like our `auctioneerID` and `MachineHoursCurrentMeter` columns have missing numeric values.\n", - "\n", - "As previously discussed, there are many ways to fill missing values.\n", - "\n", - "For missing numeric values, some potential options are:\n", - "\n", - "| **Method** | **Pros** | **Cons** |\n", - "|-----|-----|-----|\n", - "| **Fill with mean of column** | - Easy to calculate/implement<br>- Retains overall data distribution | - Averages out variation<br>- Affected by outliers (e.g. if one value is much higher/lower than others) |\n", - "| **Fill with median of column** | - Easy to calculate/implement<br>- Robust to outliers<br>- Preserves center of data | - Ignores data distribution shape |\n", - "| **Fill with mode of column** | - Easy to calculate/implement<br>- More useful for categorical-like data | - May not make sense for continuous/numerical data |\n", - "| **Fill with 0 (or another constant)** | - Simple to implement<br>- Useful in certain contexts like counts | - Introduces bias (e.g. if 0 was a value that meant something)<br>- Skews data (e.g. if many missing values, replacing all with 0 makes it look like that's the most common value) |\n", - "| **Forward/Backward fill (use previous/future values to fill future/previous values)** | - Maintains temporal continuity (for time series) | - Assumes data is continuous, which may not be valid |\n", - "| **Use a calculation from other columns** | - Takes existing information and reinterprets it | - Can result in unlikely outputs if calculations are not continuous | \n", - "| **Interpolate (e.g. like dragging a cell in Excel/Google Sheets)** | - Captures trends<br>- Suitable for ordered data | - Can introduce errors<br>- May assume linearity (data continues in a straight line) |\n", - "| **Drop missing values** | - Ensures complete data (only use samples with all information)<br>- Useful for small datasets | - Can result in data loss (e.g. if many missing values are scattered across columns, data size can be dramatically reduced)<br>- Reduces dataset size |\n", - "\n", - "Which method you choose will be dataset and problem dependent and will likely require several phases of experimentation to see what works and what doesn't.\n", - "\n", - "For now, we'll fill our missing numeric values with the median of the column being filled.\n", - "\n", - "We'll also add a binary column (0 or 1) with rows reflecting whether or not a value was missing.\n", - "\n", - "For example, `MachineHoursCurrentMeter_is_missing` will be a column with rows which have a value of `0` if that row's `MachineHoursCurrentMeter` column was *not* missing and `1` if it was.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 149, - "metadata": {}, - "outputs": [], - "source": [ - "# Fill missing numeric values with the median of that column\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " \n", - " # Add a binary column which tells if the data was missing or not\n", - " df_tmp[label+\"_is_missing\"] = pd.isnull(content).astype(int) # adds a 0 or 1 value to every row (0 = not missing, 1 = missing)\n", - "\n", - " # Fill missing numeric values with median since it's more robust to outliers than the mean\n", - " df_tmp[label] = content.fillna(content.median())" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Why add a binary column indicating whether the data was missing or not?\n", - "\n", - "We can easily fill all of the missing numeric values in our dataset with the median. \n", - "\n", - "However, a numeric value may be missing for a reason. \n", - "\n", - "Adding a binary column which indicates whether the value was missing or not helps to retain this information. It also means we can inspect these rows later on." - ] - }, - { - "cell_type": "code", - "execution_count": 150, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "185062 1559272 39000.0 480814 3542 132 1.0 \n", - "32362 1256542 22000.0 1451409 3247 132 21.0 \n", - "100005 1707315 21000.0 1439450 15883 132 1.0 \n", - "131755 1271593 10500.0 1380849 3112 132 9.0 \n", - "19690 1314625 82500.0 898247 4157 132 1.0 \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "185062 2002 0.0 NaN 420D ... \n", - "32362 1985 0.0 NaN 880D ... \n", - "100005 1993 0.0 NaN 217S ... \n", - "131755 1000 0.0 NaN 1845C ... \n", - "19690 1988 0.0 NaN D9N ... \n", - "\n", - " Travel_Controls Differential_Type Steering_Controls saleYear \\\n", - "185062 NaN NaN NaN 2005 \n", - "32362 NaN NaN NaN 1994 \n", - "100005 NaN NaN NaN 2000 \n", - "131755 NaN NaN NaN 2002 \n", - "19690 None or Unspecified NaN NaN 1992 \n", - "\n", - " saleMonth saleDay saleDayofweek saleDayofyear auctioneerID_is_missing \\\n", - "185062 6 16 3 167 0 \n", - "32362 10 10 0 283 0 \n", - "100005 9 29 4 273 0 \n", - "131755 6 21 4 172 0 \n", - "19690 12 14 0 349 0 \n", - "\n", - " MachineHoursCurrentMeter_is_missing \n", - "185062 1 \n", - "32362 1 \n", - "100005 1 \n", - "131755 1 \n", - "19690 1 \n", - "\n", - "[5 rows x 59 columns]" - ] - }, - "execution_count": 150, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Show rows where MachineHoursCurrentMeter_is_missing == 1\n", - "df_tmp[df_tmp[\"MachineHoursCurrentMeter_is_missing\"] == 1].sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Missing numeric values filled!\n", - "\n", - "How about we check again whether or not the numeric columns have missing values?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 151, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Column name: SalesID | Has missing values: False\n", - "Column name: SalePrice | Has missing values: False\n", - "Column name: MachineID | Has missing values: False\n", - "Column name: ModelID | Has missing values: False\n", - "Column name: datasource | Has missing values: False\n", - "Column name: auctioneerID | Has missing values: False\n", - "Column name: YearMade | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter | Has missing values: False\n", - "Column name: saleYear | Has missing values: False\n", - "Column name: saleMonth | Has missing values: False\n", - "Column name: saleDay | Has missing values: False\n", - "Column name: saleDayofweek | Has missing values: False\n", - "Column name: saleDayofyear | Has missing values: False\n", - "Column name: auctioneerID_is_missing | Has missing values: False\n", - "Column name: MachineHoursCurrentMeter_is_missing | Has missing values: False\n" - ] - } - ], - "source": [ - "# Check for which numeric columns have null values\n", - "for label, content in df_tmp.items():\n", - " if pd.api.types.is_numeric_dtype(content):\n", - " if pd.isnull(content).sum():\n", - " print(f\"Column name: {label} | Has missing values: {True}\")\n", - " else:\n", - " print(f\"Column name: {label} | Has missing values: {False}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woohoo! Numeric missing values filled!\n", - "\n", - "And thanks to our binary `_is_missing` columns, we can even check how many were missing." 
- ] - }, - { - "cell_type": "code", - "execution_count": 152, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "auctioneerID_is_missing\n", - "0 392562\n", - "1 20136\n", - "Name: count, dtype: int64" - ] - }, - "execution_count": 152, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Check to see how many examples in the auctioneerID column were missing\n", - "df_tmp.auctioneerID_is_missing.value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Filling missing categorical values \n", - "\n", - "Now that we've filled the missing numeric values, we'll do the same with the categorical values whilst ensuring that they are all numerical too.\n", - "\n", - "Let's first investigate the columns which *aren't* numeric (we've already worked with the numeric ones).
" - ] - }, - { - "cell_type": "code", - "execution_count": 169, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Columns which are not numeric:\n", - "Column name: UsageBand | Column dtype: category\n", - "Column name: fiModelDesc | Column dtype: category\n", - "Column name: fiBaseModel | Column dtype: category\n", - "Column name: fiSecondaryDesc | Column dtype: category\n", - "Column name: fiModelSeries | Column dtype: category\n", - "Column name: fiModelDescriptor | Column dtype: category\n", - "Column name: ProductSize | Column dtype: category\n", - "Column name: fiProductClassDesc | Column dtype: category\n", - "Column name: state | Column dtype: category\n", - "Column name: ProductGroup | Column dtype: category\n", - "Column name: ProductGroupDesc | Column dtype: category\n", - "Column name: Drive_System | Column dtype: category\n", - "Column name: Enclosure | Column dtype: category\n", - "Column name: Forks | Column dtype: category\n", - "Column name: Pad_Type | Column dtype: category\n", - "Column name: Ride_Control | Column dtype: category\n", - "Column name: Stick | Column dtype: category\n", - "Column name: Transmission | Column dtype: category\n", - "Column name: Turbocharged | Column dtype: category\n", - "Column name: Blade_Extension | Column dtype: category\n", - "Column name: Blade_Width | Column dtype: category\n", - "Column name: Enclosure_Type | Column dtype: category\n", - "Column name: Engine_Horsepower | Column dtype: category\n", - "Column name: Hydraulics | Column dtype: category\n", - "Column name: Pushblock | Column dtype: category\n", - "Column name: Ripper | Column dtype: category\n", - "Column name: Scarifier | Column dtype: category\n", - "Column name: Tip_Control | Column dtype: category\n", - "Column name: Tire_Size | Column dtype: category\n", - "Column name: Coupler | Column dtype: category\n", - "Column name: Coupler_System | Column dtype: category\n", - "Column name: 
Grouser_Tracks | Column dtype: category\n", - "Column name: Hydraulics_Flow | Column dtype: category\n", - "Column name: Track_Type | Column dtype: category\n", - "Column name: Undercarriage_Pad_Width | Column dtype: category\n", - "Column name: Stick_Length | Column dtype: category\n", - "Column name: Thumb | Column dtype: category\n", - "Column name: Pattern_Changer | Column dtype: category\n", - "Column name: Grouser_Type | Column dtype: category\n", - "Column name: Backhoe_Mounting | Column dtype: category\n", - "Column name: Blade_Type | Column dtype: category\n", - "Column name: Travel_Controls | Column dtype: category\n", - "Column name: Differential_Type | Column dtype: category\n", - "Column name: Steering_Controls | Column dtype: category\n" - ] - } - ], - "source": [ - "# Check columns which aren't numeric\n", - "print(f\"[INFO] Columns which are not numeric:\")\n", - "for label, content in df_tmp.items():\n", - " if not pd.api.types.is_numeric_dtype(content):\n", - " print(f\"Column name: {label} | Column dtype: {df_tmp[label].dtype.name}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Okay, we've got plenty of category type columns.\n", - "\n", - "Let's now write some code to fill the missing categorical values as well as ensure they are numerical (non-string). \n", - "\n", - "To do so, we'll:\n", - "\n", - "1. Create a blank column to category dictionary, we'll use this to store categorical value names (e.g. their string name) as well as their categorical code. We'll end with a dictionary of dictionaries in the form `{\"column_name\": {category_code: \"category_value\"...}...}`.\n", - "2. Loop through the items in the DataFrame.\n", - "3. Check if the column is numeric or not.\n", - "4. Add a binary column in the form `ORIGINAL_COLUMN_NAME_is_missing` with a `0` or `1` value for if the row had a missing value.\n", - "5. 
Ensure the column values are in the `pd.Categorical` datatype and get their category codes with `pd.Categorical.codes` (we'll add `1` to these values since pandas assigns `-1` to `NaN` values by default and we'd like to use `0` instead).\n", - "6. Turn the column category codes and column categories from 5 into a dictionary with Python's [`dict(zip(category_codes, category_names))`](https://docs.python.org/3.3/library/functions.html#zip) and save this to the blank dictionary from 1 with the target column name as key.\n", - "7. Set the target column value to the numerical category values from 5.\n", - "\n", - "Phew!\n", - "\n", - "That's a fair few steps but nothing we can't handle.\n", - "\n", - "Let's do it!" - ] - }, - { - "cell_type": "code", - "execution_count": 171, - "metadata": {}, - "outputs": [], - "source": [ - "# 1. Create a dictionary to store column to category values (e.g. we turn our category types into numbers but we keep a record so we can go back)\n", - "column_to_category_dict = {} \n", - "\n", - "# 2. Turn categorical variables into numbers\n", - "for label, content in df_tmp.items():\n", - "\n", - " # 3. Check columns which *aren't* numeric\n", - " if not pd.api.types.is_numeric_dtype(content):\n", - "\n", - " # 4. Add binary column to indicate whether sample had missing value\n", - " df_tmp[label+\"_is_missing\"] = pd.isnull(content).astype(int)\n", - "\n", - " # 5. Ensure content is categorical and get its category codes\n", - " content_categories = pd.Categorical(content)\n", - " content_category_codes = content_categories.codes + 1 # prevents -1 (the default for NaN values) from being used for missing values (we'll treat missing values as 0)\n", - "\n", - " # 6. Add column key to dictionary with code: category mapping per column\n", - " column_to_category_dict[label] = dict(zip(content_category_codes, content_categories))\n", - " \n", - " # 7.
Set the column to the numerical values (the category code value) \n", - " df_tmp[label] = content_category_codes " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ho ho! No errors!\n", - "\n", - "Let's check out a few random samples of our DataFrame." - ] - }, - { - "cell_type": "code", - "execution_count": 174, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID \\\n", - "157621 1687647 62500.0 1451688 11630 132 7.0 \n", - "24770 1491163 34500.0 1177842 4107 132 99.0 \n", - "351574 2432926 11500.0 1256467 36033 136 1.0 \n", - "396421 6320345 51000.0 1877299 5889 149 1.0 \n", - "378776 2655658 16000.0 1852471 4897 149 1.0 \n", - "\n", - " YearMade MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "157621 1996 0.0 0 2769 ... \n", - "24770 1986 0.0 0 2178 ... \n", - "351574 2005 1712.0 3 4259 ... \n", - "396421 2006 0.0 0 3374 ... \n", - "378776 1996 4436.0 2 4615 ... \n", - "\n", - " Undercarriage_Pad_Width_is_missing Stick_Length_is_missing \\\n", - "157621 0 0 \n", - "24770 1 1 \n", - "351574 1 1 \n", - "396421 1 1 \n", - "378776 1 1 \n", - "\n", - " Thumb_is_missing Pattern_Changer_is_missing Grouser_Type_is_missing \\\n", - "157621 0 0 0 \n", - "24770 1 1 1 \n", - "351574 1 1 1 \n", - "396421 1 1 1 \n", - "378776 1 1 1 \n", - "\n", - " Backhoe_Mounting_is_missing Blade_Type_is_missing \\\n", - "157621 1 1 \n", - "24770 0 0 \n", - "351574 1 1 \n", - "396421 1 1 \n", - "378776 0 0 \n", - "\n", - " Travel_Controls_is_missing Differential_Type_is_missing \\\n", - "157621 1 1 \n", - "24770 0 1 \n", - "351574 1 1 \n", - "396421 1 0 \n", - "378776 0 1 \n", - "\n", - " Steering_Controls_is_missing \n", - "157621 1 \n", - "24770 1 \n", - "351574 1 \n", - "396421 0 \n", - "378776 1 \n", - "\n", - "[5 rows x 103 columns]" - ] - }, - "execution_count": 174, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_tmp.sample(5)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful! Looks like our data is all in numerical form.\n", - "\n", - "How about we investigate an item from our `column_to_category_dict`?\n", - "\n", - "This will show the mapping from numerical value to category (most likely a string) value." 
- ] - }, - { - "cell_type": "code", - "execution_count": 184, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "0 -> nan\n", - "1 -> High\n", - "2 -> Low\n", - "3 -> Medium\n" - ] - } - ], - "source": [ - "# Check the UsageBand (measure of bulldozer usage)\n", - "for key, value in sorted(column_to_category_dict[\"UsageBand\"].items()): # note: calling sorted() on dictionary.items() sorts the dictionary by keys \n", - " print(f\"{key} -> {value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> **Note:** Categorical values do not necessarily have order. They are strictly a mapping from number to value. In this case, our categorical values are mapped in numerical order. If you feel that the order of a value may influence a model in a negative way (e.g. `1 -> High` is *lower* than `3 -> Medium` but should be *higher*), you may want to look into ordering the values in a particular way or using a different numerical encoding technique such as [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).\n", - "\n", - "And we can do the same for the `state` column values." - ] - }, - { - "cell_type": "code", - "execution_count": 182, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "1 -> Alabama\n", - "2 -> Alaska\n", - "3 -> Arizona\n", - "4 -> Arkansas\n", - "5 -> California\n", - "6 -> Colorado\n", - "7 -> Connecticut\n", - "8 -> Delaware\n", - "9 -> Florida\n", - "10 -> Georgia\n" - ] - } - ], - "source": [ - "# Check the first 10 state column values\n", - "for key, value in sorted(column_to_category_dict[\"state\"].items())[:10]:\n", - " print(f\"{key} -> {value}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful!\n", - "\n", - "How about we check to see all of the missing values have been filled?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 188, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Total missing values: 0 - Woohoo! Let's build a model!\n" - ] - } - ], - "source": [ - "# Check total number of missing values\n", - "total_missing_values = df_tmp.isna().sum().sum()\n", - "\n", - "if total_missing_values == 0:\n", - " print(f\"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!\")\n", - "else:\n", - " print(f\"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TK - Saving our preprocessed data (part 2)\n", - "\n", - "One more step before we train a new model!\n", - "\n", - "Let's save our work so far so we can re-import our preprocessed dataset later if we want to.\n", - "\n", - "We'll save it to the `parquet` format again, this time with a suffix to show we've filled the missing values." - ] - }, - { - "cell_type": "code", - "execution_count": 190, - "metadata": {}, - "outputs": [], - "source": [ - "# Save preprocessed data with object values as categories as well as missing values filled\n", - "df_tmp.to_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet\",\n", - " engine=\"auto\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And to make sure it worked, we can re-import it." - ] - }, - { - "cell_type": "code", - "execution_count": 191, - "metadata": {}, - "outputs": [], - "source": [ - "# Read in preprocessed dataset\n", - "df_tmp = pd.read_parquet(path=\"../data/bluebook-for-bulldozers/TrainAndValid_object_values_as_categories_and_missing_values_filled.parquet\",\n", - " engine=\"auto\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Does it have any missing values?"
- ] - }, - { - "cell_type": "code", - "execution_count": 192, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Total missing values: 0 - Woohoo! Let's build a model!\n" - ] - } - ], - "source": [ - "# Check total number of missing values\n", - "total_missing_values = df_tmp.isna().sum().sum()\n", - "\n", - "if total_missing_values == 0:\n", - " print(f\"[INFO] Total missing values: {total_missing_values} - Woohoo! Let's build a model!\")\n", - "else:\n", - " print(f\"[INFO] Uh ohh... total missing values: {total_missing_values} - Perhaps we might have to retrace our steps to fill the values?\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Checkpoint reached!\n", - "\n", - "We've turned all of our data into numbers and filled the missing values, so it's time to try fitting a model to it again." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TK - Fitting a machine learning model to our preprocessed data\n", - "\n", - "UPTOHERE \n", - "- fitting a model to the data... (could fit to a subset for quicker times...)\n", - "- what's wrong with it? (fitting and evaluating on the same data)\n", - "\n", - "Now that all of our data is numeric and there are no missing values, we should be able to fit a machine learning model to it!\n", - "\n", - "Let's reinstantiate our trusty [`sklearn.ensemble.RandomForestRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) model.\n", - "\n", - "Since our dataset has a substantial number of rows (~400k+), let's first make sure the model will work on a smaller sample of 1000 or so.\n", - "\n", - "> **Note:** It's common practice on machine learning problems to see if your experiments will work on smaller scale problems (e.g. smaller amounts of data) before scaling them up to the full dataset. 
This practice enables you to try many different kinds of experiments with faster runtimes. The benefit of this is that you can figure out what doesn't work before spending more time on what does.\n", - "\n", - "Our `X` values (features) will be every column except the `\"SalePrice\"` column.\n", - "\n", - "And our `y` values (labels) will be the entirety of the `\"SalePrice\"` column.\n", - "\n", - "\n", - "We'll time how long our smaller experiment takes using the [magic function `%%time`](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html) and placing it at the top of the notebook cell.\n", - "\n", - "> **Note:** You can find out more about the `%%time` magic command by typing `%%time?` (note the question mark on the end) in a notebook cell.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 204, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 1.01 s, sys: 68.6 ms, total: 1.07 s\n", - "Wall time: 387 ms\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(n_jobs=-1)" - ] - }, - "execution_count": 204, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Sample 1000 samples with random state 42 for reproducibility\n", - "df_tmp_sample_1k = df_tmp.sample(n=1000, random_state=42)\n", - "\n", - "# Instantiate a model\n", - "model = RandomForestRegressor(n_jobs=-1) # use -1 to utilise all available processors\n", - "\n", - "# Create features and labels\n", - "X_sample_1k = df_tmp_sample_1k.drop(\"SalePrice\", axis=1) # use all columns except SalePrice as X values\n", - "y_sample_1k = df_tmp_sample_1k[\"SalePrice\"] # use SalePrice as y values (target variable)\n", - "\n", - "# Fit the model to the sample data\n", - "model.fit(X=X_sample_1k, \n", - " y=y_sample_1k) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Woah! It looks like things worked!\n", - "\n", - "And quite quick too (since we're only using a relatively small number of rows).\n", - "\n", - "How about we score our model?\n", - "\n", - "We can do so using the built-in method `score()`. By default, `sklearn.ensemble.RandomForestRegressor` uses [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) ($R^2$ or R-squared) as the evaluation metric (higher is better, with a score of 1.0 being perfect)." 
- ] - }, - { - "cell_type": "code", - "execution_count": 205, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model score on 1000 samples: 0.9552290424939952\n" - ] - } - ], - "source": [ - "# Evaluate the model\n", - "model_sample_1k_score = model.score(X=X_sample_1k,\n", - " y=y_sample_1k)\n", - "\n", - "print(f\"[INFO] Model score on {len(df_tmp_sample_1k)} samples: {model_sample_1k_score}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wow, it looks like our model got a pretty good score on only 1000 samples (the best possible score it could achieve would've been 1.0). \n", - "\n", - "How about we try our model on the whole dataset?" - ] - }, - { - "cell_type": "code", - "execution_count": 206, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 9min 5s, sys: 3.05 s, total: 9min 8s\n", - "Wall time: 1min 11s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(n_jobs=-1)" - ] - }, - "execution_count": 206, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "\n", - "# Instantiate model\n", - "model = RandomForestRegressor(n_jobs=-1) # note: this could take quite a while depending on your machine (it took ~1.5 minutes on my MacBook Pro M1 Pro with 10 cores)\n", - "\n", - "# Create features and labels with entire dataset\n", - "X_all = df_tmp.drop(\"SalePrice\", axis=1)\n", - "y_all = df_tmp[\"SalePrice\"]\n", - "\n", - "# Fit the model\n", - "model.fit(X=X_all, \n", - " y=y_all)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, that took quite a bit longer than fitting on 1000 samples (but that's to be expected, as many more calculations had to be made).\n", - "\n", - "There's a reason we used `n_jobs=-1` too.\n", - "\n", - "If we stuck with the default of `n_jobs=None` (the same as `n_jobs=1`), it would've taken much longer.\n", - "\n", - "| Configuration (MacBook Pro M1 Pro, 10 Cores) | CPU Times (User) | CPU Times (Sys) | CPU Times (Total) | Wall Time |\n", - "|-----|-----|-----|-----|-----|\n", - "| `n_jobs=-1` (all cores) | 9min 14s | 3.85s | 9min 18s | 1min 15s |\n", - "| `n_jobs=None` (default) | 7min 14s | 1.75s | 7min 16s | 7min 25s |\n", - "\n", - "And as we've discussed many times, one of the main goals when starting a machine learning project is to reduce your time between experiments.\n", - "\n", - "How about we score the model trained on all of the data?"
- ] - }, - { - "cell_type": "code", - "execution_count": 207, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "[INFO] Model score on 412698 samples: 0.98752722160166\n" - ] - } - ], - "source": [ - "# Evaluate the model\n", - "model_sample_all_score = model.score(X=X_all,\n", - " y=y_all)\n", - "\n", - "print(f\"[INFO] Model score on {len(df_tmp)} samples: {model_sample_all_score}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "An even better score!\n", - "\n", - "Oh wait...\n", - "\n", - "Oh no...\n", - "\n", - "I think we've got an error... (you might've noticed it already)\n", - "\n", - "Why might this metric be unreliable?\n", - "\n", - "Hint: Compare the data we trained on versus the data we evaluated on." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### TK - A big (but fixable) mistake \n", - "\n", - "One of the hard things about bugs in machine learning projects is that they are often silent.\n", - "\n", - "For example, our model seems to have fit the data with no issues and then evaluated with a good score.\n", - "\n", - "So what's wrong?\n", - "\n", - "It seems we've stumbled across one of the most common bugs in machine learning and that's **data leakage** (data from the training set leaking into the validation/testing sets).\n", - "\n", - "We've evaluated our model on the same data it was trained on.\n", - "\n", - "This isn't the model's fault either.\n", - "\n", - "It's our fault.\n", - "\n", - "Right back at the start we imported a file called `TrainAndValid.csv`, this contains both the training and validation data.\n", - "\n", - "And while we preprocessed it to make sure there were no missing values and the samples were all numeric, we never split the data into separate training and validation splits.\n", - "\n", - "The right workflow would've been to train the model on the training split and then evaluate it on the *unseen* validation 
split.\n", - "\n", - "Our evaluation scores above are quite good but they can't necessarily be trusted to be replicated on unseen data (data in the real world) because they've been obtained by evaluating the model on data it has already seen. \n", - "\n", - "This would be the equivalent of a final exam at university containing all of the same questions as the practice exam without any changes. \n", - "\n", - "Not to worry, we can fix this!\n", - "\n", - "How?\n", - "\n", - "We can import the training and validation datasets separately via `Train.csv` and `Valid.csv` respectively.\n", - "\n", - "Or we could import `TrainAndValid.csv` and perform the appropriate splits according to the original [Kaggle competition page](https://www.kaggle.com/c/bluebook-for-bulldozers/data) (training data includes all samples prior to 2012 and validation data includes samples from January 1 2012 to April 30 2012).\n", - "\n", - "With either method, we'll have to perform the same preprocessing steps we've done so far.\n", - "\n", - "However, because the validation data is supposed to remain *unseen* data, we'll only use information from the training set to preprocess the validation set (and not mix the two). \n", - "\n", - "We'll work on this in the next section.\n", - "\n", - "The takeaway?\n", - "\n", - "Always (if possible) **create appropriate data splits at the start of a project**.\n", - "\n", - "Because it's one thing to train a machine learning model but if you can't evaluate it properly (on unseen data), how can you know how it'll perform (or may perform) in the real world on new and unseen data?" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. 
TK - Splitting data into train/valid sets\n", - "\n", - "UPTOHERE \n", - "* TK - trying to fit a model forced us to prepare our dataset in a way that it could be used with a model but caused us to make the mistake of mixing the training/validation data (perhaps this was on purpose...)\n", - "* TK - can just import the Train/Valid CSVs separately and fill with Scikit-Learn imputers\n", - "\n", - "* The good news is, we get to practice preprocessing our data again. This time with separate training and validation splits. Last time we used pandas to ensure our data was all numeric and had no missing values. But using pandas in this way can be a bit of an issue with larger scale datasets or when new data is introduced. How about this time we use Scikit-Learn and make a reproducible pipeline for our data preprocessing needs?\n", - "\n", - "* Next steps:\n", - "- import train/validation data separately\n", - "- create scikit-learn data filling pipeline for fitting to training data (turn all data numeric + fill missing values) \n", - "- apply this preprocessing pipeline to the validation data (e.g. `fit_transform` on train data -> only `transform` on validation data)\n", - "- eval + improve on validation data\n", - "\n", - "* We imported the `TrainAndValid.csv` and filled missing values/evaluated on it already " - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
" - ], - "text/plain": [ - " SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1646770 9500.0 1126363 8434 132 18.0 1974 \n", - "1 1821514 14000.0 1194089 10150 132 99.0 1980 \n", - "2 1505138 50000.0 1473654 4139 132 99.0 1978 \n", - "3 1671174 16000.0 1327630 8591 132 99.0 1980 \n", - "4 1329056 22000.0 1336053 4089 132 99.0 1984 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc ... \\\n", - "0 0.0 0 4593 ... \n", - "1 0.0 0 1820 ... \n", - "2 0.0 0 2348 ... \n", - "3 0.0 0 1819 ... \n", - "4 0.0 0 2119 ... \n", - "\n", - " Undercarriage_Pad_Width_is_missing Stick_Length_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 True True \n", - "3 True True \n", - "4 True True \n", - "\n", - " Thumb_is_missing Pattern_Changer_is_missing Grouser_Type_is_missing \\\n", - "0 True True True \n", - "1 True True True \n", - "2 True True True \n", - "3 True True True \n", - "4 True True True \n", - "\n", - " Backhoe_Mounting_is_missing Blade_Type_is_missing \\\n", - "0 False False \n", - "1 True True \n", - "2 False False \n", - "3 True True \n", - "4 False False \n", - "\n", - " Travel_Controls_is_missing Differential_Type_is_missing \\\n", - "0 False True \n", - "1 True False \n", - "2 False True \n", - "3 True False \n", - "4 False True \n", - "\n", - " Steering_Controls_is_missing \n", - "0 True \n", - "1 False \n", - "2 True \n", - "3 False \n", - "4 True \n", - "\n", - "[5 rows x 103 columns]" - ] - }, - "execution_count": 45, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_tmp.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "According to the [Kaggle data page](https://www.kaggle.com/c/bluebook-for-bulldozers/data), the validation set and test set are split according to dates.\n", - "\n", - "This makes sense since we're working on a time series problem.\n", - "\n", - "E.g. 
using past events to try and predict future events.\n", - "\n", - "Knowing this, randomly splitting our data into train and test sets using something like `train_test_split()` wouldn't work (future sales could end up in the training set).\n", - "\n", - "Instead, we split our data into training, validation and test sets using the date each sale occurred.\n", - "\n", - "In our case:\n", - "* Training = all samples up to the end of 2011\n", - "* Valid = all samples from January 1, 2012 - April 30, 2012\n", - "* Test = all samples from May 1, 2012 - November 2012\n", - "\n", - "For more on making good training, validation and test sets, check out the post [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) by Rachel Thomas." - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2009 43849\n", - "2008 39767\n", - "2011 35197\n", - "2010 33390\n", - "2007 32208\n", - "2006 21685\n", - "2005 20463\n", - "2004 19879\n", - "2001 17594\n", - "2000 17415\n", - "2002 17246\n", - "2003 15254\n", - "1998 13046\n", - "1999 12793\n", - "2012 11573\n", - "1997 9785\n", - "1996 8829\n", - "1995 8530\n", - "1994 7929\n", - "1993 6303\n", - "1992 5519\n", - "1991 5109\n", - "1989 4806\n", - "1990 4529\n", - "Name: saleYear, dtype: int64" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_tmp.saleYear.value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "(11573, 401125)" - ] - }, - "execution_count": 47, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Split data into training and validation\n", - "df_val = df_tmp[df_tmp.saleYear == 2012]\n", - "df_train = df_tmp[df_tmp.saleYear != 2012]\n", - "\n", - "len(df_val), len(df_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 48, - "metadata": {}, - "outputs": [ - { - "data": 
{ - "text/plain": [ - "((401125, 102), (401125,), (11573, 102), (11573,))" - ] - }, - "execution_count": 48, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Split data into X & y\n", - "X_train, y_train = df_train.drop(\"SalePrice\", axis=1), df_train.SalePrice\n", - "X_valid, y_valid = df_val.drop(\"SalePrice\", axis=1), df_val.SalePrice\n", - "\n", - "X_train.shape, y_train.shape, X_valid.shape, y_valid.shape" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Building an evaluation function\n", - "\n", - "According to Kaggle, [the evaluation function](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation) used for the Bluebook for Bulldozers competition is root mean squared log error (RMSLE).\n", - "\n", - "**RMSLE**: generally, being off by $10 matters less than being off by 10%; you care more about ratios than absolute differences. **MAE** (mean absolute error) is more about exact differences.\n", - "\n", - "It's important to understand the evaluation metric you're going for.\n", - "\n", - "Since older versions of Scikit-Learn don't have a built-in function for RMSLE, we'll create our own.\n", - "\n", - "We can do this by taking the square root of Scikit-Learn's [mean_squared_log_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_log_error.html#sklearn.metrics.mean_squared_log_error) (MSLE). 
MSLE is the mean squared error (MSE) computed on log-transformed values, that is, on `log(1 + y)` for both the true values and the predictions.\n", - "\n", - "We'll also calculate the MAE and R^2 for fun.\n", - "\n", - "TK - use RMSLE from scikit-learn, see: https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.root_mean_squared_log_error.html#sklearn.metrics.root_mean_squared_log_error " - ] - }, - { - "cell_type": "code", - "execution_count": 49, - "metadata": {}, - "outputs": [], - "source": [ - "# Create evaluation function (the competition uses Root Mean Square Log Error)\n", - "from sklearn.metrics import mean_squared_log_error, mean_absolute_error\n", - "\n", - "# TK - can now use RMSLE from scikit-learn, see: https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.root_mean_squared_log_error.html#sklearn.metrics.root_mean_squared_log_error \n", - "def rmsle(y_test, y_preds):\n", - " return np.sqrt(mean_squared_log_error(y_test, y_preds))\n", - "\n", - "# Create function to evaluate our model\n", - "def show_scores(model):\n", - " train_preds = model.predict(X_train)\n", - " val_preds = model.predict(X_valid)\n", - " scores = {\"Training MAE\": mean_absolute_error(y_train, train_preds),\n", - " \"Valid MAE\": mean_absolute_error(y_valid, val_preds),\n", - " \"Training RMSLE\": rmsle(y_train, train_preds),\n", - " \"Valid RMSLE\": rmsle(y_valid, val_preds),\n", - " \"Training R^2\": model.score(X_train, y_train),\n", - " \"Valid R^2\": model.score(X_valid, y_valid)}\n", - " return scores" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Testing our model on a subset (to tune the hyperparameters)\n", - "\n", - "Retraining an entire model would take far too long to continue experimenting as fast as we want to.\n", - "\n", - "So what we'll do is take a sample of the training set and tune the hyperparameters on that before training a larger model.\n", - "\n", - "If your experiments are taking longer than 10 seconds (give or take how long you have to wait), you should be trying to speed things 
up. You can speed things up by sampling less data or using a faster computer." - ] - }, - { - "cell_type": "code", - "execution_count": 50, - "metadata": {}, - "outputs": [], - "source": [ - "# This takes too long...\n", - "\n", - "# %%time\n", - "# # Retrain a model on training data\n", - "# model.fit(X_train, y_train)\n", - "# show_scores(model)" - ] - }, - { - "cell_type": "code", - "execution_count": 51, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "401125" - ] - }, - "execution_count": 51, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "len(X_train)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Depending on your computer (mine is a MacBook Pro), making calculations on ~400,000 rows may take a while...\n", - "\n", - "Let's alter the number of samples each estimator (tree) in the [`RandomForestRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) sees using the `max_samples` parameter." - ] - }, - { - "cell_type": "code", - "execution_count": 52, - "metadata": {}, - "outputs": [], - "source": [ - "# Change max samples in RandomForestRegressor\n", - "model = RandomForestRegressor(n_jobs=-1,\n", - " max_samples=10000)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Setting `max_samples` to 10000 means every estimator (tree) in our `RandomForestRegressor` (`n_estimators` defaults to 100) will only see 10000 random samples from our DataFrame instead of the entire ~400,000.\n", - "\n", - "In other words, each tree will be looking at 40x fewer samples, which means we'll get faster computation speeds but we should expect our results to worsen (simply because the model has fewer samples to learn patterns from)."
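To put rough numbers on that trade-off, here is a minimal sketch in plain Python (no Scikit-Learn needed). It assumes the figures quoted in this notebook (401,125 training rows and the default of 100 trees) and that each tree draws its bootstrap sample with replacement. The helper `rows_drawn` is made up for illustration and is not part of any library.

```python
import random

# Figures taken from the notebook: 401,125 training rows, 100 trees by default.
n_rows = 401_125
n_estimators = 100
max_samples = 10_000


def rows_drawn(n_trees: int, samples_per_tree: int) -> int:
    """Total number of row draws made while fitting the whole forest (hypothetical helper)."""
    return n_trees * samples_per_tree


full = rows_drawn(n_estimators, n_rows)         # max_samples=None: each tree draws n_rows rows
capped = rows_drawn(n_estimators, max_samples)  # max_samples=10000: each tree draws 10,000 rows
reduction = full / capped

# Illustration only: one tree's bootstrap sample is drawn *with replacement*,
# so the same row index can appear more than once.
rng = random.Random(42)
one_tree_sample = rng.choices(range(n_rows), k=max_samples)

print(f"Row draws with all data:      {full:,}")
print(f"Row draws with max_samples:   {capped:,}")
print(f"Reduction factor:             {reduction:.1f}x")
print(f"Unique rows seen by one tree: {len(set(one_tree_sample)):,} of {n_rows:,}")
```

The reduction factor works out to roughly 40x, which is where the "40x fewer samples" figure comes from. Fit time won't necessarily shrink by exactly that factor, but it should drop substantially.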
- ] - }, - { - "cell_type": "code", - "execution_count": 53, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 32.5 s, sys: 68 ms, total: 32.6 s\n", - "Wall time: 2.31 s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_samples=10000, n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_samples=10000, n_jobs=-1)" - ] - }, - "execution_count": 53, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "# Cutting down the max number of samples each tree can see improves training time\n", - "model.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 54, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': 5567.491987510129,\n", - " 'Valid MAE': 7182.652944785276,\n", - " 'Training RMSLE': 0.25773885441307287,\n", - " 'Valid RMSLE': 0.2940155702000724,\n", - " 'Training R^2': 0.8599287949488105,\n", - " 'Valid R^2': 0.8311803212160184}" - ] - }, - "execution_count": 54, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "show_scores(model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beautiful, that took far less time than the model with all the data.\n", - "\n", - "With this, let's try tuning some hyperparameters." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Hyperparameter tuning with RandomizedSearchCV\n", - "\n", - "You can increase `n_iter` to try more combinations of hyperparameters but in our case, we'll try 20 and see where it gets us.\n", - "\n", - "Remember, we're trying to reduce the amount of time it takes between experiments." - ] - }, - { - "cell_type": "code", - "execution_count": 56, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Fitting 5 folds for each of 20 candidates, totalling 100 fits\n", - "CPU times: user 3min 9s, sys: 3.28 s, total: 3min 12s\n", - "Wall time: 3min 12s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=20,\n",
-       "                   param_distributions={'max_depth': [None, 3, 5, 10],\n",
-       "                                        'max_features': [0.5, 1.0, 'sqrt'],\n",
-       "                                        'max_samples': [10000],\n",
-       "                                        'min_samples_leaf': array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]),\n",
-       "                                        'min_samples_split': array([ 2,  4,  6,  8, 10, 12, 14, 16, 18]),\n",
-       "                                        'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},\n",
-       "                   verbose=True)
" - ], - "text/plain": [ - "RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=20,\n", - " param_distributions={'max_depth': [None, 3, 5, 10],\n", - " 'max_features': [0.5, 1.0, 'sqrt'],\n", - " 'max_samples': [10000],\n", - " 'min_samples_leaf': array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]),\n", - " 'min_samples_split': array([ 2, 4, 6, 8, 10, 12, 14, 16, 18]),\n", - " 'n_estimators': array([10, 20, 30, 40, 50, 60, 70, 80, 90])},\n", - " verbose=True)" - ] - }, - "execution_count": 56, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "from sklearn.model_selection import RandomizedSearchCV\n", - "\n", - "# Different RandomForestRegressor hyperparameters\n", - "rf_grid = {\"n_estimators\": np.arange(10, 100, 10),\n", - " \"max_depth\": [None, 3, 5, 10],\n", - " \"min_samples_split\": np.arange(2, 20, 2),\n", - " \"min_samples_leaf\": np.arange(1, 20, 2),\n", - " \"max_features\": [0.5, 1.0, \"sqrt\"], # Note: \"max_features='auto'\" is equivalent to \"max_features=1.0\", as of Scikit-Learn version 1.1\n", - " \"max_samples\": [10000]}\n", - "\n", - "rs_model = RandomizedSearchCV(RandomForestRegressor(),\n", - " param_distributions=rf_grid,\n", - " n_iter=20,\n", - " cv=5,\n", - " verbose=True)\n", - "\n", - "rs_model.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 57, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'n_estimators': 30,\n", - " 'min_samples_split': 2,\n", - " 'min_samples_leaf': 3,\n", - " 'max_samples': 10000,\n", - " 'max_features': 1.0,\n", - " 'max_depth': None}" - ] - }, - "execution_count": 57, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find the best parameters from the RandomizedSearch \n", - "rs_model.best_params_" - ] - }, - { - "cell_type": "code", - "execution_count": 58, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': 5725.976315602646,\n", - " 'Valid 
MAE': 7294.5625641579545,\n", - " 'Training RMSLE': 0.26348401446113984,\n", - " 'Valid RMSLE': 0.2960739319152716,\n", - " 'Training R^2': 0.8513015231223766,\n", - " 'Valid R^2': 0.82379444657151}" - ] - }, - "execution_count": 58, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Evaluate the RandomizedSearch model\n", - "show_scores(rs_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Train a model with the best parameters\n", - "\n", - "In an earlier run, I tried 100 different combinations of hyperparameters (setting `n_iter` to 100 in `RandomizedSearchCV`) and found the best results came from the ones you see below.\n", - "\n", - "**Note:** This kind of search (`n_iter` = 100) took about 2 hours on my computer, so it's a set-it-and-come-back-later kind of experiment.\n", - "\n", - "We'll instantiate a new model with these discovered hyperparameters and reset `max_samples` back to its original value (`None`)." - ] - }, - { - "cell_type": "code", - "execution_count": 59, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 8min 54s, sys: 471 ms, total: 8min 54s\n", - "Wall time: 35.4 s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_features=0.5, min_samples_split=14, n_estimators=90,\n",
-       "                      n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_features=0.5, min_samples_split=14, n_estimators=90,\n", - " n_jobs=-1)" - ] - }, - "execution_count": 59, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "# Best hyperparameters found by the earlier search\n", - "ideal_model = RandomForestRegressor(n_estimators=90,\n", - " min_samples_leaf=1,\n", - " min_samples_split=14,\n", - " max_features=0.5,\n", - " n_jobs=-1,\n", - " max_samples=None)\n", - "ideal_model.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 60, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': 2930.02086721936,\n", - " 'Valid MAE': 5942.019241517711,\n", - " 'Training RMSLE': 0.1434641550675688,\n", - " 'Valid RMSLE': 0.2451178900180815,\n", - " 'Training R^2': 0.9595925460436532,\n", - " 'Valid R^2': 0.8822734761669947}" - ] - }, - "execution_count": 60, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "show_scores(ideal_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "With these new hyperparameters, as well as using all the samples, we can see an improvement in our model's performance (validation RMSLE drops from about 0.296 to about 0.245).\n", - "\n", - "You can make a faster model by altering some of the hyperparameters, particularly by lowering `n_estimators`, since each additional estimator is another decision tree to train.\n", - "\n", - "However, lowering `n_estimators` or altering other hyperparameters may lead to poorer results." - ] - }, - { - "cell_type": "code", - "execution_count": 61, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CPU times: user 3min 45s, sys: 304 ms, total: 3min 45s\n", - "Wall time: 15.9 s\n" - ] - }, - { - "data": { - "text/html": [ - "
RandomForestRegressor(max_features=0.5, min_samples_leaf=3, n_estimators=40,\n",
-       "                      n_jobs=-1)
" - ], - "text/plain": [ - "RandomForestRegressor(max_features=0.5, min_samples_leaf=3, n_estimators=40,\n", - " n_jobs=-1)" - ] - }, - "execution_count": 61, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "%%time\n", - "# Faster model\n", - "fast_model = RandomForestRegressor(n_estimators=40,\n", - " min_samples_leaf=3,\n", - " max_features=0.5,\n", - " n_jobs=-1)\n", - "fast_model.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 62, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'Training MAE': 2548.9560499523304,\n", - " 'Valid MAE': 5923.285839919034,\n", - " 'Training RMSLE': 0.12974340722269298,\n", - " 'Valid RMSLE': 0.2440897497981412,\n", - " 'Training R^2': 0.9670593150524459,\n", - " 'Valid R^2': 0.8818131128042139}" - ] - }, - "execution_count": 62, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "show_scores(fast_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Make predictions on test data\n", - "\n", - "Now that we've got a trained model, it's time to make predictions on the test data.\n", - "\n", - "Remember what we've done.\n", - "\n", - "Our model is trained on data prior to 2011. However, the test data is from May 1, 2012 to November 2012.\n", - "\n", - "So we're trying to use the patterns our model has learned from the training data to predict the sale price of bulldozers it has never seen before, whose characteristics are assumed to be similar to those in the training data." - ] - }, - { - "cell_type": "code", - "execution_count": 63, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1227829 1006309 3168 121 3 1999 \n", - "1 1227844 1022817 7271 121 3 1000 \n", - "2 1227847 1031560 22805 121 3 2004 \n", - "3 1227848 56204 1269 121 3 2006 \n", - "4 1227863 1053887 22312 121 3 2005 \n", - "\n", - " MachineHoursCurrentMeter UsageBand saledate fiModelDesc ... \\\n", - "0 3688.0 Low 2012-05-03 580G ... \n", - "1 28555.0 High 2012-05-10 936 ... \n", - "2 6038.0 Medium 2012-05-10 EC210BLC ... \n", - "3 8940.0 High 2012-05-10 330CL ... \n", - "4 2286.0 Low 2012-05-10 650K ... \n", - "\n", - " Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN \n", - "2 None or Unspecified 9' 6\" Manual None or Unspecified \n", - "3 None or Unspecified None or Unspecified Manual Yes \n", - "4 NaN NaN NaN NaN \n", - "\n", - " Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls \\\n", - "0 NaN NaN NaN NaN \n", - "1 NaN NaN NaN NaN \n", - "2 Double NaN NaN NaN \n", - "3 Triple NaN NaN NaN \n", - "4 NaN None or Unspecified PAT None or Unspecified \n", - "\n", - " Differential_Type Steering_Controls \n", - "0 NaN NaN \n", - "1 Standard Conventional \n", - "2 NaN NaN \n", - "3 NaN NaN \n", - "4 NaN NaN \n", - "\n", - "[5 rows x 52 columns]" - ] - }, - "execution_count": 63, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_test = pd.read_csv(\"../data/bluebook-for-bulldozers/Test.csv\",\n", - " parse_dates=[\"saledate\"])\n", - "df_test.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 64, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "The feature names should match those that were passed during fit.\nFeature names unseen at fit time:\n- saledate\nFeature names seen at fit time, yet now missing:\n- Backhoe_Mounting_is_missing\n- Blade_Extension_is_missing\n- Blade_Type_is_missing\n- Blade_Width_is_missing\n- Coupler_System_is_missing\n- 
...\n", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/home/daniel/code/zero-to-mastery-ml/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression.ipynb Cell 93\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[39m# Let's see how the model goes predicting on the test data\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m model\u001b[39m.\u001b[39;49mpredict(df_test)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py:984\u001b[0m, in \u001b[0;36mForestRegressor.predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 982\u001b[0m check_is_fitted(\u001b[39mself\u001b[39m)\n\u001b[1;32m 983\u001b[0m \u001b[39m# Check data\u001b[39;00m\n\u001b[0;32m--> 984\u001b[0m X \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_X_predict(X)\n\u001b[1;32m 986\u001b[0m \u001b[39m# Assign chunk of trees to jobs\u001b[39;00m\n\u001b[1;32m 987\u001b[0m n_jobs, _, _ \u001b[39m=\u001b[39m _partition_estimators(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_estimators, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_jobs)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py:599\u001b[0m, in \u001b[0;36mBaseForest._validate_X_predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 596\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 597\u001b[0m \u001b[39mValidate X whenever one tries to predict, apply, predict_proba.\"\"\"\u001b[39;00m\n\u001b[1;32m 598\u001b[0m check_is_fitted(\u001b[39mself\u001b[39m)\n\u001b[0;32m--> 599\u001b[0m X \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_data(X, dtype\u001b[39m=\u001b[39;49mDTYPE, 
accept_sparse\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mcsr\u001b[39;49m\u001b[39m\"\u001b[39;49m, reset\u001b[39m=\u001b[39;49m\u001b[39mFalse\u001b[39;49;00m)\n\u001b[1;32m 600\u001b[0m \u001b[39mif\u001b[39;00m issparse(X) \u001b[39mand\u001b[39;00m (X\u001b[39m.\u001b[39mindices\u001b[39m.\u001b[39mdtype \u001b[39m!=\u001b[39m np\u001b[39m.\u001b[39mintc \u001b[39mor\u001b[39;00m X\u001b[39m.\u001b[39mindptr\u001b[39m.\u001b[39mdtype \u001b[39m!=\u001b[39m np\u001b[39m.\u001b[39mintc):\n\u001b[1;32m 601\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mNo support for np.int64 index based sparse matrices\u001b[39m\u001b[39m\"\u001b[39m)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/base.py:579\u001b[0m, in \u001b[0;36mBaseEstimator._validate_data\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 508\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_validate_data\u001b[39m(\n\u001b[1;32m 509\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 510\u001b[0m X\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mno_validation\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 515\u001b[0m \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mcheck_params,\n\u001b[1;32m 516\u001b[0m ):\n\u001b[1;32m 517\u001b[0m \u001b[39m\"\"\"Validate input data and set or check the `n_features_in_` attribute.\u001b[39;00m\n\u001b[1;32m 518\u001b[0m \n\u001b[1;32m 519\u001b[0m \u001b[39m Parameters\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 577\u001b[0m \u001b[39m validated.\u001b[39;00m\n\u001b[1;32m 578\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 579\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_check_feature_names(X, reset\u001b[39m=\u001b[39;49mreset)\n\u001b[1;32m 581\u001b[0m \u001b[39mif\u001b[39;00m y \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m 
\u001b[39mand\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_tags()[\u001b[39m\"\u001b[39m\u001b[39mrequires_y\u001b[39m\u001b[39m\"\u001b[39m]:\n\u001b[1;32m 582\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 583\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mThis \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__class__\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m estimator \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 584\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mrequires y to be passed, but the target y is None.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 585\u001b[0m )\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/base.py:506\u001b[0m, in \u001b[0;36mBaseEstimator._check_feature_names\u001b[0;34m(self, X, reset)\u001b[0m\n\u001b[1;32m 501\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m missing_names \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m unexpected_names:\n\u001b[1;32m 502\u001b[0m message \u001b[39m+\u001b[39m\u001b[39m=\u001b[39m (\n\u001b[1;32m 503\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mFeature names must be in the same order as they were in fit.\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[1;32m 504\u001b[0m )\n\u001b[0;32m--> 506\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(message)\n", - "\u001b[0;31mValueError\u001b[0m: The feature names should match those that were passed during fit.\nFeature names unseen at fit time:\n- saledate\nFeature names seen at fit time, yet now missing:\n- Backhoe_Mounting_is_missing\n- Blade_Extension_is_missing\n- Blade_Type_is_missing\n- Blade_Width_is_missing\n- Coupler_System_is_missing\n- ...\n" - ] - } - ], - "source": [ - "# Let's see how the model goes predicting on the test data\n", - "model.predict(df_test)" - ] - }, - { - "cell_type": "markdown", 
- "metadata": {}, - "source": [ - "Ahhh... the test data isn't in the same format as our other data, so we have to fix it. Let's create a function to preprocess our data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Preprocessing the test data\n", - "\n", - "Our model has been trained on data preprocessed in a particular way (dates split into features, missing values filled, categories turned into numbers).\n", - "\n", - "This means that to make predictions on the test data, we need to apply the same preprocessing steps to it that we applied to the training data.\n", - "\n", - "Remember: Whatever you do to the training data, you have to do to the test data.\n", - "\n", - "Let's create a function for doing so (by copying the preprocessing steps we used above)." - ] - }, - { - "cell_type": "code", - "execution_count": 65, - "metadata": {}, - "outputs": [], - "source": [ - "def preprocess_data(df):\n", - "    # Add datetime parameters for saledate\n", - "    df[\"saleYear\"] = df.saledate.dt.year\n", - "    df[\"saleMonth\"] = df.saledate.dt.month\n", - "    df[\"saleDay\"] = df.saledate.dt.day\n", - "    df[\"saleDayofweek\"] = df.saledate.dt.dayofweek\n", - "    df[\"saleDayofyear\"] = df.saledate.dt.dayofyear\n", - "\n", - "    # Drop original saledate\n", - "    df.drop(\"saledate\", axis=1, inplace=True)\n", - "    \n", - "    # Fill missing numeric values with the median\n", - "    for label, content in df.items():\n", - "        if pd.api.types.is_numeric_dtype(content):\n", - "            if pd.isnull(content).sum():\n", - "                df[label+\"_is_missing\"] = pd.isnull(content)\n", - "                df[label] = content.fillna(content.median())\n", - "        \n", - "        # Turn categorical variables into numbers\n", - "        if not pd.api.types.is_numeric_dtype(content):\n", - "            df[label+\"_is_missing\"] = pd.isnull(content)\n", - "            # We add the +1 because pandas encodes missing categories as -1\n", - "            df[label] = pd.Categorical(content).codes+1 \n", - "    \n", - "    return df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Question:** Where would
this function break?\n", - "\n", - "**Hint:** What if the test data had different missing values to the training data?\n", - "\n", - "Now we've got a function for preprocessing data, let's preprocess the test dataset into the same format as our training dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 66, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1227829 1006309 3168 121 3 1999 \n", - "1 1227844 1022817 7271 121 3 1000 \n", - "2 1227847 1031560 22805 121 3 2004 \n", - "3 1227848 56204 1269 121 3 2006 \n", - "4 1227863 1053887 22312 121 3 2005 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 3688.0 2 499 180 ... \n", - "1 28555.0 1 831 292 ... \n", - "2 6038.0 3 1177 404 ... \n", - "3 8940.0 1 287 113 ... \n", - "4 2286.0 2 566 196 ... \n", - "\n", - " Undercarriage_Pad_Width_is_missing Stick_Length_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 False False \n", - "3 False False \n", - "4 True True \n", - "\n", - " Thumb_is_missing Pattern_Changer_is_missing Grouser_Type_is_missing \\\n", - "0 True True True \n", - "1 True True True \n", - "2 False False False \n", - "3 False False False \n", - "4 True True True \n", - "\n", - " Backhoe_Mounting_is_missing Blade_Type_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 True True \n", - "3 True True \n", - "4 False False \n", - "\n", - " Travel_Controls_is_missing Differential_Type_is_missing \\\n", - "0 True True \n", - "1 True False \n", - "2 True True \n", - "3 True True \n", - "4 False True \n", - "\n", - " Steering_Controls_is_missing \n", - "0 True \n", - "1 False \n", - "2 True \n", - "3 True \n", - "4 True \n", - "\n", - "[5 rows x 101 columns]" - ] - }, - "execution_count": 66, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df_test = preprocess_data(df_test)\n", - "df_test.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 67, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1646770 1126363 8434 132 18.0 1974 \n", - "1 1821514 1194089 10150 132 99.0 1980 \n", - "2 1505138 1473654 4139 132 99.0 1978 \n", - "3 1671174 1327630 8591 132 99.0 1980 \n", - "4 1329056 1336053 4089 132 99.0 1984 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 0.0 0 4593 1744 ... \n", - "1 0.0 0 1820 559 ... \n", - "2 0.0 0 2348 713 ... \n", - "3 0.0 0 1819 558 ... \n", - "4 0.0 0 2119 683 ... \n", - "\n", - " Undercarriage_Pad_Width_is_missing Stick_Length_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 True True \n", - "3 True True \n", - "4 True True \n", - "\n", - " Thumb_is_missing Pattern_Changer_is_missing Grouser_Type_is_missing \\\n", - "0 True True True \n", - "1 True True True \n", - "2 True True True \n", - "3 True True True \n", - "4 True True True \n", - "\n", - " Backhoe_Mounting_is_missing Blade_Type_is_missing \\\n", - "0 False False \n", - "1 True True \n", - "2 False False \n", - "3 True True \n", - "4 False False \n", - "\n", - " Travel_Controls_is_missing Differential_Type_is_missing \\\n", - "0 False True \n", - "1 True False \n", - "2 False True \n", - "3 True False \n", - "4 False True \n", - "\n", - " Steering_Controls_is_missing \n", - "0 True \n", - "1 False \n", - "2 True \n", - "3 False \n", - "4 True \n", - "\n", - "[5 rows x 102 columns]" - ] - }, - "execution_count": 67, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 68, - "metadata": {}, - "outputs": [ - { - "ename": "ValueError", - "evalue": "The feature names should match those that were passed during fit.\nFeature names seen at fit time, yet now missing:\n- auctioneerID_is_missing\n", - "output_type": "error", - "traceback": [ - 
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", - "\u001b[1;32m/home/daniel/code/zero-to-mastery-ml/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression.ipynb Cell 100\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[39m# Make predictions on the test dataset using the best model\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m test_preds \u001b[39m=\u001b[39m ideal_model\u001b[39m.\u001b[39;49mpredict(df_test)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py:984\u001b[0m, in \u001b[0;36mForestRegressor.predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 982\u001b[0m check_is_fitted(\u001b[39mself\u001b[39m)\n\u001b[1;32m 983\u001b[0m \u001b[39m# Check data\u001b[39;00m\n\u001b[0;32m--> 984\u001b[0m X \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_X_predict(X)\n\u001b[1;32m 986\u001b[0m \u001b[39m# Assign chunk of trees to jobs\u001b[39;00m\n\u001b[1;32m 987\u001b[0m n_jobs, _, _ \u001b[39m=\u001b[39m _partition_estimators(\u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_estimators, \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mn_jobs)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/ensemble/_forest.py:599\u001b[0m, in \u001b[0;36mBaseForest._validate_X_predict\u001b[0;34m(self, X)\u001b[0m\n\u001b[1;32m 596\u001b[0m \u001b[39m\"\"\"\u001b[39;00m\n\u001b[1;32m 597\u001b[0m \u001b[39mValidate X whenever one tries to predict, apply, predict_proba.\"\"\"\u001b[39;00m\n\u001b[1;32m 598\u001b[0m check_is_fitted(\u001b[39mself\u001b[39m)\n\u001b[0;32m--> 599\u001b[0m X \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_validate_data(X, dtype\u001b[39m=\u001b[39;49mDTYPE, 
accept_sparse\u001b[39m=\u001b[39;49m\u001b[39m\"\u001b[39;49m\u001b[39mcsr\u001b[39;49m\u001b[39m\"\u001b[39;49m, reset\u001b[39m=\u001b[39;49m\u001b[39mFalse\u001b[39;49;00m)\n\u001b[1;32m 600\u001b[0m \u001b[39mif\u001b[39;00m issparse(X) \u001b[39mand\u001b[39;00m (X\u001b[39m.\u001b[39mindices\u001b[39m.\u001b[39mdtype \u001b[39m!=\u001b[39m np\u001b[39m.\u001b[39mintc \u001b[39mor\u001b[39;00m X\u001b[39m.\u001b[39mindptr\u001b[39m.\u001b[39mdtype \u001b[39m!=\u001b[39m np\u001b[39m.\u001b[39mintc):\n\u001b[1;32m 601\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mNo support for np.int64 index based sparse matrices\u001b[39m\u001b[39m\"\u001b[39m)\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/base.py:579\u001b[0m, in \u001b[0;36mBaseEstimator._validate_data\u001b[0;34m(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)\u001b[0m\n\u001b[1;32m 508\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39m_validate_data\u001b[39m(\n\u001b[1;32m 509\u001b[0m \u001b[39mself\u001b[39m,\n\u001b[1;32m 510\u001b[0m X\u001b[39m=\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mno_validation\u001b[39m\u001b[39m\"\u001b[39m,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 515\u001b[0m \u001b[39m*\u001b[39m\u001b[39m*\u001b[39mcheck_params,\n\u001b[1;32m 516\u001b[0m ):\n\u001b[1;32m 517\u001b[0m \u001b[39m\"\"\"Validate input data and set or check the `n_features_in_` attribute.\u001b[39;00m\n\u001b[1;32m 518\u001b[0m \n\u001b[1;32m 519\u001b[0m \u001b[39m Parameters\u001b[39;00m\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 577\u001b[0m \u001b[39m validated.\u001b[39;00m\n\u001b[1;32m 578\u001b[0m \u001b[39m \"\"\"\u001b[39;00m\n\u001b[0;32m--> 579\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_check_feature_names(X, reset\u001b[39m=\u001b[39;49mreset)\n\u001b[1;32m 581\u001b[0m \u001b[39mif\u001b[39;00m y \u001b[39mis\u001b[39;00m \u001b[39mNone\u001b[39;00m 
\u001b[39mand\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_get_tags()[\u001b[39m\"\u001b[39m\u001b[39mrequires_y\u001b[39m\u001b[39m\"\u001b[39m]:\n\u001b[1;32m 582\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[1;32m 583\u001b[0m \u001b[39mf\u001b[39m\u001b[39m\"\u001b[39m\u001b[39mThis \u001b[39m\u001b[39m{\u001b[39;00m\u001b[39mself\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__class__\u001b[39m\u001b[39m.\u001b[39m\u001b[39m__name__\u001b[39m\u001b[39m}\u001b[39;00m\u001b[39m estimator \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 584\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mrequires y to be passed, but the target y is None.\u001b[39m\u001b[39m\"\u001b[39m\n\u001b[1;32m 585\u001b[0m )\n", - "File \u001b[0;32m~/code/pytorch/env/lib/python3.8/site-packages/sklearn/base.py:506\u001b[0m, in \u001b[0;36mBaseEstimator._check_feature_names\u001b[0;34m(self, X, reset)\u001b[0m\n\u001b[1;32m 501\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mnot\u001b[39;00m missing_names \u001b[39mand\u001b[39;00m \u001b[39mnot\u001b[39;00m unexpected_names:\n\u001b[1;32m 502\u001b[0m message \u001b[39m+\u001b[39m\u001b[39m=\u001b[39m (\n\u001b[1;32m 503\u001b[0m \u001b[39m\"\u001b[39m\u001b[39mFeature names must be in the same order as they were in fit.\u001b[39m\u001b[39m\\n\u001b[39;00m\u001b[39m\"\u001b[39m\n\u001b[1;32m 504\u001b[0m )\n\u001b[0;32m--> 506\u001b[0m \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(message)\n", - "\u001b[0;31mValueError\u001b[0m: The feature names should match those that were passed during fit.\nFeature names seen at fit time, yet now missing:\n- auctioneerID_is_missing\n" - ] - } - ], - "source": [ - "# Make predictions on the test dataset using the best model\n", - "test_preds = ideal_model.predict(df_test)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We've found an error and it's because our test dataset (after preprocessing) has 101 columns where as, our 
training dataset (`X_train`) has 102 columns (after preprocessing).\n", - "\n", - "Let's find the difference." - ] - }, - { - "cell_type": "code", - "execution_count": 69, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'auctioneerID_is_missing'}" - ] - }, - "execution_count": 69, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# We can find how the columns differ using sets\n", - "set(X_train.columns) - set(df_test.columns)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this case, it's because the test dataset wasn't missing any `auctioneerID` fields.\n", - "\n", - "To fix it, we'll add a column to the test dataset called `auctioneerID_is_missing` and fill it with `False`, since none of the `auctioneerID` fields are missing in the test dataset." - ] - }, - { - "cell_type": "code", - "execution_count": 70, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table: first 5 rows of df_test, 5 rows × 102 columns, ending with the new auctioneerID_is_missing column (all False)]
\n", - "
" - ], - "text/plain": [ - " SalesID MachineID ModelID datasource auctioneerID YearMade \\\n", - "0 1227829 1006309 3168 121 3 1999 \n", - "1 1227844 1022817 7271 121 3 1000 \n", - "2 1227847 1031560 22805 121 3 2004 \n", - "3 1227848 56204 1269 121 3 2006 \n", - "4 1227863 1053887 22312 121 3 2005 \n", - "\n", - " MachineHoursCurrentMeter UsageBand fiModelDesc fiBaseModel ... \\\n", - "0 3688.0 2 499 180 ... \n", - "1 28555.0 1 831 292 ... \n", - "2 6038.0 3 1177 404 ... \n", - "3 8940.0 1 287 113 ... \n", - "4 2286.0 2 566 196 ... \n", - "\n", - " Stick_Length_is_missing Thumb_is_missing Pattern_Changer_is_missing \\\n", - "0 True True True \n", - "1 True True True \n", - "2 False False False \n", - "3 False False False \n", - "4 True True True \n", - "\n", - " Grouser_Type_is_missing Backhoe_Mounting_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 False True \n", - "3 False True \n", - "4 True False \n", - "\n", - " Blade_Type_is_missing Travel_Controls_is_missing \\\n", - "0 True True \n", - "1 True True \n", - "2 True True \n", - "3 True True \n", - "4 False False \n", - "\n", - " Differential_Type_is_missing Steering_Controls_is_missing \\\n", - "0 True True \n", - "1 False False \n", - "2 True True \n", - "3 True True \n", - "4 True True \n", - "\n", - " auctioneerID_is_missing \n", - "0 False \n", - "1 False \n", - "2 False \n", - "3 False \n", - "4 False \n", - "\n", - "[5 rows x 102 columns]" - ] - }, - "execution_count": 70, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Match test dataset columns to training dataset\n", - "df_test[\"auctioneerID_is_missing\"] = False\n", - "df_test.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There's one more step we have to do before we can make predictions on the test data.\n", - "\n", - "And that's to line up the columns (the features) in our test dataset to match the columns in our training dataset.\n", - "\n", - "As in, the order 
of the columns in the training dataset should match the order of the columns in our test dataset.\n", - "\n", - "> **Note:** As of Scikit-Learn 1.2, the column order used for prediction must match the column order the model was fit on." - ] - }, - { - "cell_type": "code", - "execution_count": 71, - "metadata": {}, - "outputs": [], - "source": [ - "# Match column order from X_train to df_test (to predict on columns, they should be in the same order they were fit on)\n", - "df_test = df_test[X_train.columns]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that the test dataset's column names and column order match the training dataset, we should be able to make predictions on it using our trained model." - ] - }, - { - "cell_type": "code", - "execution_count": 72, - "metadata": {}, - "outputs": [], - "source": [ - "# Make predictions on the test dataset using the best model\n", - "test_preds = ideal_model.predict(df_test)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Looking at the [Kaggle submission requirements](https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation), we see that if we wanted to make a submission, the data must be in a certain format: a DataFrame containing the `SalesID` and the predicted `SalePrice` of the bulldozer.\n", - "\n", - "Let's make it." - ] - }, - { - "cell_type": "code", - "execution_count": 73, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
[HTML table: df_preds with columns SalesID and SalePrice, 12457 rows × 2 columns]
\n", - "
" - ], - "text/plain": [ - " SalesID SalePrice\n", - "0 1227829 18198.663455\n", - "1 1227844 15747.746682\n", - "2 1227847 48390.264848\n", - "3 1227848 65840.504704\n", - "4 1227863 58180.230933\n", - "... ... ...\n", - "12452 6643171 43613.653622\n", - "12453 6643173 13335.788097\n", - "12454 6643184 13009.453670\n", - "12455 6643186 17151.096779\n", - "12456 6643196 29025.076618\n", - "\n", - "[12457 rows x 2 columns]" - ] - }, - "execution_count": 73, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Create DataFrame compatible with Kaggle submission requirements\n", - "df_preds = pd.DataFrame()\n", - "df_preds[\"SalesID\"] = df_test[\"SalesID\"]\n", - "df_preds[\"SalePrice\"] = test_preds\n", - "df_preds" - ] - }, - { - "cell_type": "code", - "execution_count": 74, - "metadata": {}, - "outputs": [], - "source": [ - "# Export to csv...\n", - "# TK - update this to export to Parquet? Or CSV is enough...?\n", - "#df_preds.to_csv(\"../data/bluebook-for-bulldozers/predictions.csv\",\n", - "# index=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## TK - Add a section where we create a purely custom sample using the available columns, e.g. a custom bulldozer sale built into an app -> model outputs price prediction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Feature Importance\n", - "\n", - "Since we've built a model which is able to make predictions. The people you share these predictions with (or yourself) might be curious of what parts of the data led to these predictions.\n", - "\n", - "This is where **feature importance** comes in. 
Feature importance seeks to figure out which attributes of the data were most important when it comes to predicting the **target variable**.\n", - "\n", - "In our case, after our model learned the patterns in the data, which bulldozer sale attributes were most important for predicting its overall sale price?\n", - "\n", - "Beware: the default (impurity-based) feature importances for random forests can be misleading, for example they tend to favour features with many unique values.\n", - "\n", - "To find out which features were most important for a machine learning model, a good idea is to search for something like \"\\[MODEL NAME\\] feature importance\".\n", - "\n", - "Doing this for our `RandomForestRegressor` leads us to find the `feature_importances_` attribute.\n", - "\n", - "Let's check it out." - ] - }, - { - "cell_type": "code", - "execution_count": 75, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([3.46627679e-02, 1.56764228e-02, 4.37926114e-02, 1.67650333e-03,\n", - " 3.34502142e-03, 1.97197187e-01, 3.06340515e-03, 9.77673001e-04,\n", - " 4.36749162e-02, 4.29037954e-02, 6.72975373e-02, 4.75148666e-03,\n", - " 1.53850682e-02, 1.55090204e-01, 4.45514872e-02, 5.95084780e-03,\n", - " 3.10028046e-03, 3.63841539e-03, 3.19638601e-03, 8.14915664e-02,\n", - " 6.53889246e-04, 5.94194750e-05, 1.41170245e-03, 2.27382622e-04,\n", - " 1.14398876e-03, 1.34339699e-04, 1.22607934e-03, 1.20989969e-02,\n", - " 1.44058495e-04, 1.35266062e-03, 3.36823266e-03, 3.42373542e-03,\n", - " 4.15153438e-03, 7.64328000e-04, 2.33613372e-03, 6.31647990e-03,\n", - " 9.15661618e-04, 1.20935454e-02, 1.89512094e-03, 2.06103870e-03,\n", - " 1.05929984e-03, 8.74105415e-04, 2.29493677e-03, 5.64128997e-04,\n", - " 7.38134764e-04, 3.60958405e-04, 2.88470991e-04, 2.15278313e-03,\n", - " 9.99457824e-04, 2.60780750e-04, 2.25566970e-04, 7.31244555e-02,\n", - " 3.78194598e-03, 5.69220059e-03, 2.90964527e-03, 9.93488783e-03,\n", - " 2.56180192e-04, 1.47457488e-03, 3.42954304e-04, 0.00000000e+00,\n", - " 0.00000000e+00, 1.91457326e-03,
1.34351415e-03, 5.69871181e-03,\n", - " 2.39504831e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,\n", - " 0.00000000e+00, 5.97282251e-05, 7.01667140e-06, 3.59081182e-04,\n", - " 2.73656952e-05, 1.78603144e-04, 6.52558655e-05, 2.63920230e-04,\n", - " 3.48499839e-05, 1.21327449e-03, 1.78540890e-03, 9.39776578e-04,\n", - " 6.85501348e-05, 2.29510723e-03, 9.72534799e-04, 2.28851486e-03,\n", - " 1.43734099e-03, 9.74602815e-04, 3.39223163e-03, 1.66720737e-04,\n", - " 1.11866001e-02, 1.45338897e-03, 2.03979079e-03, 5.52949060e-05,\n", - " 9.35368656e-05, 5.93897732e-05, 7.88358813e-05, 6.16130828e-05,\n", - " 8.52025865e-05, 3.58013480e-04, 1.26474014e-04, 1.53885294e-04,\n", - " 8.63610944e-05, 1.87067289e-04])" - ] - }, - "execution_count": 75, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Find feature importance of our best model\n", - "ideal_model.feature_importances_" - ] - }, - { - "cell_type": "code", - "execution_count": 76, - "metadata": {}, - "outputs": [], - "source": [ - "# Install Seaborn package in current environment (if you don't have it)\n", - "# import sys\n", - "# !conda install --yes --prefix {sys.prefix} seaborn" - ] - }, - { - "cell_type": "code", - "execution_count": 77, - "metadata": {}, - "outputs": [], - "source": [ - "import seaborn as sns\n", - "\n", - "# Helper function for plotting feature importance\n", - "def plot_features(columns, importances, n=20):\n", - " df = (pd.DataFrame({\"features\": columns,\n", - " \"feature_importance\": importances})\n", - " .sort_values(\"feature_importance\", ascending=False)\n", - " .reset_index(drop=True))\n", - " \n", - " sns.barplot(x=\"feature_importance\",\n", - " y=\"features\",\n", - " data=df[:n],\n", - " orient=\"h\")" - ] - }, - { - "cell_type": "code", - "execution_count": 78, - "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": { - "needs_background": "light" - }, - "output_type": "display_data" - } - ], - "source": [ - "plot_features(X_train.columns, ideal_model.feature_importances_)" - ] - }, - { - "cell_type": "code", - "execution_count": 79, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "1.0000000000000002" - ] - }, - "execution_count": 79, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "sum(ideal_model.feature_importances_)" - ] - }, - { - "cell_type": "code", - "execution_count": 80, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "216605" - ] - }, - "execution_count": 80, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.ProductSize.isna().sum()" - ] - }, - { - "cell_type": "code", - "execution_count": 81, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Medium 64342\n", - "Large / Medium 51297\n", - "Small 27057\n", - "Mini 25721\n", - "Large 21396\n", - "Compact 6280\n", - "Name: ProductSize, dtype: int64" - ] - }, - "execution_count": 81, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.ProductSize.value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 82, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "None or Unspecified 77111\n", - "Yes 3985\n", - "Name: Turbocharged, dtype: int64" - ] - }, - "execution_count": 82, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.Turbocharged.value_counts()" - ] - }, - { - "cell_type": "code", - "execution_count": 83, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "None or Unspecified 85074\n", - "Manual 9678\n", - "Hydraulic 7580\n", - "Name: Thumb, dtype: int64" - ] - }, - "execution_count": 83, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.Thumb.value_counts()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - 
"## Extensions and Extra-curriculum\n", - "\n", - "* Extra-curriculum: read pandas io tools for info on parquet/feather data formats - [IO tools documentation page](https://pandas.pydata.org/docs/user_guide/io.html#). \n", - "\n", - "* See all of the pandas dtypes in the pandas user guide: https://pandas.pydata.org/docs/user_guide/basics.html#dtypes \n", - "* > **Note:** There are some ML models such as [`sklearn.ensemble.HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html), [CatBoost](https://catboost.ai/) and [XGBoost](https://xgboost.ai/) which can handle missing values, however, I'll leave exploring each of these as extra-curriculum/extensions." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}
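As a further extension, the two manual fixes we applied to `df_test` before predicting (adding the missing `auctioneerID_is_missing` column, then reordering the columns to match `X_train`) could be folded into one helper. Below is a minimal sketch, not the notebook's own method: `align_columns` is a hypothetical helper name, the two small DataFrames are made-up stand-ins for `X_train` and `df_test`, and filling new columns with `False` is only appropriate for `_is_missing` indicator columns like the one above.

```python
import pandas as pd


def align_columns(df_to_align, reference_columns, fill_value=False):
    """Return a copy of df_to_align with exactly reference_columns, in that order.

    Columns absent from df_to_align are created and filled with fill_value.
    Columns not listed in reference_columns are dropped.
    (Hypothetical helper, assumes missing columns are boolean indicators.)
    """
    return df_to_align.reindex(columns=list(reference_columns),
                               fill_value=fill_value)


# Toy stand-ins for X_train and df_test (made-up values, not the real data)
X_train_toy = pd.DataFrame({"SalesID": [1, 2],
                            "auctioneerID": [3.0, 1.0],
                            "auctioneerID_is_missing": [False, True]})
df_test_toy = pd.DataFrame({"auctioneerID": [3.0, 3.0],
                            "SalesID": [10, 11]})

# Same set-difference check as in the notebook
print(set(X_train_toy.columns) - set(df_test_toy.columns))

# Add the missing indicator column and match the training column order
df_test_aligned = align_columns(df_test_toy, X_train_toy.columns)
print(list(df_test_aligned.columns))
```

With the real data, the equivalent call would be something like `df_test = align_columns(df_test, X_train.columns)`. It's still worth running the set-difference check first, so a silently added column never hides a real preprocessing mismatch.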