diff --git a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb b/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb index d7cd941d8..d2cd659de 100644 --- a/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb +++ b/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb @@ -4,14 +4,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "TK - add Google Colab link as well as reference notebook etc" + "\n", + " \"Open\n", + "\n", + "\n", + "[View source code](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-3-structured-data-projects/end-to-end-bluebook-bulldozer-price-regression-v2.ipynb) | [Read notebook in online book format](https://dev.mrdbourke.com/zero-to-mastery-ml/end-to-end-bluebook-for-bulldozers-price-regression-v2.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# 🚜 Predicting the Sale Price of Bulldozers using Machine Learning \n", + "# Predicting the Sale Price of Bulldozers using Machine Learning 🚜 \n", "\n", "In this notebook, we're going to go through an example machine learning project to use the characteristics of bulldozers and their past sales prices to predict the sale price of future bulldozers based on their characteristics.\n", "\n", @@ -5722,64 +5726,6 @@ "\n", "TK - does this table show up?\n", "\n", - "| **Encoder** | **Description** | **Use case** | **For use on** |\n", - "|-------------|-----------------|--------------|----------------|\n", - "| [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) | Encode target labels with value between 0 and n_classes-1. | Useful for turning classification target values into numeric representations. | Target labels. |\n", - "| [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#onehotencoder) | Encode categorical features as a [one-hot numeric array](https://en.wikipedia.org/wiki/One-hot). | Turns every positive class of a unique category into a 1 and every negative class into a 0. | Categorical variables/features. |\n", - "| [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#ordinalencoder) | Encode categorical features as an integer array. | Turn unique categorical values into a range of integers, for example, 0 maps to \"cat\", 1 maps to \"dog\" and more. | Categorical variables/features. |\n", - "| [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#targetencoder) | Encode regression and classification targets into a shrunk estimate of the average target values for observations of the category. - Useful for converting targets into a certain range of values. | Target variables. |\n", - "\n", - "For our case, we're going to start with `OrdinalEncoder`.\n", - "\n", - "When transforming/encoding values with Scikit-Learn, the steps as follows:\n", - "\n", - "1. Instantiate an encoder, for example, `sklearn.preprocessing.OrdinalEncoder`.\n", - "2. Use the [`sklearn.preprocessing.OrdinalEncoder.fit`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.fit) method on the **training** data (this helps the encoder learn a mapping of categorical to numeric values).\n", - "3. Use the [`sklearn.preprocessing.OrdinalEncoder.transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.transform) method on the **training** data to apply the learned mapping from categorical to numeric values.\n", - " * **Note:** The [`sklearn.preprocessing.OrdinalEncoder.fit_transform`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder.fit_transform) method combines steps 1 & 2 into a single method.\n", - "4. Apply the learned mapping to subsequent datasets such as **validation** and **test** splits using `sklearn.preprocessing.OrdinalEncoder.transform` only.\n", - "\n", - "Notice how the `fit` and `fit_transform` methods were reserved for the **training dataset only**.\n", - "\n", - "This is because in practice the validation and testing datasets are meant to be unseen, meaning only information from the training dataset should be used to preprocess the validation/test datasets.\n", - "\n", - "In short:\n", - "\n", - "1. Instantiate an encoder such as `sklearn.preprocessing.OrdinalEncoder`.\n", - "2. Fit the encoder to and transform the training dataset categorical variables/features with `sklearn.preprocessing.OrdinalEncoder.fit_transform`.\n", - "3. Transform categorical variables/features from subsequent datasets such as the validation and test datasets with the learned encoding from step 2 using `sklearn.preprocessing.OridinalEncoder.transform`. \n", - " * **Note:** Notice the use of the `transform` method on validation/test datasets rather than `fit_transform`.\n", - "\n", - "Let's do it!\n", - "\n", - "We'll use the `OrdinalEncoder` class to fill any missing values with `np.nan` (`NaN`).\n", - "\n", - "We'll also make sure to only use the `OrdinalEncoder` on the categorical features of our DataFrame.\n", - "\n", - "Finally, the `OrdinalEncoder` expects all input variables to be of the same type (e.g. either numeric only or string only) so we'll make sure all the input variables are strings only using [`pandas.DataFrame.astype(str)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Nice!\n", - "\n", - "We define our different feature types so we can use different preprocessing methods on each type.\n", - "\n", - "Scikit-Learn has many built-in methods for preprocessing data under the [`sklearn.preprocessing` module](https://scikit-learn.org/stable/api/sklearn.preprocessing.html#).\n", - "\n", - "And I'd encourage you to spend some time reading the [preprocessing data section of the Scikit-Learn user guide](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) for more details.\n", - "\n", - "For now, let's focus on turning our categorical features into numbers (from object/string datatype to numeric datatype).\n", - "\n", - "The practice of turning non-numerical features into numerical features is often referred to as [**encoding**](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features).\n", - "\n", - "There are several encoders available for different use cases.\n", - "\n", - "TK - does this table show up?\n", - "\n", - "\n", "\n", "For our case, we're going to start with `OrdinalEncoder`.\n", "\n", @@ -12765,16 +12711,6 @@ "print(\"[INFO] Pipeline with one hot encoding scores:\")\n", "pipeline_one_hot_scores" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Next:\n", - "# Go through TK's" - ] } ], "metadata": {