Skip to content

Commit

Permalink
Encoding code examples
Browse files Browse the repository at this point in the history
  • Loading branch information
s2t2 committed Sep 22, 2024
1 parent 93ec355 commit 413df20
Showing 1 changed file with 109 additions and 14 deletions.
123 changes: 109 additions & 14 deletions docs/notes/predictive-modeling/ml-foundations/data-encoding.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -8,53 +8,148 @@ So if we have categorical or textual data, we need to use a **data encoding** st

We'll choose an encoding strategy based on the nature of the data. For categorical data, we'll use either ordinal encoding if there's an inherent order among the categories, or one-hot encoding if no such order exists. For time-series data, we can apply time step encoding to represent the temporal sequence of observations.

## Encoding Approaches
## For Categorical Data

### Ordinal Encoding for Categorical Data
### Ordinal Encoding

When the data has a natural order (i.e. where one category is "greater" or "less" than another), we use **ordinal encoding**. This involves converting the categories into a sequence of numbered values. For example, in a dataset containing ticket classes like "upper", "middle", and "lower", we can map these to integers (e.g. `[1, 2, 3]`, or `[3, 2, 1]`), maintaining the ordered relationship.
When the data has a natural order (i.e. where one category is "greater" or "less" than another), we use **ordinal encoding**. This involves converting the categories into a sequence of numbered values. For example, in a dataset containing ticket classes like "business", "priority", and "economy", we can map these to integers (e.g. `[1, 2, 3]`, or `[3, 2, 1]`), maintaining the ordered relationship.


![Example of ordinal encoding (passenger ticket classes).](../../../images/ordinal-encoding-passenger-class.png)


With ordinal encoding, we start with a column of textual category names, and we wind up with a column of numbers.

### One-hot Encoding for Categorical Data
To implement ordinal encoding in Python, you can consider a `DataFrame` mapping operation:

```{python}
#| code-fold: true
from pandas import DataFrame
# exmaple dataframe:
passengers_df = DataFrame({
"passenger_id": ["P1", "P2", "P3", "P4", "P5", "P6"],
"ticket_class": ["BUSINESS", "PRIORITY", "BUSINESS",
"ECONOMY", "ECONOMY", "ECONOMY"]
})
passengers_df.index = passengers_df["passenger_id"]
passengers_df.index.name = "passenger_id"
passengers_df.drop(columns=["passenger_id"], inplace=True)
#passengers_df
```

```{python}
ticket_class_map = {"BUSINESS": 3, "PRIORITY": 2, "ECONOMY": 1}
x_encoded = passengers_df["ticket_class"].map(ticket_class_map)
passengers_df["ticket_class_encoded"] = x_encoded
passengers_df
```

When categorical data has no natural order, we use a technique called **one-hot encoding** to convert it into a numerical format. In one-hot encoding, each category is represented as a row of binary values, where one element is set to 1 to indicate the presence of that category, and the remaining elements are set to 0, indicating the absence of all other categories.


### One-hot Encoding

When categorical data has no natural order, we use a technique called **one-hot encoding** to convert it into a numerical format. In one-hot encoding, each category is represented as a row of binary values, where one element is set to 1 (or True) to indicate the presence of that category, and the remaining elements are set to 0 (or False), indicating the absence of all other categories.

For example, if we have five color categories (blue, green, red, purple, yellow), one-hot encoding transforms the single column of colors into five separate columns. Each column corresponds to one color, and for any given row, a 1 appears in the column that matches the color, while the other columns contain 0s to represent the absence of the other colors.

![Example of one-hot encoding for categorical data (diamond colors).](../../../images/one-hot-encoding-diamond-color.png)

With ordinal encoding, we start with a column of textual category names, and we wind up with as many columns as there were unique values in the original column. This can potentially lead to a large number of features if there are a large number of categories present.
With one-hot encoding, we start with a column of textual category names, and we wind up with as many columns as there were unique values in the original column. This can potentially lead to a large number of features if there are a large number of categories present.

To implement one-hot encoding in Python, we can use the [`get_dummies` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) from `pandas`, specifying a list of columns to encode in this way:

```{python}
#| code-fold: true
from pandas import DataFrame
# example dataframe
diamonds_df = DataFrame({
"diamond_id": ["D1", "D2", "D3", "D4", "D5", "D6"],
"color": ["blue", "green", "red", "purple", "yellow", "blue"]
})
diamonds_df.index = diamonds_df["diamond_id"]
diamonds_df.index.name = "diamond_id"
diamonds_df.drop(columns=["diamond_id"], inplace=True)
#diamonds_df
```
```{python}
from pandas import get_dummies as one_hot_encode
#x_encoded = one_hot_encode(diamonds_df, columns=["color"])
x_encoded = one_hot_encode(diamonds_df["color"])
x_encoded = x_encoded.astype("int")
x_encoded.columns = [f"color_{color}" for color in x_encoded.columns]
diamonds_df.merge(x_encoded, left_index=True, right_index=True)
```

```{python}
#diamonds_df.merge(x_encoded, left_index=True, right_index=True)
```

<!--
As an alternative method of performing one-hot encoding in Python, we can use the [`CountVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) from `sklearn`:
```{python}
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x_encoded = cv.fit_transform(diamonds_df["color"])
print(type(x_encoded))
columns = cv.get_feature_names_out()
x_encoded = DataFrame(x_encoded.toarray(), columns=columns)
#x_encoded
diamonds_df.merge(x_encoded, left_index=True, right_index=True)
```
```{python}
#diamonds_df.merge(x_encoded, left_index=True, right_index=True)
```
-->

### One-hot Encoding for Textual Data

#### Bag of Words


In natural language processing (NLP), we can use one-hot encoding to represent words (or "tokens") in a sentence. This approach is known as the "bag of words" method. In the bag of words approach, each document is transformed into a row of binary values, where 1 indicates the presence of a given word in that document, and 0 indicates absence of that word. This allows us to turn text into numbers that machine learning models can work with.

![Example of one-hot encoding for textual data.](../../../images/bag-of-words.png)
![Example of one-hot encoding for textual data (i.e. "bag of words" approach).](../../../images/bag-of-words.png)

The bag of words approach is a simple count-based text embedding approach, however more advanced alternative approaches include Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings. To implement count-based word embedding strategies in Python, we can use the [`CountVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), or more commonly the [`TfidfVectorizer` class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from `sklearn`.

The bag of words approach is a simple count-based text embedding approach, however more advanced alternative approaches include Term Frequency-Inverse Document Frequency (TF-IDF), and word embeddings.
## For Time-series Data

### Time Step Encoding for Time-series Data
### Time Step Encoding

When working with time-series data, it's important to encode dates in a way that preserves the temporal structure. A common method is **time-step encoding**, where each timestamp is assigned a sequential integer value. For example, if our data is recorded daily, we would assign the earliest date a value of 1, the next day a value of 2, and so on. This approach works well for data collected at regular intervals, such as daily, monthly, or yearly.
When working with time-series data, it's important to encode dates in a way that preserves the temporal structure. A common method is **time-step encoding**, where each timestamp is assigned a sequential integer value. This approach works well for data collected at regular intervals (e.g. daily, monthly, quarterly, or yearly).

To implement this in Python, we first sort the dataset by date in ascending order, ensuring the earliest date appears first. Then, we create a new column with sequential integers, starting at 1 for the earliest date and incrementing by 1 for each subsequent date:

```python
```{python}
#| code-fold: true
from pandas import DataFrame
dates = ["1970-01-01", "1971-01-01", "1972-01-01", "1973-01-01", "1974-01-01",
"1975-01-01", "1976-01-01", "1977-01-01", "1978-01-01", "1979-01-01"]
df = DataFrame({"date": dates})
```

```{python}
df.sort_values(by="date", ascending=True, inplace=True)
df["time_step"] = range(1, len(df) + 1)
df
```

In some ways, this is similar to an ordinal encoding, where the order of values is maintained.

## Summary

In summary, different types of data require different encoding strategies. For categorical data, we use ordinal encoding when categories have a natural order, and one-hot encoding when no such order exists. In text data, one-hot encoding is applied through the "bag of words" method to represent the presence or absence of words in documents. For time-series data, time step encoding is used to preserve the temporal order of observations. Choosing the right encoding method ensures that the model can process the data effectively and make accurate predictions.

0 comments on commit 413df20

Please sign in to comment.