Chapter 04 - Apply Osvaldo and Mari Sarabia corrections #194

Open
wants to merge 5 commits into
base: main
93 changes: 56 additions & 37 deletions 04_naive_bayes.Rmd
@@ -4,7 +4,7 @@ editor_options:
wrap: 72
---

# Spam filter
# Spam filter: An introductory classification model

## Naive Bayes: Spam or Ham?

@@ -24,42 +24,53 @@ using Random
Random.seed!(123)
```

Nobody likes spam emails. How can Bayes help? In this chapter, we'll
Nobody likes spam emails. How can Bayes' theorem help? In this chapter, we'll
keep expanding our data science knowledge with a practical example. A
simple yet effective way of using Bayesian probability to create a spam
filter from scratch will be introduced. The filter will examine emails
and classify them as either spam or ham (the word for non-spam emails)
based on their content.

What we will be implementing here is a *supervised learning model*, in
other words, a classification model that has been trained on previously
classified data. Think of it like a machine to which you can give some
input, like an email, and will give you some label to that input, like
spam or ham. This machine has a lot of tiny knobs, and based on their
particular configuration it will output some label for each input.
other words, a classification model that can "learn"
to associate the target variable (email type) with the input variables
(words contained in the email). Think of it like a machine that is fed
an email and outputs a label for it, like
spam or ham. This machine has a lot of tiny knobs (also called the parameters of
the model), and based on their particular configuration, the model will output
one label or another.
Supervised learning involves iteratively finding the right configuration
of these knobs by letting the machine make a guess with some
pre-classified data, checking if the guess matches the true label, and
if not, tuning the knobs in some controlled way. The way our machine will
make predictions is based on the underlying mathematical model. For a
spam filter, a *naive Bayes* approach has proven to be effective, and
you will have the opportunity to verify that yourself at the end of the
chapter. In a naive Bayes model, Bayes' theorem is the main tool for
chapter. In such a model, Bayes' theorem is the main tool for
classifying, and it is *naive* because we make very loose assumptions
about the data we are analyzing. This will be clearer once we dive into
the implementation.
about the data we are analyzing. When building models, we often make
decisions that make our lives easier, at the cost of a simpler, less
realistic model. It is generally good practice to start with a simple
model and add complexity only as needed. This will be clearer once
we dive into the implementation.
In summary, a successful spam filter model might infer from the training
data that emails containing the word “discount” have a high probability
of being spam.
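
To make the "discount" intuition concrete, here is a minimal sketch of Bayes' theorem applied to a single word. All the probabilities below are hypothetical values chosen for illustration; they are not taken from the dataset used in this chapter.

```julia
# Hypothetical numbers, chosen only for illustration (not from the dataset).
p_spam = 0.3                  # prior: fraction of emails that are spam
p_ham = 1 - p_spam            # prior: fraction of emails that are ham
p_word_given_spam = 0.20      # P("discount" | spam)
p_word_given_ham = 0.01       # P("discount" | ham)

# Total probability of seeing the word in any email
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(spam | "discount")
p_spam_given_word = p_word_given_spam * p_spam / p_word

println(round(p_spam_given_word, digits=3))  # prints 0.896
```

Under these assumed numbers, seeing "discount" raises the probability of spam from the prior of 0.3 to roughly 0.9.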

## The Training Data
## The data

For the Bayesian spam filter to work correctly, we need to feed it some
good training data. In this context, that means having a large enough
corpus of emails that have been pre-classified as spam or ham. The
corpus [^1] of emails that have been pre-classified as spam or ham. The
emails should be collected from a sufficiently heterogeneous group of
people. After all, spam is a somewhat subjective category: one person's
spam may be another person's ham. The proportion of spam vs. ham in our
data should also be somewhat representative of the real proportion of
emails we receive.

[^1]: A large and structured set of text data, very often used to train
language models.

Fortunately, there are a lot of very good datasets available online.
We'll use the "Email Spam Classification Dataset CSV" from
[Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv),
@@ -198,16 +209,16 @@ discover $P(email|spam)$. The new email looks like this:
The new email contains the words *win* and *product*, which are rather
common in our example's training data. We would therefore expect
$P(email|spam)$, the probability of the new email being generated by the
words encountered in the training spam email set, to be relatively high.
words encountered in the training spam email set, to be relatively high. [^2]

(The word \\emph{win} appears in the form \\emph{won} in the training
[^2]: The word **win** appears in the form **won** in the training
set, but that's OK. The standard linguistic technique of
\\emph{lemmatization} groups together any related forms of a word and
treats them as the same word.)
**lemmatization** groups together any related forms of a word and
treats them as the same word.

Mathematically, the way to calculate $P(email|spam)$ is to take each
word in our target email, calculate the probability of it appearing in
spam emails based on our training set, and multiply those probabilties
spam emails based on our training set, and multiply those probabilities
together.

$P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$
@@ -218,9 +229,9 @@ the training ham email set:

$P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$
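
The two products above can be sketched in a few lines of Julia. The per-word probabilities and the helper names `likelihood` and `log_likelihood` are hypothetical and used only for illustration. The log version is not part of the chapter's implementation as far as this excerpt shows; it is included because a product of many small numbers can underflow for long emails, and summing logarithms is a common way around that.

```julia
# Hypothetical per-word probabilities for a two-word email.
word_probs_spam = Dict("win" => 0.05, "product" => 0.02)
word_probs_ham  = Dict("win" => 0.01, "product" => 0.03)

email = ["win", "product"]

# Naive Bayes likelihood: product of the per-word probabilities
likelihood(email, word_probs) = prod(word_probs[w] for w in email)

# Same quantity on a log scale: sum of the per-word log-probabilities
log_likelihood(email, word_probs) = sum(log(word_probs[w]) for w in email)

p_email_spam = likelihood(email, word_probs_spam)   # 0.05 * 0.02
p_email_ham  = likelihood(email, word_probs_ham)    # 0.01 * 0.03
```

With these toy values, $P(email|spam)$ is larger than $P(email|ham)$, so the likelihood term favors the spam label.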

The multiplication of each of the probabilities associated with a
particular word here stems from the naive assumption that all the words
in the email are statistically independent. In reality, this assumption
The multiplication of the word probabilities here stems from the
naive assumption that all the words in the email are conditionally
independent given the class (spam or ham). In reality, this assumption
isn't necessarily true. In fact, it's most likely false. Words in a
language are never independent from one another, but this simple
assumption seems to be enough for the level of complexity our problem
@@ -229,8 +240,15 @@ requires.
The probability of a given word $word_i$ being in a given category is
calculated like so:

$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$
$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$
\begin{equation}
\tag{1.1}
P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}
\end{equation}

\begin{equation}
\tag{1.2}
P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}
\end{equation}

These formulas tell us exactly what we have to calculate from our data.
We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each
@@ -242,7 +260,8 @@ in the dataset. The variable $\alpha$ is a smoothing parameter that
prevents the probability of a given word being in a given category from
going down to zero. If a given word hasn't appeared in the spam category
in our training dataset, for example, we don't want to assign it zero
probability of appearing in new spam emails.
probability of appearing in new spam emails. See the [appendix](#appendix-alpha)
for more details.
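
As a quick illustration of the smoothing in formulas $(1.1)$ and $(1.2)$, the sketch below evaluates the same expression with and without $\alpha$. The function name `smoothed_prob` and all the counts are hypothetical, and we assume here that $N_{spam}$ is the total number of words seen in spam emails.

```julia
# Additive smoothing, following the shape of formulas (1.1) and (1.2).
# All counts below are hypothetical toy values.
smoothed_prob(n_word, n_class, n_vocabulary, α) =
    (n_word + α) / (n_class + α * n_vocabulary)

n_spam = 100          # assumed: total number of words seen in spam emails
n_vocabulary = 50     # number of distinct words in the vocabulary

# A word that never appeared in spam emails
p_unseen_raw = smoothed_prob(0, n_spam, n_vocabulary, 0)  # no smoothing
p_unseen     = smoothed_prob(0, n_spam, n_vocabulary, 1)  # α = 1

# A word that appeared 9 times in spam emails
p_seen = smoothed_prob(9, n_spam, n_vocabulary, 1)        # α = 1
```

Without smoothing, the unseen word gets probability zero, which would zero out the whole product for any email containing it. With $\alpha = 1$ it gets the small but nonzero value $1/150$.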

As all of this information will be specific to our dataset, a clever way
to aggregate it is to use a Julia *struct*, with attributes for the
@@ -298,7 +317,7 @@ modifies its arguments in-place (in this case, the spam filter struct
itself). This function *fits* our model to the data, a typical procedure
in data science and machine learning areas.

```{julia, results = TRUE}
```{julia, results = FALSE}
function fit!(model::BayesSpamFilter, x_train, y_train, voc)
model.vocabulary = voc
model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
@@ -337,9 +356,9 @@ testing portion to evaluate the model's accuracy later.
Now that we have our model, we can use it to make some spam vs. ham
predictions and assess its performance. We'll define a few more
functions to help with this process. First, we need a function
implementing the TAL formula that we discussed earlier.
implementing formulas $(1.1)$ and $(1.2)$.

```{julia, results = TRUE}
```{julia, results = FALSE}
function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
@@ -432,14 +451,16 @@ five emails in the test data.
predictions[1:5]
```

Of the first five emails, one (the third) was classified as spam, and
Of the first five emails, the third and the fifth were classified as spam, while
the rest were classified as ham.

## Evaluating the Accuracy

Looking at the predictions themselves is pretty meaningless; what we
really want to know is the model's accuracy. We'll define another
function to calculate this.
really want is a metric that lets us evaluate the effectiveness
of our model quantitatively. The usual first step is
to calculate the model's accuracy.
We'll define another function for this calculation.

```{julia, results = FALSE}
function spam_filter_accuracy(predictions, actual)
@@ -484,16 +505,14 @@ function that builds a confusion matrix for our spam filter:

```{julia, results = FALSE}
function spam_filter_confusion_matrix(y_test, predictions)
# 2x2 matrix is instantiated with zeros
confusion_matrix = zeros((2, 2))

confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test))
confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test))
confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test))
confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test))

# Now we convert the confusion matrix into a DataFrame
confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[])
confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))

@@ -549,7 +568,7 @@ functions to fit the spam filter object to the data. Finally, we made
predictions on new data and evaluated our model's performance by
calculating the accuracy and making a confusion matrix.
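
The evaluation step can be sketched on toy data as follows. The labels and predictions are hypothetical, and this is a simplified stand-in rather than the chapter's own `spam_filter_accuracy` and `spam_filter_confusion_matrix` functions. It keeps the same layout as the chapter's confusion matrix: rows are predicted classes and columns are true classes.

```julia
# Hypothetical labels: 0 = ham, 1 = spam
y_test      = [0, 0, 1, 1, 0, 1]
predictions = [0, 1, 1, 0, 0, 1]

# Accuracy: fraction of predictions that match the true label
accuracy = count(predictions .== y_test) / length(y_test)

# Confusion matrix with rows = predicted class, columns = true class
confusion = zeros(Int, 2, 2)
for (truth, pred) in zip(y_test, predictions)
    # Shift the 0/1 labels by one to get valid 1-based indices
    confusion[pred + 1, truth + 1] += 1
end
```

The diagonal of `confusion` holds the correct predictions, so the accuracy is the sum of the diagonal divided by the total number of emails.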

## Appendix - A little more about alpha
## Appendix - A little more about alpha {#appendix-alpha}

As we have seen, to calculate the probability of the email being a spam
email, we should use
10 changes: 5 additions & 5 deletions 04_naive_bayes/tmp.jl
@@ -128,13 +128,13 @@ function spam_filter_confusion_matrix(y_test, predictions)
# 2x2 matrix is instantiated with zeros
confusion_matrix = zeros((2, 2))

confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test))
confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test))
confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test))
confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test))

# Now we convert the confusion matrix into a DataFrame
confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[])
confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
