unbalancedparentheses · entropidelic · Feb 16, 2023 · Feb 16, 2023 · Feb 16, 2023 · Apr 4, 2023
diff --git a/04_naive_bayes.Rmd b/04_naive_bayes.Rmd
@@ -4,7 +4,7 @@ editor_options:
     wrap: 72
 ---
 
-# Spam filter
+# Spam filter: An introductory classification model
 
 ## Naive Bayes: Spam or Ham?
 
@@ -24,42 +24,53 @@ using Random
 Random.seed!(123)
 ```
 
-Nobody likes spam emails. How can Bayes help? In this chapter, we'll
+Nobody likes spam emails. How can Bayes' theorem help? In this chapter, we'll
 keep expanding our data science knowledge with a practical example. A
 simple yet effective way of using Bayesian probability to create a spam
 filter from scratch will be introduced. The filter will examine emails
 and classify them as either spam or ham (the word for non-spam emails)
 based on their content.
 
 What we will be implementing here is a *supervised learning model*, in
-other words, a classification model that has been trained on previously
-classified data. Think of it like a machine to which you can give some
-input, like an email, and will give you some label to that input, like
-spam or ham. This machine has a lot of tiny knobs, and based on their
-particular configuration it will output some label for each input.
+other words, a classification model that can "learn"
+to associate the target variable (email type) with the input variables
+(words contained in the email). Think of it like a machine that is fed
+an email and outputs a label for it, like
+spam or ham. This machine has a lot of tiny knobs (also called parameters of
+the model), and based on their particular configuration, the model will output
+one label or another.
 Supervised learning involves iteratively finding the right configuration
 of these knobs by letting the machine make a guess with some
 pre-classified data, checking if the guess matches the true label, and
 if not, tune the knobs in some controlled way. The way our machine will
 make predictions is based on the underlying mathematical model. For a
 spam filter, a *naive Bayes* approach has proven to be effective, and
 you will have the opportunity to verify that yourself at the end of the
-chapter. In a naive Bayes model, Bayes' theorem is the main tool for
+chapter. In such a model, Bayes' theorem is the main tool for
 classifying, and it is *naive* because we make very loose assumptions
-about the data we are analyzing. This will be clearer once we dive into
-the implementation.
+about the data we are analyzing. When creating models, we often make
+choices that simplify our work, and they usually come at the expense
+of realism. It is good practice to start with simple models
+and add complexity only as needed. This will be clearer once
+we dive into the implementation.
+For example, a successful spam filter model might infer from the training
+data that emails containing the word “discount” have a high probability
+of being spam.
 
-## The Training Data
+## The data
 
 For the Bayesian spam filter to work correctly, we need to feed it some
 good training data. In this context, that means having a large enough
-corpus of emails that have been pre-classified as spam or ham. The
+corpus [^1] of emails that have been pre-classified as spam or ham. The
 emails should be collected from a sufficiently heterogeneous group of
 people. After all, spam is a somewhat subjective category: one person's
 spam may be another person's ham. The proportion of spam vs. ham in our
 data should also be somewhat representative of the real proportion of
 emails we receive.
 
+[^1]: A large and structured set of text data, often used to train
+language models.
+
 Fortunately, there are a lot of very good datasets available online.
 We'll use the "Email Spam Classification Dataset CSV" from
 [Kaggle](https://www.kaggle.com/balaka18/email-spam-classification-dataset-csv),
@@ -198,16 +209,16 @@ discover $P(email|spam)$. The new email looks like this:
 The new email contains the words *win* and *product*, which are rather
 common in our example's training data. We would therefore expect
 $P(email|spam)$, the probability of the new email being generated by the
-words encountered in the training spam email set, to be relatively high.
+words encountered in the training spam email set, to be relatively high. [^2]
 
-(The word \\emph{win} appears in the form \\emph{won} in the training
+[^2]: The word **win** appears in the form **won** in the training
 set, but that's OK. The standard linguistic technique of
-\\emph{lemmatization} groups together any related forms of a word and
-treats them as the same word.)
+**lemmatization** groups together any related forms of a word and
+treats them as the same word.
 
 Mathematically, the way to calculate $P(email|spam)$ is to take each
 word in our target email, calculate the probability of it appearing in
-spam emails based on our training set, and multiply those probabilties
+spam emails based on our training set, and multiply those probabilities 
 together.
 
 $P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)$
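+
+As a minimal sketch of this product, assume a made-up table of
+per-word spam probabilities. The `toy_spam_probs` values and the
+`email_likelihood` helper are illustrative only; they are not taken
+from our data or from the filter we build later.
+
+```julia
+# Hypothetical P(word|spam) values, chosen only for illustration.
+toy_spam_probs = Dict("win" => 0.2, "product" => 0.1, "now" => 0.05)
+
+# P(email|spam) under the naive assumption:
+# multiply the per-word probabilities of the words in the email.
+email_likelihood(words, probs) = prod(probs[w] for w in words)
+
+p = email_likelihood(["win", "product"], toy_spam_probs)
+```
+
+Here `p` is simply the product $0.2 \times 0.1$ of the two word
+probabilities.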
@@ -218,9 +229,9 @@ the training ham email set:
 
 $P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)$
 
-The multiplication of each of the probabilities associated with a
-particular word here stems from the naive assumption that all the words
-in the email are statistically independent. In reality, this assumption
+Multiplying the word probabilities here stems from the
+naive assumption that all the words in the email are conditionally
+independent given the class (spam or ham). In reality, this assumption
 isn't necessarily true. In fact, it's most likely false. Words in a
 language are never independent from one another, but this simple
 assumption seems to be enough for the level of complexity our problem
@@ -229,8 +240,15 @@ requires.
 The probability of a given word $word_i$ being in a given category is
 calculated like so:
 
-$$P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}$$
-$$P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}$$
+\begin{equation}
+    \tag{1.1}
+    P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}
+\end{equation}
+
+\begin{equation}
+    \tag{1.2}
+    P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}
+\end{equation}
 
 These formulas tell us exactly what we have to calculate from our data.
 We need the numbers $N_{word_i|spam}$ and $N_{word_i|ham}$ for each
@@ -242,7 +260,8 @@ in the dataset. The variable $\alpha$ is a smoothing parameter that
 prevents the probability of a given word being in a given category from
 going down to zero. If a given word hasn't appeared in the spam category
 in our training dataset, for example, we don't want to assign it zero
-probability of appearing in new spam emails.
+probability of appearing in new spam emails. See the [appendix](#appendix-alpha)
+for more details.
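+
+To make the effect of $\alpha$ concrete, here is a minimal sketch of
+formula $(1.1)$ with made-up counts. The `smoothed_probability` helper
+and all the numbers are assumptions for illustration, not values from
+our dataset.
+
+```julia
+# Additive smoothing: (word count + α) / (total count + α * vocabulary size).
+smoothed_probability(word_count, total, n_vocabulary, α) =
+    (word_count + α) / (total + α * n_vocabulary)
+
+# A word never seen in spam (word_count = 0), with a hypothetical total
+# of 1000 and a hypothetical vocabulary of 500 words.
+p_unsmoothed = smoothed_probability(0, 1000, 500, 0)  # exactly zero
+p_smoothed = smoothed_probability(0, 1000, 500, 1)    # small but positive
+```
+
+With $\alpha = 1$ the unseen word gets probability $1/1500$ instead of
+zero, so a single unseen word no longer wipes out the whole product.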
 
 As all of this information will be specific to our dataset, a clever way
 to aggregate it is to use a Julia *struct*, with attributes for the
@@ -298,7 +317,7 @@ modifies its arguments in-place (in this case, the spam filter struct
 itself). This function *fits* our model to the data, a typical procedure
 in data science and machine learning areas.
 
-```{julia, results = TRUE}
+```{julia, results = FALSE}
 function fit!(model::BayesSpamFilter, x_train, y_train, voc)
     model.vocabulary = voc
     model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
@@ -337,9 +356,9 @@ testing portion to evaluate the model's accuracy later.
 Now that we have our model, we can use it to make some spam vs. ham
 predictions and assess its performance. We'll define a few more
 functions to help with this process. First, we need a function
-implementing the TAL formula that we discussed earlier.
+implementing formulas $(1.1)$ and $(1.2)$. 
 
-```{julia, results = TRUE}
+```{julia, results = FALSE}
 function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
     ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
     spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
@@ -432,14 +451,16 @@ five emails in the test data.
 predictions[1:5]
 ```
 
-Of the first five emails, one (the third) was classified as spam, and
+Of the first five emails, the third and the fifth were classified as spam, while
 the rest were classified as ham.
 
 ## Evaluating the Accuracy
 
 Looking at the predictions themselves is pretty meaningless; what we
-really want to know is the model's accuracy. We'll define another
-function to calculate this.
+really want is a metric that lets us evaluate the effectiveness
+of our model quantitatively. A common first choice is
+the model's accuracy.
+We'll define another function for this calculation.
 
 ```{julia, results = FALSE}
 function spam_filter_accuracy(predictions, actual)
@@ -484,16 +505,14 @@ function that builds a confusion matrix for our spam filter:
 
 ```{julia, results = FALSE}
 function spam_filter_confusion_matrix(y_test, predictions)
-    # 2x2 matrix is instantiated with zeros
     confusion_matrix = zeros((2, 2))
 
-    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
-    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
-    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
-    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
+    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test))
+    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test))
+    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test))
+    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test))
 
-    # Now we convert the confusion matrix into a DataFrame 
-    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
+    confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[])
     confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
     confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))
 
@@ -549,7 +568,7 @@ functions to fit the spam filter object to the data. Finally, we made
 predictions on new data and evaluated our model's performance by
 calculating the accuracy and making a confusion matrix.
 
-## Appendix - A little more about alpha
+## Appendix - A little more about alpha {#appendix-alpha}
 
 As we have seen, to calculate the probability of the email being a spam
 email, we should use

diff --git a/04_naive_bayes/tmp.jl b/04_naive_bayes/tmp.jl
@@ -128,13 +128,13 @@ function spam_filter_confusion_matrix(y_test, predictions)
     # 2x2 matrix is instantiated with zeros
     confusion_matrix = zeros((2, 2))
 
-    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
-    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
-    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
-    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))
+    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in eachindex(y_test))
+    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in eachindex(y_test))
+    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in eachindex(y_test))
+    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in eachindex(y_test))
 
     # Now we convert the confusion matrix into a DataFrame 
-    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
+    confusion_df = DataFrame(prediction=String[], ham_mail=Integer[], spam_mail=Integer[])
     confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
     confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))