Commit e959884
s2t2 committed Sep 22, 2024 (1 parent: 9ffc729)
Showing 6 changed files with 71 additions and 74 deletions.
Binary file added docs/images/predictive-modeling-for-finance.webp
Binary file added docs/images/weather-forecast-7-day-outlook.webp
8 changes: 5 additions & 3 deletions docs/notes/predictive-modeling/index.qmd

**Predictive modeling** refers to the use of statistical techniques and machine learning algorithms to predict future outcomes based on historical data. It involves creating models that learn patterns from past observations and use them to forecast future trends or behavior. At its core, predictive modeling is about understanding the relationships between variables and using these relationships to make informed predictions.

![A seven-day weather forecast. [Source](https://i0.wp.com/www.metgraphics.net/wp-content/uploads/2018/09/T52-7-Day.jpg?ssl=1).](../../images/weather-forecast-7-day-outlook.webp)

## Relevance of Predictive Modeling in Finance

Predictive modeling has become increasingly relevant in the financial industry due to its ability to analyze vast amounts of data and provide forecasts that support strategic decision-making. In finance, the ability to make accurate predictions can provide a significant competitive edge. Whether it's forecasting stock prices, assessing credit risk, or predicting customer behavior, predictive models have transformed how financial professionals make decisions.


Here are some key areas where predictive modeling is applied in finance:

+ **Risk Management**: Predictive models are used to assess the likelihood of defaults in loans or investments. By analyzing historical data on borrowers, lenders can develop models that predict the probability of default, helping them manage credit risk effectively.

## Why Python for Predictive Modeling in Finance?

Python has emerged as a dominant language in the finance industry for predictive modeling due to its simplicity, flexibility, and vast ecosystem of third-party open source libraries. Here's why Python is particularly suited for finance:

+ **Ease of Use**: Python's simple syntax makes it easy for both beginners and experienced developers to build complex models without getting bogged down by the intricacies of coding.


+ **Integration with Financial Systems**: Python seamlessly integrates with databases, APIs, and financial platforms, enabling real-time data analysis and model deployment in production environments.

In this course, we'll explore how to leverage Python's capabilities to build predictive models tailored to various financial applications.

## Summary

Predictive modeling is a powerful tool for making data-driven decisions, and its importance in finance cannot be overstated. With growing amounts of data and advances in machine learning, the ability to make accurate financial predictions is now more accessible than ever. In the following chapters, we will dive deep into the practical aspects of building these models using Python, from basic concepts to advanced techniques, equipping you with the knowledge and tools to effectively leverage predictive analytics in real-world financial scenarios.
97 changes: 41 additions & 56 deletions docs/notes/predictive-modeling/ml-foundations/generalization.qmd
# Generalization


In machine learning, **generalization** refers to the model's ability to perform well on unseen data, rather than simply memorizing specific patterns in the training set. A model that generalizes well captures the underlying structure of the data and performs accurately when exposed to new inputs. Achieving good generalization is one of the primary goals in machine learning.

Let's discuss key concepts related to generalization, including the trade-off between overfitting and underfitting, the importance of splitting datasets into training and testing sets, and the role of cross-validation in evaluating model performance.


## Bias vs Variance

In the context of generalization, bias and variance represent two types of errors that can affect a model's performance.

**Bias** refers to errors introduced by overly simplistic models that fail to capture the underlying patterns in the data. A high-bias model makes strong assumptions about the data, resulting in consistently poor predictions on both the training and test sets.

On the other hand, **variance** refers to errors caused by overly complex models that fit the training data too closely, capturing noise along with the signal. This leads the model to perform well on the training data, but poorly on unseen test data.

![Illustration of bias vs variance, using a bull's eye. Source: [Gudivada 2017 Data](https://www.researchgate.net/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations).](../../../images/bias-variance-tradeoff.ppm)

Using the analogy of darts on a bull's eye: low bias means the predictions are accurately centered around the optimal target, whereas high bias means the attempts are consistently off the mark. Low variance means the attempts are concentrated in a small region, whereas high variance means they are spread out over a wider region. The goal is to achieve results that are both accurate (low bias) and consistent (low variance).


## Overfitting vs Underfitting

![Three different models, illustrating trade-offs between overfitting and underfitting. Source: [`sklearn` package](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html)](../../../images/sklearn-underfitting-overfitting.png)


### Overfitting

**Overfitting** occurs when a model is too complex and learns not only the underlying patterns but also the noise and random fluctuations in the training data. An overfitted model performs very well on the training data but fails to generalize to unseen data. This results in poor performance on the test set, as the model struggles to adapt to new inputs that do not perfectly match the training data.
Common causes of overfitting include:
+ Training the model for too long without proper regularization.
+ Using too many features or irrelevant features.

Symptoms of overfitting:

+ Very low training error, but significantly higher error on the validation or test set.
+ High variance in performance across different subsets of the data.


### Underfitting

**Underfitting** occurs when a model is too simple to capture the underlying structure of the data. An underfitted model performs poorly on both the training data and the test data because it fails to learn the important relationships between input features and output labels.

In technical terms, underfitting happens when a model has high bias but low variance. The model is too rigid, making overly simplistic predictions that do not adequately capture the complexities of the data.

Common causes of underfitting include:
+ Not training the model long enough or with sufficient data.
+ Using too few features or ignoring important features.

Symptoms of underfitting:

+ High error on both the training set and the test set.
+ The model makes simplistic predictions that fail to capture the complexity of the data.

### Finding a Balance


The challenge in machine learning is to find the right balance between bias and variance, often called the bias-variance tradeoff, in order to achieve good generalization. A model with the right balance will generalize well to new data by capturing the essential patterns without being too sensitive to specific details in the training data.


![Three different models, illustrating trade-offs between overfitting and underfitting. Source: [AWS Machine Learning](https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html).](../../../images/aws-underfitting-overfitting.png)
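
To make this tradeoff concrete, here is a minimal sketch (not part of the original notes) in the spirit of the `sklearn` example pictured earlier: it fits polynomial models of increasing degree to noisy synthetic data, then compares training error against test error:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic example data: a noisy sine wave
np.random.seed(99)
x = np.random.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + np.random.normal(0, 0.2, 60)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=99)

for degree in [1, 4, 15]: # underfit, balanced, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_error = mean_squared_error(y_train, model.predict(x_train))
    test_error = mean_squared_error(y_test, model.predict(x_test))
    print(f"DEGREE {degree}: TRAIN MSE {train_error:.3f} | TEST MSE {test_error:.3f}")
```

The degree-1 model scores poorly on both sets (high bias), while the degree-15 model typically scores much better on the training set than on the test set (high variance); the middle model strikes the balance.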





Additional resources about generalization:

+ <https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html>
+ <https://developers.google.com/machine-learning/crash-course/overfitting/generalization>
+ <https://developers.google.com/machine-learning/crash-course/overfitting/overfitting>



## Data Splitting
With **cross validation**, instead of relying on a single training or validation set, we evaluate the model across multiple different splits of the data.
![K-fold cross validation (k=4). Source: [Google ML Concepts](https://developers.google.com/machine-learning/glossary#k-fold-cross-validation).](../../../images/k-fold-cross-validation.png)


In **K-fold cross validation**, the dataset is divided into several subsets or "folds", and the model is trained and validated on different subsets of the data in each iteration. This provides a more comprehensive understanding of the model's performance across various data splits, making it less sensitive to any specific partitioning.


Cross validation is especially valuable when fine-tuning model hyperparameters, as it prevents overfitting to a specific validation set or the test set by providing a more generalized evaluation before the final test set assessment.
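
As a minimal sketch of what this looks like in code (assuming `x` and `y` hold the features and labels, as in the examples below), the [`cross_val_score` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) from `sklearn` automates the K-fold loop:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# train and evaluate the model across k=4 different train/validation splits
# (scores use the model's default metric, i.e. R-squared for regressors):
scores = cross_val_score(model, x, y, cv=4)
print("SCORES:", scores.round(3))
print("MEAN SCORE:", scores.mean().round(3))
```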
This section provides some practical methods for splitting data in Python.

### Shuffled Splits

In most machine learning problems, we typically perform a **shuffled split**, where the order of the data is randomized before partitioning it into training and testing sets. This randomization helps ensure the distribution in the training set closely resembles that of the test set, reducing potential biases that could skew model performance.


![Shuffled train/test split. Source: [Real Python](https://files.realpython.com/media/fig-1.c489adc748c8.png).](../../../images/shuffled-train-test-split.webp)

One major benefit of shuffling is that it helps prevent **sequential bias**, which occurs when the order of the data affects the model's ability to generalize. Sequential bias can lead to inconsistencies in predictions, especially if certain patterns or trends exist in the sequence of the data. For example:

+ In a regression problem such as predicting student grades based on study hours, imagine the dataset is originally sorted with the highest-performing students listed first and the lowest performers listed last. If we split this data sequentially, the training set would contain only high achievers, leaving the model unprepared for students with lower performance, resulting in poor predictions.
+ Similarly, in a classification task, such as distinguishing between images of dogs and cats, if the data is sequentially ordered by class (e.g. all dog images first, followed by cat images), a non-shuffled split could lead the model to train mostly on one class, causing it to over-predict that class and fail to generalize properly to the other.

By shuffling the data, we ensure a more representative sample in both the training and testing sets, reducing the risk of biased or inaccurate predictions.


One common way of implementing a shuffled two-way split in Python is to leverage the [`train_test_split` function](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from `sklearn`:


```python
from sklearn.model_selection import train_test_split

# a minimal representative call (the original snippet is collapsed in this diff view);
# assumes x is a feature matrix and y is the corresponding label vector:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=99)

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```

If we shuffle the data when performing a train/test split for time-series forecasting, it can lead to serious problems:

+ **Loss of Temporal Structure**: Time series data inherently depends on the order of observations. Shuffling breaks the sequence and removes temporal relationships, leading the model to learn patterns that don't reflect how time-dependent data actually behaves. This can distort predictions and diminish the model's forecasting ability.

+ **Unreliable Performance Metrics**: If the model is trained on future data, performance metrics like accuracy or MSE will look unrealistically good, but once deployed, the model's performance will significantly degrade, as it won't have access to future data at inference time.

Instead of shuffling time series data, we can split based on time, using methods like sequential splits or time-based cross-validation, ensuring that the training set only contains past data relative to the test set.

To implement a sequential split, assuming your data is sorted by date in ascending order: pick a cutoff date, use all samples before the cutoff as the training set, and all samples after the cutoff as the test set. This ensures the model can't rely on data from the future when making predictions for the test set.

```python
training_size = 0.8
cutoff = round(len(df) * training_size)

x_train = x.iloc[:cutoff] # all before cutoff
y_train = y.iloc[:cutoff] # all before cutoff

x_test = x.iloc[cutoff:] # all after cutoff
y_test = y.iloc[cutoff:] # all after cutoff

print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```

For your reference, here is a reusable helper function containing the same sequential split logic:

```python
def sequential_split(x, y, training_size=0.8, test_size=None):
    """Splits x and y sequentially, for time-series data.
    Assumes data is already sorted by date in ascending order.
    If test_size is provided, it overrides training_size."""
    if test_size:
        training_size = 1 - test_size
    assert len(x) == len(y)
    cutoff = round(len(x) * training_size)

    x_train = x.iloc[:cutoff] # all before cutoff
    y_train = y.iloc[:cutoff] # all before cutoff

    x_test = x.iloc[cutoff:] # all after cutoff
    y_test = y.iloc[cutoff:] # all after cutoff

    return x_train, x_test, y_train, y_test


x_train, x_test, y_train, y_test = sequential_split(x, y)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
```
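
For the time-based cross-validation mentioned above, `sklearn` provides the [`TimeSeriesSplit` class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html), which yields successive folds in which each training set contains only observations from before its corresponding test set. A minimal sketch, assuming `x` and `y` are pandas objects sorted by date in ascending order:

```python
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=4)

# each successive fold trains on a longer prefix of the data,
# and tests on the observations that immediately follow it:
for i, (train_index, test_index) in enumerate(tss.split(x)):
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print(f"FOLD {i}: TRAIN {x_train.shape} | TEST {x_test.shape}")
```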