Autocorrelation

prof-rossetti · Sep 20, 2024 · 4d32d5f · 4d32d5f
1 parent 8ff1bdf
commit 4d32d5f

Showing 2 changed files with 217 additions and 3 deletions.
diff --git a/docs/notes/predictive-modeling/autoregressive-models/autocorrelation.qmd b/docs/notes/predictive-modeling/autoregressive-models/autocorrelation.qmd
@@ -31,10 +31,223 @@ In addition to interpreting the autocorrelation values themselves, we can examin
 
 ## Calculating Autocorrelation in Python
 
-In Python, we can calculate autocorrelation using the [`acf` function](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html) from `statsmodels. The autocorrelation function (ACF) calculates the correlation of a time series with its lagged values, providing a guide to the structure of dependencies within the data.
+In Python, we can calculate autocorrelation using the [`acf` function](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html) from the `statsmodels` package. The autocorrelation function (ACF) calculates the correlation of a time series with its lagged values, providing a guide to the structure of dependencies within the data.
+
+
+```python
+from statsmodels.tsa.stattools import acf
+
+n_lags = 12 # we choose number of periods to consider
+
+acf_results = acf(time_series, nlags=n_lags, fft=False)
+print(type(acf_results)) #> np.ndarray
+print(len(acf_results)) #> 13
+```
+
+:::{.callout-note }
+The `acf` function returns one more value than the number of lagging periods we chose. The first value is the lag-zero autocorrelation (the series correlated with itself), and is always equal to 1.
+:::
 
 ## Examples of Autocorrelation
 
-### Autocorrelation of Random Data
+Let's conduct autocorrelation analysis on two example datasets, to illustrate the concepts and techniques. First, we will use a randomly generated dataset of numbers, which exhibits weak or non-existent autocorrelation. We will then use a dataset of baseball team performance, to see which teams perform most consistently from year to year.
+
+### Example 1: Autocorrelation of Random Data
+
+In this first example, we will use a randomly generated series of data, where there is no relationship between each value and its previous values.
+
+#### Data Simulation
+
+Here we are generating a random distribution of numbers using the [`random.normal` function](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) from `numpy`:
+
+```{python}
+import numpy as np
+
+y_rand = np.random.normal(loc=0, scale=1, size=1000) # mean, std, n_samples
+print(type(y_rand))
+print(y_rand.shape)
+```
+
+#### Data Exploration
+
+We plot the data to show that, although the values are normally distributed, the sequence from one datapoint to the next is just random noise:
+
+```{python}
+import plotly.express as px
+
+px.histogram(y_rand, height=350, title="Random Numbers (Normal Distribution)")
+```
+
+```{python}
+px.scatter(y_rand, height=350, title="Random Numbers (Normal Distribution)")
+```
+
+#### Calculating Autocorrelation
+
+We use the `acf` function from `statsmodels` to calculate autocorrelation, passing the data series in as the first parameter:
+
+```{python}
+from statsmodels.tsa.stattools import acf
+
+n_lags = 10 # we choose number of periods to consider
+
+acf_rand = acf(y_rand, nlags=n_lags, fft=False)
+print(type(acf_rand))
+print(len(acf_rand))
+print(list(acf_rand.round(3)))
+```
+
+Finally, we plot the autocorrelation results to visually examine the autocorrelation structure of the data:
+
+```{python}
+px.line(x=list(range(len(acf_rand))), y=acf_rand, markers=True, height=350,
+        title="Autocorrelation of a series of random numbers",
+        labels={"x": "Number of Lags", "y":"Autocorrelation"}
+)
+```
+
+We see that, for this randomly generated dataset, although the current value is perfectly correlated with itself (as expected), it has essentially no correlation with the previous values.
+
+### Example 2: Autocorrelation of Baseball Team Performance
+
+Alright, so we have seen an example where there is weak autocorrelation. But let's examine another example where there is some moderately strong autocorrelation between current and past values. We will use a dataset of baseball team performance, where there may be some correlation between a team's current performance and its recent past performance.
+
+#### Data Loading
+
+Here we are loading a dataset of baseball team statistics, for four different baseball teams:
+
+```{python}
+from pandas import read_excel
+
+repo_url = "https://github.com/prof-rossetti/python-for-finance"
+file_url = f"{repo_url}/raw/refs/heads/main/docs/data/baseball_data.xlsx"
+
+teams = [
+    {"abbrev": "NYY", "sheet_name": "ny_yankees"  , "color": "#1f77b4"},
+    {"abbrev": "BOS", "sheet_name": "bo_redsox"   , "color": "#d62728"},
+    {"abbrev": "BAL", "sheet_name": "balt_orioles", "color": "#ff7f0e"},
+    {"abbrev": "TOR", "sheet_name": "tor_blujays" , "color": "#17becf"},
+]
+for team in teams:
+    team_df = read_excel(file_url, sheet_name=team["sheet_name"])
+    team_df.index = team_df["Year"]
+
+    print("----------------")
+    print(team["abbrev"], len(team_df), team_df.index.min(), team_df.index.max())
+    print(team_df.columns.tolist())
+
+    team["df"] = team_df # storing for later
+
+```
+
+We see there are a different number of rows for each of the teams, depending on what year they were established.
+
+Merging the datasets into a single table will make it easier for us to chart this data, especially since we only care about analyzing annual performance (win-loss percentage):
+
+```{python}
+from pandas import DataFrame
+
+df = DataFrame()
+for team in teams:
+    df[team["abbrev"]] = team["df"]["W-L%"]
+df
+```
+
+#### Data Exploration
+
+Performing exploratory analysis:
+
+
+```{python}
+px.line(df, y=["NYY", "BOS", "BAL", "TOR"], height=450,
+    title="Baseball Team Annual Win Percentages",
+    labels={"value": "Win Percentage", "variable": "Team"}
+)
+```
+
+Whoa, there's a lot going on here.
+
+:::{.callout-tip title="Interactive dataviz"}
+Click a team name in the legend to toggle that series on or off.
+:::
+
+We can use aggregations to get a better sense of which teams might do better on average.
+
+```{python}
+#df.describe().round(3)
+```
+
+```{python}
+means = df.mean(axis=0).round(3) # get the mean for each column
+means.name = "Average Performance"
+means.sort_values(ascending=True, inplace=True)
+
+team_colors_map = {team['abbrev']: team['color'] for team in teams}
+
+px.bar(y=means.index, x=means.values, orientation="h", height=350,
+       title=f"Average Win Percentage ({df.index.min()} to {df.index.max()})",
+       labels={"x": "Win Percentage", "y": "Team"},
+       color=means.index, color_discrete_map=team_colors_map
+)
+```
+
+
+We can also calculate and visualize moving averages to get a smoother trend of each team's performance over time:
+
+```{python}
+window = 20
+
+ma_df = DataFrame()
+for team_name in df.columns:
+    moving_avg = df[team_name].rolling(window=window).mean()
+    #ma_df[f"{team_name}_ma_{window}"] = moving_avg
+    ma_df[team_name] = moving_avg
+
+```
+
+```{python}
+px.line(ma_df, y=ma_df.columns.tolist(), height=450,
+        title=f"Baseball Team Win Percentages ({window} Year Moving Avg)",
+        labels={"value": "Win Percentage", "variable": "Team"},
+        color_discrete_map=team_colors_map
+
+)
+```
+
+#### Calculating Autocorrelation
+
+OK, sure, we can analyze which teams do better on average, and how well each team performs over time. But with autocorrelation, we care about how consistent each team's results are with its past performance (or, put another way, how consistent each team's future results will be with its current performance).
+
+Calculating autocorrelation of performance for each team (using the same number of lagging periods for each team):
+
+```{python}
+from statsmodels.tsa.stattools import acf
+
+n_lags = 10
+
+acf_df = DataFrame()
+for team_name in df.columns:
+    acf_results = acf(df[team_name], nlags=n_lags, fft=True, missing="drop")
+    acf_df[team_name] = acf_results
+
+```
+
+
+Plotting the autocorrelation results on a graph helps us compare the results for each team:
+
+```{python}
+px.line(acf_df, y=["NYY", "BOS", "BAL", "TOR"], markers=True, height=450,
+        title="Autocorrelation of Annual Baseball Team Performance",
+        labels={"variable": "Team", "value": "Autocorrelation",
+                "index": "Number of lags"
+        },
+        color_discrete_map=team_colors_map
+
+)
+```
+
+The autocorrelation results help us understand the consistency in performance of each team from year to year.
+
+For each team, how correlated is its performance from one year to the next year? How about two years out? How about three years out?
 
-### Autocorrelation of Baseball Team Performance
+Which team is most consistent in its performance from year to year, over the entire 10-year period?
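
As a rough illustration of the questions that close the notes above, the sketch below computes autocorrelation by hand for two simulated series: white noise, and a hypothetical "persistent" series in which each value carries over 80% of the previous one. It is not part of the commit. The `autocorr` helper, the 0.8 carry-over factor, and the fixed seed are all assumptions made for illustration, and the helper is only meant to mirror the usual sample estimator (lagged covariance divided by the lag-zero variance), not to reproduce `statsmodels` internals.

```python
import random


def autocorr(series, n_lags):
    """Sample autocorrelation for lags 0..n_lags (hypothetical helper).

    Each lag's covariance is divided by the lag-zero variance,
    so the first value is always 1.
    """
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    variance = sum(d * d for d in deviations)
    return [
        sum(deviations[t] * deviations[t - lag] for t in range(lag, n)) / variance
        for lag in range(n_lags + 1)
    ]


random.seed(42)  # fixed seed (an assumption) so the sketch is reproducible

# white noise: each value is independent of the values before it
noise = [random.gauss(0, 1) for _ in range(1000)]

# persistent series (hypothetical): each value keeps 80% of the previous one
persistent = [0.0]
for shock in noise[1:]:
    persistent.append(0.8 * persistent[-1] + shock)

acf_noise = autocorr(noise, n_lags=3)
acf_persistent = autocorr(persistent, n_lags=3)

print(len(acf_noise))  # n_lags + 1 values, including lag zero
print([round(value, 3) for value in acf_noise])
print([round(value, 3) for value in acf_persistent])
```

Reading the two printed lists side by side, the noise series should drop to roughly zero after lag zero, while the persistent series should decay gradually as the lag grows. A team whose autocorrelation curve looks more like the second list is, in this sense, more consistent from one year to the next.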
diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -24,6 +24,7 @@ lxml # bs4 needs this to parse XML
 scipy
 
 
+openpyxl # for pandas.read_excel
 pandas_datareader
 yahooquery
 yfinance