Autocorrelation
s2t2 committed Sep 20, 2024
1 parent 8ff1bdf commit 4d32d5f
Showing 2 changed files with 217 additions and 3 deletions.
@@ -31,10 +31,223 @@ In addition to interpreting the autocorrelation values themselves, we can examin

## Calculating Autocorrelation in Python

In Python, we can calculate autocorrelation using the [`acf` function](https://www.statsmodels.org/stable/generated/statsmodels.tsa.stattools.acf.html) from the `statsmodels` package. The autocorrelation function (ACF) calculates the correlation of a time series with its lagged values, providing a guide to the structure of dependencies within the data.


```python
from statsmodels.tsa.stattools import acf

n_lags = 12 # we choose number of periods to consider

acf_results = acf(time_series, nlags=n_lags, fft=False)
print(type(acf_results)) #> np.ndarray
print(len(acf_results)) #> 13
```

:::{.callout-note }
The array returned by the autocorrelation function contains one more value than the number of lagging periods we chose. The first value (lag 0) represents the correlation of the series with itself, and is always equal to 1.
:::
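
To see where the extra value comes from, here is a minimal sketch of the calculation written directly in `numpy`. It assumes the standard (biased) sample estimator, which to our understanding is what `acf` uses by default; `manual_acf` and `toy_series` are illustrative names, not part of `statsmodels`:

```python
import numpy as np

def manual_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags (biased estimator)."""
    x = np.asarray(x, dtype=float)
    deviations = x - x.mean()
    denominator = np.sum(deviations ** 2)
    # at lag 0 every value is paired with itself, so the ratio is exactly 1
    return np.array([
        np.sum(deviations[lag:] * deviations[:len(x) - lag]) / denominator
        for lag in range(nlags + 1)
    ])

toy_series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical data
toy_acf = manual_acf(toy_series, nlags=3)
print(len(toy_acf))  #> 4
print(toy_acf.round(3))
```

Asking for three lags returns four values, because lag 0 is included at the front of the array.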

## Examples of Autocorrelation

Let's conduct autocorrelation analysis on two example datasets, to illustrate the concepts and techniques. First, we will use a randomly generated dataset of numbers, which exhibits weak or non-existent autocorrelation. We will then use a dataset of baseball team performance, to see which teams perform most consistently from year to year.

### Example 1: Autocorrelation of Random Data

In this first example, we will use a randomly generated series of data, where there is no relationship between each value and its previous values.

#### Data Simulation

Here we are generating a random distribution of numbers using the [`random.normal` function](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html) from `numpy`:

```{python}
import numpy as np
y_rand = np.random.normal(loc=0, scale=1, size=1000) # mean, std, n_samples
print(type(y_rand))
print(y_rand.shape)
```
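
Because the numbers above are drawn at random, the results will differ slightly on every run. If reproducible results are preferred, one option is a seeded generator from `numpy`. This is a sketch only; the seed value and the names `y_seeded` and `y_repeat` are arbitrary choices for illustration:

```python
import numpy as np

seed = 2024  # hypothetical seed value; any integer works
rng = np.random.default_rng(seed)
y_seeded = rng.normal(loc=0, scale=1, size=1000)

# a second generator with the same seed reproduces the same draws
y_repeat = np.random.default_rng(seed).normal(loc=0, scale=1, size=1000)
print(y_seeded.shape)
print(np.array_equal(y_seeded, y_repeat))
```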

#### Data Exploration

We plot the data to show that, although the values are normally distributed, the sequence from one datapoint to the next is just random noise:

```{python}
import plotly.express as px
px.histogram(y_rand, height=350, title="Random Numbers (Normal Distribution)")
```

```{python}
px.scatter(y_rand, height=350, title="Random Numbers (Normal Distribution)")
```

#### Calculating Autocorrelation

We use the `acf` function from `statsmodels` to calculate autocorrelation, passing the data series in as the first parameter:

```{python}
from statsmodels.tsa.stattools import acf
n_lags = 10 # we choose number of periods to consider
acf_rand = acf(y_rand, nlags=n_lags, fft=False)
print(type(acf_rand))
print(len(acf_rand))
print(list(acf_rand.round(3)))
```

Finally, we plot the autocorrelation results to visually examine the autocorrelation structure of the data:

```{python}
px.line(y=acf_rand, markers=True, height=350,
    title="Autocorrelation of a series of random numbers",
    labels={"x": "Number of Lags", "y":"Autocorrelation"}
)
```

We see that, for this randomly generated dataset, although each value is perfectly correlated with itself at lag 0 (as expected), it has essentially no correlation with the previous values.
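
The values at non-zero lags are rarely exactly zero, so it can help to have a rough yardstick for "close enough to zero". A common rule of thumb for white noise is an approximate 95% band of ±1.96/√n. The sketch below applies that rule to a fresh random series; the seed and variable names are assumptions for illustration, and the autocorrelation is computed by hand with `numpy` rather than with `acf`:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
noise = rng.normal(loc=0, scale=1, size=1000)

# approximate 95% band for the autocorrelation of white noise
band = 1.96 / np.sqrt(len(noise))

deviations = noise - noise.mean()
denominator = np.sum(deviations ** 2)
lags = range(1, 11)
acf_values = np.array([
    np.sum(deviations[lag:] * deviations[:-lag]) / denominator
    for lag in lags
])

n_outside = int(np.sum(np.abs(acf_values) > band))
print(round(band, 3))
print(n_outside, "of", len(acf_values), "lags fall outside the band")
```

With 1,000 observations the band is roughly ±0.06, so for random data we would expect most lags to fall inside it.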

### Example 2: Autocorrelation of Baseball Team Performance

Alright, so we have seen an example where there is weak autocorrelation. But let's examine another example where there is some moderately strong autocorrelation between current and past values. We will use a dataset of baseball team performance, where there may be some correlation between a team's current performance and its recent past performance.

#### Data Loading

Here we are loading a dataset of baseball team statistics, for four different baseball teams:

```{python}
from pandas import read_excel
repo_url = "https://github.com/prof-rossetti/python-for-finance"
file_url = f"{repo_url}/raw/refs/heads/main/docs/data/baseball_data.xlsx"
teams = [
    {"abbrev": "NYY", "sheet_name": "ny_yankees"  , "color": "#1f77b4"},
    {"abbrev": "BOS", "sheet_name": "bo_redsox"   , "color": "#d62728"},
    {"abbrev": "BAL", "sheet_name": "balt_orioles", "color": "#ff7f0e"},
    {"abbrev": "TOR", "sheet_name": "tor_blujays" , "color": "#17becf"},
]
for team in teams:
    # each team's data lives in its own sheet of the workbook
    team_df = read_excel(file_url, sheet_name=team["sheet_name"])
    team_df.index = team_df["Year"]
    print("----------------")
    print(team["abbrev"], len(team_df), team_df.index.min(), team_df.index.max())
    print(team_df.columns.tolist())
    team["df"] = team_df # storing for later
```

We see there are a different number of rows for each of the teams, depending on what year they were established.

Merging the dataset will make it easier for us to chart this data, especially when we only care about analyzing annual performance (win-loss percentage):

```{python}
from pandas import DataFrame
df = DataFrame()
for team in teams:
    df[team["abbrev"]] = team["df"]["W-L%"]
df
```
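
One detail worth knowing about this merge: when columns are assigned into an empty `DataFrame`, the first series determines the index, and later series are aligned to it. Any years that appear only in a later team's data are dropped. The sketch below illustrates this with made-up numbers (teams "A" and "B" are hypothetical), alongside `concat`, which is one way to keep the union of all years instead:

```python
from pandas import DataFrame, Series, concat

# hypothetical win percentages; team "B" has one extra early season
team_a = Series([0.500, 0.600], index=[2001, 2002], name="A")
team_b = Series([0.400, 0.500, 0.550], index=[2000, 2001, 2002], name="B")

# column assignment: the first series sets the index, later ones align to it
assigned = DataFrame()
assigned["A"] = team_a
assigned["B"] = team_b
print(assigned.index.tolist())

# outer-join concat keeps the union of all years, filling gaps with NaN
merged = concat([team_a, team_b], axis=1).sort_index()
print(merged.index.tolist())
```

Whether this matters depends on which team is listed first and how far back each team's records go.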

#### Data Exploration

Performing exploratory analysis:


```{python}
px.line(df, y=["NYY", "BOS", "BAL", "TOR"], height=450,
    title="Baseball Team Annual Win Percentages",
    labels={"value": "Win Percentage", "variable": "Team"}
)
```

Whoa, there's a lot going on here.

:::{.callout-tip title="Interactive dataviz"}
Click a team name in the legend to toggle that series on or off.
:::

We can use aggregations to get a better sense of which teams might do better on average.

```{python}
#df.describe().round(3)
```

```{python}
means = df.mean(axis=0).round(3) # get the mean for each column
means.name = "Average Performance"
means.sort_values(ascending=True, inplace=True)
team_colors_map = {team['abbrev']: team['color'] for team in teams}
px.bar(y=means.index, x=means.values, orientation="h", height=350,
    title=f"Average Win Percentage ({df.index.min()} to {df.index.max()})",
    labels={"x": "Win Percentage", "y": "Team"},
    color=means.index, color_discrete_map=team_colors_map
)
```


We can also calculate and visualize moving averages to get a smoother trend of each team's performance over time:

```{python}
window = 20
ma_df = DataFrame()
for team_name in df.columns:
    moving_avg = df[team_name].rolling(window=window).mean()
    #ma_df[f"{team_name}_ma_{window}"] = moving_avg
    ma_df[team_name] = moving_avg
```
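
Note that a rolling mean produces no value until a full window of observations is available, so with a 20-year window the first 19 entries for each team will be missing. The following small sketch, using a hypothetical five-value series and a window of three, shows this behavior and the optional `min_periods` parameter that relaxes it:

```python
from pandas import Series

toy = Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data

# default: a value appears only once the window is full
strict = toy.rolling(window=3).mean()
print(strict.tolist())  # first two entries are NaN

# min_periods=1 averages whatever observations are available so far
partial = toy.rolling(window=3, min_periods=1).mean()
print(partial.tolist())
```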

```{python}
px.line(ma_df, y=ma_df.columns.tolist(), height=450,
    title=f"Baseball Team Win Percentages ({window} Year Moving Avg)",
    labels={"value": "Win Percentage", "variable": "Team"},
    color_discrete_map=team_colors_map
)
```

#### Calculating Autocorrelation

OK, sure we can analyze which teams do better on average, and how well each team performs over time, but with autocorrelation we care about how consistent each team's results are with its past performance (or put another way, how consistent each team's future results will be with its current performance).
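
To make the idea of consistency with past performance concrete, here is a small sketch using two made-up teams. It uses the `autocorr` method from `pandas`, which computes the Pearson correlation between a series and a shifted copy of itself. This is a close cousin of the `statsmodels` calculation rather than an identical one, so the numbers may differ slightly from what `acf` would report. The series names and values are hypothetical:

```python
from pandas import Series

# hypothetical win percentages for two made-up teams
steady = Series([0.40, 0.42, 0.45, 0.50, 0.55, 0.58, 0.60, 0.61])
erratic = Series([0.60, 0.40, 0.60, 0.40, 0.60, 0.40, 0.60, 0.40])

# correlation between each season and the season before it
steady_lag1 = steady.autocorr(lag=1)
erratic_lag1 = erratic.autocorr(lag=1)
print(round(steady_lag1, 3))
print(round(erratic_lag1, 3))
```

The steadily improving team has a strongly positive lag-1 value, because a good season tends to follow a good season. The team that alternates between good and bad seasons has a strongly negative value.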

Calculating autocorrelation of performance for each team (using the same number of lagging periods for each team):

```{python}
from statsmodels.tsa.stattools import acf
n_lags = 10
acf_df = DataFrame()
for team_name in df.columns:
    # missing="drop" discards NaN values (e.g. seasons before a team existed)
    acf_results = acf(df[team_name], nlags=n_lags, fft=True, missing="drop")
    acf_df[team_name] = acf_results
```


Plotting the autocorrelation results on a graph helps us compare the results for each team:

```{python}
px.line(acf_df, y=["NYY", "BOS", "BAL", "TOR"], markers=True, height=450,
    title="Auto-correlation of Annual Baseball Team Performance",
    labels={"variable": "Team", "value": "Autocorrelation",
            "index": "Number of lags"
    },
    color_discrete_map=team_colors_map
)
```

The autocorrelation results help us understand the consistency in performance of each team from year to year.

For each team, how correlated is its performance from one year to the next year? How about two years out? How about three years out?

Which team is most consistent in its performance from year to year, over the entire 10-year lag period?
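
One possible way to approach this question numerically is to average each team's autocorrelation across the non-zero lags and compare. This is only one of several reasonable summaries, and the table below uses hypothetical values for two made-up teams ("AAA" and "BBB"), not the real results:

```python
from pandas import DataFrame

# hypothetical autocorrelation values (rows are lags 0 to 3), not real results
toy_acf_df = DataFrame({
    "AAA": [1.0, 0.60, 0.45, 0.30],
    "BBB": [1.0, 0.35, 0.10, -0.05],
})

# skip lag 0, which is always 1, then average the remaining lags per team
mean_acf = toy_acf_df.iloc[1:].mean(axis=0)
print(mean_acf.round(3).to_dict())
print(mean_acf.idxmax())  # team with the highest average autocorrelation
```

The same two lines could be applied to a table shaped like `acf_df`, where each column is a team and each row is a lag.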
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -24,6 +24,7 @@ lxml # bs4 needs this to parse XML
scipy


openpyxl # for pandas.read_excel
pandas_datareader
yahooquery
yfinance