
Commit 07ab6a3

clear TODOs
1 parent c1b5199 commit 07ab6a3

File tree

2 files changed: +101 -44 lines changed

lectures/distributions.md

Lines changed: 0 additions & 14 deletions
This file was deleted.

lectures/prob_dist.md

Lines changed: 101 additions & 30 deletions
@@ -11,13 +11,14 @@ kernelspec:
  name: python3
---

+++ {"user_expressions": []}

# Distributions and Probabilities

## Outline

In this lecture we give a quick introduction to data and probability distributions using Python.

```{code-cell} ipython3
!pip install --upgrade yfinance
```
@@ -28,8 +29,11 @@ import pandas as pd
import numpy as np
import yfinance as yf
import scipy.stats
import seaborn as sns
```

+++ {"user_expressions": []}

## Common distributions

In this section we recall the definitions of some well-known distributions and show how to manipulate them with SciPy.
@@ -85,6 +89,8 @@ n = 10
u = scipy.stats.randint(1, n+1)
```

+++ {"user_expressions": []}

Here's the mean and variance

```{code-cell} ipython3
@@ -95,6 +101,8 @@ u.mean()
u.var()
```

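For a quick cross-check, these numbers can be compared with the closed-form moments of the uniform distribution on $\{1, \ldots, n\}$, namely $(n+1)/2$ and $(n^2-1)/12$ (a sketch reusing the `n` defined above):

```{code-cell} ipython3
# Closed-form mean and variance of the uniform distribution on {1, ..., n}
(n + 1) / 2, (n**2 - 1) / 12
```
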
+++ {"user_expressions": []}

Now let's evaluate the PMF

```{code-cell} ipython3
@@ -105,6 +113,8 @@ u.pmf(1)
u.pmf(2)
```

+++ {"user_expressions": []}

Here's a plot of the probability mass function:

```{code-cell} ipython3
@@ -116,6 +126,8 @@ ax.set_xticks(S)
plt.show()
```

+++ {"user_expressions": []}

Here's a plot of the CDF:

```{code-cell} ipython3
@@ -127,17 +139,19 @@ ax.set_xticks(S)
plt.show()
```

+++ {"user_expressions": []}

The CDF jumps up by $p(x_i)$ at $x_i$.

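A small numerical illustration of this jump, reusing the `u` defined above (the point $x = 2$ is arbitrary):

```{code-cell} ipython3
# The CDF increases by exactly the probability mass at x
x = 2
u.cdf(x) - u.cdf(x - 1), u.pmf(x)
```
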
+++ {"user_expressions": []}

#### Exercise

Calculate the mean and variance directly from the PMF, using the expressions given above.

Check that your answers agree with `u.mean()` and `u.var()`.
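
One possible solution sketch, assuming the support `S = np.arange(1, n+1)` used in the plots above:

```{code-cell} ipython3
S = np.arange(1, n+1)              # support of the distribution
p = u.pmf(S)                       # probability attached to each point
mean = np.sum(S * p)               # E X = Σ x p(x)
var = np.sum((S - mean)**2 * p)    # Var X = Σ (x - E X)² p(x)
mean, var
```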

+++ {"user_expressions": []}

#### Binomial distribution

@@ -174,6 +188,8 @@ ax.set_xticks(S)
plt.show()
```

+++ {"user_expressions": []}

Here's the CDF

```{code-cell} ipython3
@@ -185,19 +201,22 @@ ax.set_xticks(S)
plt.show()
```

+++ {"user_expressions": []}

#### Exercise

Using `u.pmf`, check that our definition of the CDF given above calculates the same function as `u.cdf`.
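
A sketch of one way to perform this check, assuming the support `S = np.arange(0, n+1)` used for the binomial plots above:

```{code-cell} ipython3
S = np.arange(0, n+1)
# Cumulative sums of the PMF implement F(x) = Σ_{x_i ≤ x} p(x_i)
np.allclose(np.cumsum(u.pmf(S)), u.cdf(S))
```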

+++ {"user_expressions": []}

#### Poisson distribution

+++ {"user_expressions": []}

## Continuous distributions

+++ {"user_expressions": []}

Continuous distributions are represented by a **density function**, which is a function $p$ over $\mathbb R$ (the set of all real numbers) such that $p(x) \geq 0$ for all $x$ and

@@ -226,7 +245,7 @@ $$
= \int_{-\infty}^y p(x) dx
$$

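As a numerical illustration of this identity, here is a minimal sketch using the standard normal (the names `dist` and `y` are illustrative):

```{code-cell} ipython3
from scipy.integrate import quad

dist = scipy.stats.norm()                  # standard normal as an example
y = 0.5
integral, _ = quad(dist.pdf, -np.inf, y)   # ∫_{-∞}^y p(x) dx
integral, dist.cdf(y)                      # these two numbers should agree
```
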
+++ {"user_expressions": []}

#### Normal distribution

@@ -252,6 +271,8 @@ u = scipy.stats.norm(μ, σ)
u.mean(), u.var()
```

+++ {"user_expressions": []}

Here's a plot of the density --- the famous "bell-shaped curve":

```{code-cell} ipython3
@@ -261,6 +282,8 @@ ax.plot(x_grid, u.pdf(x_grid))
plt.show()
```

+++ {"user_expressions": []}

Here's a plot of the CDF:

```{code-cell} ipython3
@@ -270,21 +293,23 @@ ax.set_ylim(0, 1)
plt.show()
```

+++ {"user_expressions": []}

#### Lognormal distribution

+++ {"user_expressions": []}

#### Exponential distribution

+++ {"user_expressions": []}

#### Beta distribution

+++ {"user_expressions": []}

## Observed distributions

+++ {"user_expressions": []}

Sometimes we refer to observed data or measurements as "distributions".

@@ -306,6 +331,8 @@ df = pd.DataFrame(data, columns=['name', 'income'])
df
```

+++ {"user_expressions": []}

In this situation, we might refer to the set of their incomes as the "income distribution."

The terminology is confusing because this is not the same thing as a probability distribution --- it's just a collection of numbers.
@@ -314,7 +341,7 @@ Below we explore some observed distributions.

We will see that there are connections between observed distributions---like the income distribution above---and probability distributions.

+++ {"user_expressions": []}

### Summary statistics

@@ -332,7 +359,7 @@
\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2
$$

+++ {"user_expressions": []}

For the income distribution given above, we can calculate these numbers via

@@ -344,11 +371,13 @@ x = np.asarray(df['income'])
x.mean(), x.var()
```

+++ {"user_expressions": []}

#### Exercise

Check that the formulas given above produce the same numbers.
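
A direct implementation of the two formulas (a sketch; `np.var` uses the same $1/n$ convention by default, so the numbers should match exactly):

```{code-cell} ipython3
n = len(x)
x_bar = np.sum(x) / n              # sample mean
var = np.sum((x - x_bar)**2) / n   # sample variance with the 1/n convention
x_bar, var
```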

+++ {"user_expressions": []}

### Visualization

@@ -360,11 +389,11 @@ We will cover
- kernel density estimates and
- violin plots

+++ {"user_expressions": []}

#### Histograms

+++ {"user_expressions": []}

We can histogram the income distribution we just constructed as follows

@@ -375,6 +404,8 @@ ax.hist(x, bins=5, density=True, histtype='bar')
plt.show()
```

+++ {"user_expressions": []}

Let's look at a distribution from real data.

In particular, we will look at the monthly return on Amazon shares between 2000/1/1 and 2023/1/1.
@@ -390,12 +421,16 @@ data = prices.pct_change()[1:] * 100
data.head()
```

+++ {"user_expressions": []}

The first observation is the monthly return (percent change) over January 2000, which was

```{code-cell} ipython3
data[0]
```
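
This number can be reproduced directly from the first two prices, since `pct_change` computes exactly this ratio (a sketch):

```{code-cell} ipython3
# Monthly return = (new price - old price) / old price, in percent
(prices.iloc[1] - prices.iloc[0]) / prices.iloc[0] * 100
```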

+++ {"user_expressions": []}

Let's turn the return observations into an array and histogram it.

```{code-cell} ipython3
@@ -408,15 +443,39 @@ ax.hist(x_amazon, bins=20)
plt.show()
```

+++ {"user_expressions": []}

#### Kernel density estimates

Kernel density estimation (KDE) is a non-parametric way to estimate and visualize the PDF of a distribution.

The KDE generates a smooth curve that approximates the PDF.

```{code-cell} ipython3
fig, ax = plt.subplots()
sns.kdeplot(x_amazon, ax=ax)
plt.show()
```

The smoothness of the KDE depends on how we choose the bandwidth.

```{code-cell} ipython3
fig, ax = plt.subplots()
sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.1, alpha=0.5, label="bw=0.1")
sns.kdeplot(x_amazon, ax=ax, bw_adjust=0.5, alpha=0.5, label="bw=0.5")
sns.kdeplot(x_amazon, ax=ax, bw_adjust=1, alpha=0.5, label="bw=1")
plt.legend()
plt.show()
```

When we use a larger bandwidth, the KDE is smoother.

A suitable bandwidth is one that is neither too smooth (underfitting) nor too wiggly (overfitting).
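
For comparison, a similar estimate can be built by hand with `scipy.stats.gaussian_kde` (a sketch; its `bw_method` argument plays a role similar to seaborn's `bw_adjust`, although the two scalings are not identical):

```{code-cell} ipython3
kde = scipy.stats.gaussian_kde(x_amazon)   # bandwidth chosen by Scott's rule by default
grid = np.linspace(x_amazon.min(), x_amazon.max(), 200)
fig, ax = plt.subplots()
ax.plot(grid, kde(grid))
plt.show()
```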

#### Violin plots

+++ {"user_expressions": []}

Yet another way to display an observed distribution is via a violin plot.

@@ -426,17 +485,30 @@ ax.violinplot(x_amazon)
plt.show()
```

+++ {"user_expressions": []}

Violin plots are particularly useful when we want to compare different distributions.

For example, let's compare the monthly returns on Amazon shares with the monthly returns on Apple shares.

```{code-cell} ipython3
df = yf.download('AAPL', '2000-1-1', '2023-1-1', interval='1mo')
prices = df['Adj Close']
data = prices.pct_change()[1:] * 100
x_apple = np.asarray(data)
```

```{code-cell} ipython3
fig, ax = plt.subplots()
ax.violinplot([x_amazon, x_apple])
plt.show()
```
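
To make the comparison easier to read, the two violins can be labeled (a sketch; by default `violinplot` places the datasets at positions 1 and 2):

```{code-cell} ipython3
fig, ax = plt.subplots()
ax.violinplot([x_amazon, x_apple])
ax.set_xticks([1, 2])
ax.set_xticklabels(['AMZN', 'AAPL'])
plt.show()
```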

+++ {"user_expressions": []}

### Connection to probability distributions

+++ {"user_expressions": []}

Let's discuss the connection between observed distributions and probability distributions.

@@ -465,12 +537,13 @@ ax.hist(x_amazon, density=True, bins=40)
plt.show()
```

+++ {"user_expressions": []}

The match between the histogram and the density is not terrible, but not very good either.

One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about heavy-tailed distributions in TODO add link.

+++ {"user_expressions": []}

Of course, if the data really *is* generated by the normal distribution, then the fit will be better.

@@ -491,10 +564,8 @@ ax.hist(x_draws, density=True, bins=40)
plt.show()
```

+++ {"user_expressions": []}

Note that if you keep increasing $N$, which is the number of observations, the fit will get better and better.

This convergence is a version of the "law of large numbers", which we will discuss in {ref}`lln_mr`.
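
A minimal numerical illustration of this convergence, assuming the `μ` and `σ` defined in the normal distribution section are still in scope:

```{code-cell} ipython3
# Sample means approach μ as the number of observations N grows
for N in (10, 1_000, 100_000):
    print(N, scipy.stats.norm(μ, σ).rvs(N).mean())
```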
