 In this lecture we give a quick introduction to data and probability distributions using Python
 
-
 ```{code-cell} ipython3
 !pip install --upgrade yfinance
 ```
@@ -28,8 +29,11 @@ import pandas as pd
 import numpy as np
 import yfinance as yf
 import scipy.stats
+import seaborn as sns
 ```
 
++++ {"user_expressions": []}
+
 ## Common distributions
 
 In this section we recall the definitions of some well-known distributions and show how to manipulate them with SciPy.
@@ -85,6 +89,8 @@ n = 10
 u = scipy.stats.randint(1, n+1)
 ```
 
++++ {"user_expressions": []}
+
 Here's the mean and variance
 
 ```{code-cell} ipython3
@@ -95,6 +101,8 @@ u.mean()
 u.var()
 ```
 
++++ {"user_expressions": []}
+
 Now let's evaluate the PMF
 
 ```{code-cell} ipython3
@@ -105,6 +113,8 @@ u.pmf(1)
 u.pmf(2)
 ```
 
++++ {"user_expressions": []}
+
 Here's a plot of the probability mass function:
 
 ```{code-cell} ipython3
@@ -116,6 +126,8 @@ ax.set_xticks(S)
 plt.show()
 ```
 
++++ {"user_expressions": []}
+
 Here's a plot of the CDF:
 
 ```{code-cell} ipython3
@@ -127,17 +139,19 @@ ax.set_xticks(S)
 plt.show()
 ```
 
++++ {"user_expressions": []}
+
 The CDF jumps up by $p(x_i)$ at $x_i$.
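As a quick numerical check of this statement, the sketch below compares the jump of the CDF at each support point with the PMF at that point. It assumes the uniform distribution on $1, \ldots, n$ with $n = 10$ used earlier; the names `S` and `jumps` are chosen here for illustration only.

```python
import numpy as np
import scipy.stats

n = 10
u = scipy.stats.randint(1, n + 1)   # uniform on 1, ..., n

S = np.arange(1, n + 1)             # support points x_1, ..., x_n

# The CDF is flat between integers, so F(x_i - 1) equals the value of the
# CDF just to the left of x_i. The difference is the size of the jump at x_i.
jumps = u.cdf(S) - u.cdf(S - 1)

print(np.allclose(jumps, u.pmf(S)))
```

The same comparison should work for any `scipy.stats` distribution supported on consecutive integers.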
 
-+++
++++ {"user_expressions": []}
 
 #### Exercise
 
 Calculate the mean and variance directly from the PMF, using the expressions given above.
 
 Check that your answers agree with `u.mean()` and `u.var()`.
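One possible way to approach this exercise is sketched below. It assumes the standard definitions of the mean, $\sum_i x_i p(x_i)$, and the variance, $\sum_i (x_i - \text{mean})^2 p(x_i)$, and reuses the uniform distribution with $n = 10$; the variable names are illustrative.

```python
import numpy as np
import scipy.stats

n = 10
u = scipy.stats.randint(1, n + 1)   # uniform on 1, ..., n

S = np.arange(1, n + 1)             # support points
p = u.pmf(S)                        # probabilities p(x_i)

mean = np.sum(S * p)                # sum_i x_i p(x_i)
var = np.sum((S - mean) ** 2 * p)   # sum_i (x_i - mean)^2 p(x_i)

print(mean, u.mean())
print(var, u.var())
```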
 
-+++
++++ {"user_expressions": []}
 
 #### Binomial distribution
 
@@ -174,6 +188,8 @@ ax.set_xticks(S)
 plt.show()
 ```
 
++++ {"user_expressions": []}
+
 Here's the CDF
 
 ```{code-cell} ipython3
@@ -185,19 +201,22 @@ ax.set_xticks(S)
 plt.show()
 ```
 
++++ {"user_expressions": []}
+
 #### Exercise
 
 Using `u.pmf`, check that our definition of the CDF given above calculates the same function as `u.cdf`.
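A minimal sketch of one way to do this check follows. The binomial parameters `n = 10` and `theta = 0.5` are assumptions made here for illustration, since the values used in the lecture are not shown in this excerpt. The CDF at each support point is built as a running sum of PMF values.

```python
import numpy as np
import scipy.stats

n, theta = 10, 0.5                    # illustrative parameters (assumed)
u = scipy.stats.binom(n, theta)

S = np.arange(0, n + 1)               # support 0, 1, ..., n

# F(x_i) is the sum of p(x_j) over all x_j <= x_i, i.e. a cumulative sum
cdf_from_pmf = np.cumsum(u.pmf(S))

print(np.allclose(cdf_from_pmf, u.cdf(S)))
```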
 
-+++
++++ {"user_expressions": []}
 
 #### Poisson distribution
 
-+++
+
++++ {"user_expressions": []}
 
 ## Continuous distributions
 
-+++
++++ {"user_expressions": []}
 
 Continuous distributions are represented by a **density function**, which is a function $p$ over $\mathbb R$ (the set of all real numbers) such that $p(x) \geq 0$ for all $x$ and
 
@@ -226,7 +245,7 @@ $$
 = \int_{-\infty}^x p(y) \, dy
 $$
 
-+++
++++ {"user_expressions": []}
 
 #### Normal distribution
 
@@ -252,6 +271,8 @@ u = scipy.stats.norm(μ, σ)
 u.mean(), u.var()
 ```
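To connect the density with the CDF, the sketch below integrates the normal density numerically from $-\infty$ up to a point and compares the result with `u.cdf`. The parameter values, the evaluation point `x = 1.5`, and the use of `scipy.integrate.quad` are choices made here for illustration, not part of the lecture.

```python
import numpy as np
import scipy.stats
from scipy.integrate import quad

mu, sigma = 0.0, 1.0                  # illustrative parameters (assumed)
u = scipy.stats.norm(mu, sigma)

x = 1.5                               # point at which to evaluate the CDF

# quad returns the integral estimate and an estimate of the absolute error
integral, abs_err = quad(u.pdf, -np.inf, x)

print(integral, u.cdf(x))
```

The two printed numbers should agree up to the numerical integration error.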
 
++++ {"user_expressions": []}
+
 Here's a plot of the density --- the famous "bell-shaped curve":
 In this situation, we might refer to the set of their incomes as the "income distribution."
 
 The terminology is confusing because this is not the same thing as a probability distribution --- it's just a collection of numbers.
@@ -314,7 +341,7 @@ Below we explore some observed distributions.
 
 As we'll see below, there are connections between observed distributions, such as the income distribution above, and probability distributions.
 
-+++
++++ {"user_expressions": []}
 
 ### Summary statistics
 
@@ -332,7 +359,7 @@ $$
 \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2
 $$
 
-+++
++++ {"user_expressions": []}
 
 For the income distribution given above, we can calculate these numbers via
 
@@ -344,11 +371,13 @@ x = np.asarray(df['income'])
 x.mean(), x.var()
 ```
 
++++ {"user_expressions": []}
+
 #### Exercise
 
 Check that the formulas given above produce the same numbers.
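A possible check is sketched below. The income values are hypothetical and stand in for the `df['income']` column, which is not shown in this excerpt. The variance uses the $\frac{1}{n}$ formula given above, which matches NumPy's default (`ddof=0`).

```python
import numpy as np

# Hypothetical income data, used only for illustration
x = np.array([4000.0, 6000.0, 4000.0, 9000.0, 7000.0])
n = len(x)

x_bar = np.sum(x) / n                  # sample mean
s2 = np.sum((x - x_bar) ** 2) / n      # sample variance with the 1/n factor

print(x_bar, x.mean())
print(s2, x.var())                     # x.var() also divides by n by default
```

Note that `x.var(ddof=1)` would divide by $n - 1$ instead and give a slightly larger number.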
 
-+++
++++ {"user_expressions": []}
 
 ### Visualization
 
@@ -360,11 +389,11 @@ We will cover
 - kernel density estimates and
 - violin plots
 
-+++
++++ {"user_expressions": []}
 
 #### Histograms
 
-+++
++++ {"user_expressions": []}
 
 We can histogram the income distribution we just constructed as follows
 The match between the histogram and the density is neither very bad nor very good.
 
 One reason is that the normal distribution is not really a good fit for this observed data; we will discuss this point again when we talk about heavy-tailed distributions in TODO add link.
 
-
-+++
++++ {"user_expressions": []}
 
 Of course, if the data really *is* generated by the normal distribution, then the fit will be better.