lectures/prob_dist.md — 28 additions, 19 deletions
@@ -15,6 +15,13 @@ kernelspec:

# Distributions and Probabilities

+```{index} single: Distributions and Probabilities
+```
+
+```{contents} Contents
+:depth: 2
+```
+
## Outline

In this lecture we give a quick introduction to data and probability distributions using Python
@@ -162,12 +169,12 @@ Check that your answers agree with `u.mean()` and `u.var()`.

Another useful (and more interesting) distribution is the **binomial distribution** on $S=\{0, \ldots, n\}$, which has PMF

$$
-p(i) = \binom{i}{n} \theta^i (1-\theta)^{n-i}
+p(i) = \binom{n}{i} \theta^i (1-\theta)^{n-i}
$$

Here $\theta \in [0,1]$ is a parameter.

-The interpretatin of $p(i)$ is: the number of successes in $n$ independent trials with success probability $\theta$.
+The interpretation of $p(i)$ is: the probability of $i$ successes in $n$ independent trials, each with success probability $\theta$.

(If $\theta=0.5$, this is "how many heads in $n$ flips of a fair coin")
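The corrected PMF is easy to sanity-check numerically. The sketch below is an editor's illustration (not part of the patch): it recomputes the PMF by hand with the corrected binomial coefficient $\binom{n}{i}$ and compares it to `scipy.stats.binom`.

```python
# Editor's sketch (not in the patch): check the corrected binomial PMF
# p(i) = C(n, i) θ^i (1-θ)^(n-i) against scipy.stats.binom.
from math import comb
import scipy.stats

n, θ = 10, 0.3
u = scipy.stats.binom(n, θ)

# hand-computed PMF using the corrected coefficient orientation C(n, i)
by_hand = [comb(n, i) * θ**i * (1 - θ)**(n - i) for i in range(n + 1)]
max_gap = max(abs(u.pmf(i) - by_hand[i]) for i in range(n + 1))
total = sum(by_hand)  # should be 1 by the binomial theorem
```

With the pre-fix coefficient $\binom{i}{n}$, the hand-computed PMF would be zero for all $i < n$, so this check also demonstrates why the original line was wrong.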
@@ -272,7 +279,7 @@ plt.show()

Continuous distributions are represented by a **density function**, which is a function $p$ over $\mathbb R$ (the set of all numbers) such that $p(x) \geq 0$ for all $x$ and

-$$ \int_{-\infty}^\infty p(x) = 1 $$
+$$ \int_{-\infty}^\infty p(x) dx = 1 $$

We say that random variable $X$ has distribution $p$ if

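The corrected normalization condition (with the `dx`) can be verified numerically for concrete densities. This is an editor's sketch, not lecture code, using `scipy.integrate.quad`:

```python
# Editor's sketch: a density must integrate to one over its support.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, expon

norm_total, _ = quad(norm.pdf, -np.inf, np.inf)  # ∫ p(x) dx for N(0, 1)
expon_total, _ = quad(expon.pdf, 0, np.inf)      # exponential density lives on [0, ∞)
```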
@@ -294,14 +301,14 @@ The **cumulative distribution function** (CDF) of $X$ is defined by

$$
F(x) = \mathbb P\{X \leq x\}
-= \int_{-\infty}^y p(y) dy
+= \int_{-\infty}^x p(y) dy
$$

+++ {"user_expressions": []}

#### Normal distribution

-Perhaps the most famous distribution is the **normal distribution**, which as density
+Perhaps the most famous distribution is the **normal distribution**, which has density

$$
p(x) = \frac{1}{\sqrt{2\pi}\sigma}
@@ -312,7 +319,7 @@ This distribution has two parameters, $\mu$ and $\sigma$.

It can be shown that, for this distribution, the mean is $\mu$ and the variance is $\sigma^2$.

-We can obtain the moments, PDF, CDF of the normal density via SciPy as follows:
+We can obtain the moments, PDF, and CDF of the normal density as follows:

```{code-cell} ipython3
μ, σ = 0.0, 1.0
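The CDF fixed earlier in this patch satisfies $F'(x) = p(x)$, which the normal distribution makes easy to confirm numerically. An editor's sketch, not part of the lecture:

```python
# Editor's sketch: a finite-difference derivative of the normal CDF
# should recover the PDF, since F(x) = ∫_{-∞}^x p(y) dy.
import numpy as np
import scipy.stats

u = scipy.stats.norm(0.0, 1.0)
x = np.linspace(-3, 3, 61)
h = 1e-5
dF = (u.cdf(x + h) - u.cdf(x - h)) / (2 * h)  # central difference
max_gap = np.max(np.abs(dF - u.pdf(x)))
```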
@@ -376,7 +383,7 @@ It has a nice interpretation: if $X$ is lognormally distributed, then $\log X$ i

It is often used to model variables that are "multiplicative" in nature, such as income or asset prices.

-We can obtain the moments, PDF, CDF of the normal density via SciPy as follows:
+We can obtain the moments, PDF, and CDF of the lognormal density as follows:

```{code-cell} ipython3
μ, σ = 0.0, 1.0
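The interpretation above (the log of a lognormal variable is normal) can be checked by simulation. This is an editor's sketch with an arbitrary seed, not part of the patch:

```python
# Editor's sketch: if X ~ LogNormal(μ, σ), then log X ~ N(μ, σ²).
import numpy as np
import scipy.stats

μ, σ = 0.0, 1.0
u = scipy.stats.lognorm(σ, scale=np.exp(μ))  # SciPy's parameterization
logs = np.log(u.rvs(100_000, random_state=1234))
sample_mean, sample_std = logs.mean(), logs.std()  # ≈ μ and σ
```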
@@ -390,10 +397,9 @@ u.mean(), u.var()

```{code-cell} ipython3
μ_vals = [-1, 0, 1]
σ_vals = [0.25, 0.5, 1]
-fig, ax = plt.subplots()
-
x_grid = np.linspace(0, 3, 200)

+fig, ax = plt.subplots()
for μ, σ in zip(μ_vals, σ_vals):
    u = scipy.stats.lognorm(σ, scale=np.exp(μ))
    ax.plot(x_grid, u.pdf(x_grid),
@@ -432,7 +438,7 @@ It is related to the Poisson distribution as it describes the distribution of th

It can be shown that, for this distribution, the mean is $1/\lambda$ and the variance is $1/\lambda^2$.

-We can obtain the moments, PDF, CDF of the normal density via SciPy as follows:
+We can obtain the moments, PDF, and CDF of the exponential density as follows:

```{code-cell} ipython3
λ = 1.0
@@ -446,6 +452,8 @@ u.mean(), u.var()

```{code-cell} ipython3
fig, ax = plt.subplots()
λ_vals = [0.5, 1, 2]
+x_grid = np.linspace(0, 6, 200)
+
for λ in λ_vals:
    u = scipy.stats.expon(scale=1/λ)
    ax.plot(x_grid, u.pdf(x_grid),
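For each $\lambda$ used in the plot above, the stated moments (mean $1/\lambda$, variance $1/\lambda^2$) can be confirmed against SciPy. An editor's sketch:

```python
# Editor's sketch: exponential mean and variance are 1/λ and 1/λ².
import scipy.stats

moments = {}
for λ in [0.5, 1, 2]:
    u = scipy.stats.expon(scale=1/λ)  # SciPy parameterizes by scale = 1/λ
    moments[λ] = (u.mean(), u.var())
```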
@@ -486,12 +494,13 @@ For example, if $\alpha = \beta = 1$, then the beta distribution is uniform on $

While, if $\alpha = 3$ and $\beta = 2$, then the beta distribution is located more towards 1 as there are more successes than failures.

-It can be shown that, for this distribution, the mean is $\alpha / (\alpha + \beta)$ and the variance is $\alpha \beta / (\alpha + \beta)^2 (\alpha + \beta + 1)$.
+It can be shown that, for this distribution, the mean is $\alpha / (\alpha + \beta)$ and
+the variance is $\alpha \beta / [(\alpha + \beta)^2 (\alpha + \beta + 1)]$.

-We can obtain the moments, PDF, CDF of the normal density via SciPy as follows:
+We can obtain the moments, PDF, and CDF of the beta density as follows:

```{code-cell} ipython3
-α, β = 1.0, 1.0
+α, β = 3.0, 1.0
u = scipy.stats.beta(α, β)
```
@@ -500,8 +509,8 @@ u.mean(), u.var()
```

```{code-cell} ipython3
-α_vals = [0.5, 1, 50, 250, 3]
-β_vals = [3, 1, 100, 200, 1]
+α_vals = [0.5, 1, 5, 25, 3]
+β_vals = [3, 1, 10, 20, 0.5]
x_grid = np.linspace(0, 1, 200)

fig, ax = plt.subplots()
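The beta moment formulas quoted above can be checked against SciPy for the $\alpha = 3$, $\beta = 2$ example mentioned in the text. An editor's sketch, not lecture code:

```python
# Editor's sketch: beta mean α/(α+β) and variance αβ/[(α+β)²(α+β+1)].
import scipy.stats

α, β = 3.0, 2.0
u = scipy.stats.beta(α, β)
mean_formula = α / (α + β)                        # 3/5 = 0.6
var_formula = α * β / ((α + β)**2 * (α + β + 1))  # 6/150 = 0.04
```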
@@ -541,10 +550,10 @@ It can be shown that, for this distribution, the mean is $\alpha / \beta$ and th

One interpretation is that if $X$ is gamma distributed, then $X$ is the sum of $\alpha$ independent exponentially distributed random variables with mean $1/\beta$.

-We can obtain the moments, PDF, CDF of the normal density via SciPy as follows:
+We can obtain the moments, PDF, and CDF of the gamma density as follows:

```{code-cell} ipython3
-α, β = 1.0, 1.0
+α, β = 3.0, 2.0
u = scipy.stats.gamma(α, scale=1/β)
```
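The sum-of-exponentials interpretation holds for integer $\alpha$ and can be checked by simulation. This is an editor's sketch with arbitrary seeds, not part of the patch:

```python
# Editor's sketch: for integer α, Gamma(α, β) is distributed as the sum
# of α independent Exponential(β) variables; compare sample means.
import numpy as np
import scipy.stats

α, β = 3, 2.0
gamma_draws = scipy.stats.gamma(α, scale=1/β).rvs(100_000, random_state=1)
exp_draws = scipy.stats.expon(scale=1/β).rvs((100_000, α), random_state=2)
gamma_mean = gamma_draws.mean()          # ≈ α/β = 1.5
sum_mean = exp_draws.sum(axis=1).mean()  # ≈ α/β as well
```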
@@ -742,7 +751,7 @@ plt.show()

When we use a larger bandwidth, the KDE is smoother.

-A suitable bandwith is the one that is not too smooth (underfitting) or too wiggly (overfitting).
+A suitable bandwidth yields an estimate that is neither too smooth (underfitting) nor too wiggly (overfitting).

#### Violin plots
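To make the bandwidth trade-off concrete: the diff does not show which KDE tool the lecture uses, so the sketch below is an editor's assumption using `scipy.stats.gaussian_kde`, whose `bw_method` argument scales the bandwidth. A smaller value produces a wigglier estimate, here measured by total variation:

```python
# Editor's sketch: small vs large bandwidth in a Gaussian KDE.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
data = rng.normal(size=500)
x_grid = np.linspace(-4, 4, 200)

smooth = gaussian_kde(data, bw_method=1.0)(x_grid)   # large bandwidth
wiggly = gaussian_kde(data, bw_method=0.05)(x_grid)  # small bandwidth

# total variation of each estimate measures its "wiggliness"
tv_smooth = np.abs(np.diff(smooth)).sum()
tv_wiggly = np.abs(np.diff(wiggly)).sum()
```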
@@ -813,7 +822,7 @@ plt.show()

The match between the histogram and the density is not very bad but also not very good.

-One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about heavy tailed distributions in TODO add link.
+One reason is that the normal distribution is not really a good fit for this observed data --- we will discuss this point again when we talk about {ref}`heavy tailed distributions<heavy_tail>`.