@@ -22,7 +22,6 @@ import pandas as pd
from math import exp, log
```

-
## Introduction

Consider a situation where a policymaker is trying to estimate how much revenue a proposed wealth tax
@@ -78,15 +77,16 @@ The data is derived from the
[Survey of Consumer Finances](https://en.wikipedia.org/wiki/Survey_of_Consumer_Finances) (SCF).

- The following code imports this data and reads it into an array called `sample`.
+ The following code imports this data and reads it into an array called `sample`.

```{code-cell} ipython3
:tags: [hide-input]
+
url = 'https://media.githubusercontent.com/media/QuantEcon/high_dim_data/update_scf_noweights/SCF_plus/SCF_plus_mini_no_weights.csv'
df = pd.read_csv(url)
df = df.dropna()
df = df[df['year'] == 2016]
- df = df.loc[df['n_wealth'] > 0]  # restricting data to net worth > 0
+ df = df.loc[df['n_wealth'] > 1]  # restricting data to net worth > 1
rv = df['n_wealth'].sample(n=n, random_state=1234)
rv = rv.to_numpy() / 100_000
sample = rv
@@ -157,22 +157,64 @@ ax.hist(ln_sample, density=True, bins=200, histtype='stepfilled', alpha=0.8)
plt.show()
```

+ +++ {"user_expressions": []}
+
Now our job is to obtain the maximum likelihood estimates of $\mu$ and $\sigma$, which
we denote by $\hat{\mu}$ and $\hat{\sigma}$.

These estimates can be found by maximizing the likelihood function given the
data.

- In our case they are
-
- $$
- \hat{\mu} = \frac{\sum_{i=1}^{n} \ln w_i}{n}
- \quad \text{and} \quad
- \hat{\sigma}
- = \left( \frac{\sum_{i=1}^{n}(\ln w_i - \hat{\mu})^2}{n} \right)^{1/2}
- $$
-
- Let's calculate these values
+ The pdf of a lognormally distributed random variable $X$ is given by:
+
+ $$
+ f(x) = \frac{1}{x} \frac{1}{\sigma \sqrt{2\pi}}
+ \exp \left( -\frac{1}{2} \left( \frac{\ln x - \mu}{\sigma} \right)^2 \right)
+ $$
+
+ Since $\ln X$ is normally distributed, this is the same as
+
+ $$
+ f(x) = \frac{1}{x} \phi(\ln x)
+ $$
+
+ where $\phi$ is the pdf of $\ln X$, which is normally distributed with mean $\mu$ and variance $\sigma^2$.
+
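Before moving on, here is a quick numerical check of the density relationship above. This is an editorial sketch rather than part of the derivation: the values `μ_check` and `σ_check` are arbitrary, and it relies on SciPy's convention that `lognorm` takes shape `s` $= \sigma$ and `scale` $= e^{\mu}$.

```{code-cell} ipython3
# Sanity check: the lognormal pdf equals φ(ln x) / x, where φ is the N(μ, σ²) pdf.
# The parameter values below are arbitrary illustrative choices.
import numpy as np
from scipy.stats import lognorm, norm

μ_check, σ_check = 0.5, 1.2
x_grid = np.linspace(0.1, 10, 50)

lhs = lognorm.pdf(x_grid, s=σ_check, scale=np.exp(μ_check))
rhs = norm.pdf(np.log(x_grid), loc=μ_check, scale=σ_check) / x_grid

np.allclose(lhs, rhs)
```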
+ For a sample $x = (x_1, x_2, \ldots, x_n)$, the _likelihood function_ is given by:
+
+ $$
+ \begin{aligned}
+ L(\mu, \sigma | x) & = \prod_{i=1}^{n} f(x_i) \\
+ & = \prod_{i=1}^{n} \frac{1}{x_i} \phi(\ln x_i)
+ \end{aligned}
+ $$
+
+ Taking logs on both sides gives us the _log likelihood function_, which is:
+
+ $$
+ \begin{aligned}
+ \ell(\mu, \sigma | x) & = -\sum_{i=1}^{n} \ln x_i + \sum_{i=1}^{n} \ln \phi(\ln x_i) \\
+ & = -\sum_{i=1}^{n} \ln x_i - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2
+ - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (\ln x_i - \mu)^2
+ \end{aligned}
+ $$
+
+ To find where this function is maximized, we take its partial derivatives with respect to $\mu$ and $\sigma^2$ and set them equal to $0$.
+
+ Let's first find the MLE of $\mu$:
+
+ $$
+ \begin{aligned}
+ \frac{\partial \ell}{\partial \mu} & = \frac{1}{\sigma^2} \sum_{i=1}^{n} (\ln x_i - \mu) = 0 \\
+ \Rightarrow \sum_{i=1}^{n} \ln x_i - n \mu & = 0 \\
+ \Rightarrow \hat{\mu} & = \frac{\sum_{i=1}^{n} \ln x_i}{n}
+ \end{aligned}
+ $$
+
+ Now let's find the MLE of $\sigma$:
+
+ $$
+ \begin{aligned}
+ \frac{\partial \ell}{\partial \sigma^2} & = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (\ln x_i - \mu)^2 = 0 \\
+ \Rightarrow \frac{n}{2\sigma^2} & = \frac{1}{2\sigma^4} \sum_{i=1}^{n} (\ln x_i - \mu)^2 \\
+ \Rightarrow \hat{\sigma} & = \left( \frac{\sum_{i=1}^{n} (\ln x_i - \hat{\mu})^2}{n} \right)^{1/2}
+ \end{aligned}
+ $$
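As a quick check of this derivation (an editorial sketch, not part of the original calculation), we can maximize the log likelihood numerically on a synthetic lognormal sample and compare the result with the closed-form $\hat{\mu}$ and $\hat{\sigma}$. The synthetic data, starting values, and bounds below are arbitrary illustrative choices.

```{code-cell} ipython3
# Maximize the log likelihood numerically and compare with the closed-form MLEs.
# The synthetic sample below is purely illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = np.exp(rng.normal(1.0, 0.5, size=1_000))   # synthetic lognormal draws

def neg_log_likelihood(params, data):
    μ, σ = params
    z = np.log(data)
    # negative of the log likelihood derived above
    return -np.sum(-z - 0.5*np.log(2*np.pi) - np.log(σ) - (z - μ)**2 / (2*σ**2))

res = minimize(neg_log_likelihood, x0=(0.0, 1.0), args=(y,),
               bounds=[(None, None), (1e-6, None)])
res.x, (np.mean(np.log(y)), np.std(np.log(y)))
```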
+
+ Now that we have derived the expressions for $\hat{\mu}$ and $\hat{\sigma}$,
+ let's compute them for our wealth sample.

```{code-cell} ipython3
μ_hat = np.mean(ln_sample)
@@ -200,7 +242,6 @@ ax.legend()
plt.show()
```

-
Our estimated lognormal distribution appears to be a decent fit for the overall data.

We now use {eq}`eq:est_rev` to calculate total revenue.
@@ -268,7 +309,6 @@ tr_pareto

The number is very different!

-
```{code-cell} ipython3
tr_pareto / tr_lognorm
```
@@ -280,7 +320,6 @@ We see that choosing the right distribution is extremely important.
Let's compare the fitted Pareto distribution to the histogram:

```{code-cell} ipython3
-
fig, ax = plt.subplots()
ax.set_xlim(-1, 20)
ax.set_ylim(0, 1.75)
@@ -292,68 +331,11 @@ ax.legend()
plt.show()
```

+ +++ {"user_expressions": []}
+
We observe that in this case the fit for the Pareto distribution is not very
good, so we can probably reject it.

-
-
- ## Exponential distribution
-
- What other distributions could we try?
-
- Suppose we assume that the distribution is [exponential](https://en.wikipedia.org/wiki/Exponential_distribution)
- with parameter $\lambda > 0$.
-
- The maximum likelihood estimate of $\lambda$ is given by
-
- $$
- \hat{\lambda} = \frac{n}{\sum_{i=1}^n w_i}
- $$
-
- Let's calculate it.
-
- ```{code-cell} ipython3
- λ_hat = 1/np.mean(sample)
- λ_hat
- ```
-
- Now let's compute total revenue:
-
- ```{code-cell} ipython3
- dist_exp = expon(scale = 1/λ_hat)
- tr_expo = total_revenue(dist_exp)
- tr_expo
- ```
-
- Again, calculated revenue is very different.
-
- ```{code-cell} ipython3
- tr_expo / tr_lognorm
- ```
-
- But once again, when we plot the fitted distribution against the data we see it
- is a bad fit.
-
-
- ```{code-cell} ipython3
-
- fig, ax = plt.subplots()
- ax.set_xlim(-1, 20)
-
- ax.hist(sample, density=True, bins=5000, histtype='stepfilled', alpha=0.5)
- ax.plot(x, dist_exp.pdf(x), 'k-', lw=0.5, label='exponential pdf')
- ax.legend()
-
- plt.show()
- ```
-
- So we can reject this calculation.
-
## What is the best distribution?

There is no "best" distribution --- every choice we make is an assumption.
@@ -370,6 +352,7 @@ We set an arbitrary threshold of $500,000 and read the data into `sample_tail`.

```{code-cell} ipython3
:tags: [hide-input]
+
df_tail = df.loc[df['n_wealth'] > 500_000]
df_tail.head()
rv_tail = df_tail['n_wealth'].sample(n=10_000, random_state=4321)
@@ -410,7 +393,6 @@ ax.legend()
plt.show()
```

-
While the lognormal distribution was a good fit for the entire dataset,
it is not a good fit for the right hand tail.

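To make this concrete, here is a small editorial sketch (not from the original lecture) that compares the share of observations above the $500,000 cutoff with the tail probability implied by the fitted lognormal. The names `μ_fit` and `σ_fit` simply recompute the MLEs so the cell is self-contained, and the threshold of 5 reflects the earlier rescaling of wealth by 100,000.

```{code-cell} ipython3
# Compare the empirical tail probability with the fitted lognormal's tail probability.
# threshold = 5 corresponds to $500,000 after dividing wealth by 100,000.
import numpy as np
from scipy.stats import lognorm

μ_fit = np.mean(np.log(sample))
σ_fit = np.std(np.log(sample))
fitted_lognorm = lognorm(σ_fit, scale=np.exp(μ_fit))

threshold = 5
empirical_tail = np.mean(sample > threshold)
implied_tail = 1 - fitted_lognorm.cdf(threshold)
empirical_tail, implied_tail
```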
@@ -435,6 +417,7 @@ ax.plot(x, dist_pareto_tail.pdf(x), 'k-', lw=0.5, label='pareto pdf')
plt.show()
```

+ +++ {"user_expressions": []}

The Pareto distribution is a better fit for the right hand tail of our dataset.

@@ -451,3 +434,73 @@ There are other more rigorous tests, such as the [Kolmogorov-Smirnov test](https

We omit the details.

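As a rough illustrative sketch of how such a test could be run, one can pass the tail sample and a fitted distribution's cdf to `scipy.stats.kstest`. The cell below assumes the `sample_tail` array and the fitted `dist_pareto_tail` object defined above are still in scope.

```{code-cell} ipython3
# One-sample Kolmogorov-Smirnov test of the tail data against the fitted Pareto.
# Note: the reported p-value is only approximate here, because the Pareto
# parameters were estimated from the same data being tested.
from scipy.stats import kstest

kstest(sample_tail, dist_pareto_tail.cdf)
```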
+ ## Exercises
+
+ ```{exercise-start}
+ :label: mle_ex1
+ ```
+
+ Suppose we assume wealth is [exponentially](https://en.wikipedia.org/wiki/Exponential_distribution)
+ distributed with parameter $\lambda > 0$.
+
+ The maximum likelihood estimate of $\lambda$ is given by
+
+ $$
+ \hat{\lambda} = \frac{n}{\sum_{i=1}^n w_i}
+ $$
+
+ 1. Compute $\hat{\lambda}$ for our initial sample.
+ 2. Use $\hat{\lambda}$ to find the total revenue.
+
+ ```{exercise-end}
+ ```
+
+ ```{solution-start} mle_ex1
+ :class: dropdown
+ ```
+
+ ```{code-cell} ipython3
+ λ_hat = 1/np.mean(sample)
+ λ_hat
+ ```
+
+ ```{code-cell} ipython3
+ dist_exp = expon(scale=1/λ_hat)
+ tr_expo = total_revenue(dist_exp)
+ tr_expo
+ ```
+
+ +++ {"user_expressions": []}
+
+ ```{solution-end}
+ ```
+
+ ```{exercise-start}
+ :label: mle_ex2
+ ```
+
+ Plot the exponential distribution against the sample and check if it is a good fit or not.
+
+ ```{exercise-end}
+ ```
+
+ ```{solution-start} mle_ex2
+ :class: dropdown
+ ```
+
+ ```{code-cell} ipython3
+ fig, ax = plt.subplots()
+ ax.set_xlim(-1, 20)
+
+ ax.hist(sample, density=True, bins=5000, histtype='stepfilled', alpha=0.5)
+ ax.plot(x, dist_exp.pdf(x), 'k-', lw=0.5, label='exponential pdf')
+ ax.legend()
+
+ plt.show()
+ ```
+
+ +++ {"user_expressions": []}
+
+ Clearly, this distribution is not a good fit for our data.
+
+ ```{solution-end}
+ ```