Commit 47a2a3d

misc

1 parent d11b1c7 commit 47a2a3d
File tree

1 file changed: +57, -47 lines

lectures/mle.md

Lines changed: 57 additions & 47 deletions
@@ -24,8 +24,8 @@ from math import exp, log
 
 ## Introduction
 
-Consider a situation where a policymaker is trying to estimate how much revenue a proposed wealth tax
-will raise.
+Consider a situation where a policymaker is trying to estimate how much revenue
+a proposed wealth tax will raise.
 
 The proposed tax is
 
@@ -39,12 +39,15 @@
 
 where $w$ is wealth.
 
-For example, if $a = 0.05$, $b = 0.1$, and $\bar w = 2.5$, this means a
-5% tax on wealth up to 2.5 and 10% tax on wealth in excess of 2.5.
 
-(The units will be 100,000 so 2.5 means 250,000 dollars.)
+For example, if $a = 0.05$, $b = 0.1$, and $\bar w = 2.5$, this means
 
-Here we define $h$
+* a 5% tax on wealth up to 2.5 and
+* a 10% tax on wealth in excess of 2.5.
+
+The unit is 100,000, so $w = 2.5$ means 250,000 dollars.
+
+Let's go ahead and define $h$:
 
 ```{code-cell} ipython3
 def h(w, a=0.05, b=0.1, w_bar=2.5):
@@ -54,18 +57,21 @@ def h(w, a=0.05, b=0.1, w_bar=2.5):
     return a * w_bar + b * (w - w_bar)
 ```
 
-For a population of size $N$, where individual $i$ has wealth $w_i$, the total revenue will be given by
+For a population of size $N$, where individual $i$ has wealth $w_i$, total revenue raised by
+the tax will be
 
 $$
 T = \sum_{i=1}^{N} h(w_i)
 $$
 
-However, in most countries wealth is not observed for all individuals.
+We wish to calculate this quantity.
+
+The problem we face is that, in most countries, wealth is not observed for all individuals.
 
 Collecting and maintaining accurate wealth data for all individuals or households in a country
 is just too hard.
 
-So let's suppose instead that we obtain a sample $w_1, w_2, \cdots, w_n$ telling us the wealth of $n$ individuals.
+So let's suppose instead that we obtain a sample $w_1, w_2, \cdots, w_n$ telling us the wealth of $n$ randomly selected individuals.
 
 For our exercise we are going to use a sample of $n = 10,000$ observations from wealth data in the US in 2016.
 
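The revenue formula in the hunk above is easy to sketch in code. Note that only the second branch of `h` appears in the diff, so the below-threshold branch is reconstructed from the formula for the tax; the sample is simulated, standing in for the US 2016 data, which is not included in this diff.

```python
import numpy as np

def h(w, a=0.05, b=0.1, w_bar=2.5):
    # tax rate a up to the threshold w_bar, rate b above it
    # (the first branch is an assumption; only the second appears in the hunk)
    if w <= w_bar:
        return a * w
    return a * w_bar + b * (w - w_bar)

# hypothetical stand-in for the n = 10,000 US wealth observations
rng = np.random.default_rng(1234)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# total revenue T = sum_i h(w_i) over the sampled individuals
T = sum(h(w) for w in sample)
```

With $a = 0.05$ and $\bar w = 2.5$, a wealth of 2.0 pays 0.10 and a wealth of 3.5 pays $0.05 \times 2.5 + 0.1 \times 1.0 = 0.225$, matching the two bullet points above.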
@@ -140,15 +146,15 @@ Maximum likelihood estimation has two steps:
 2. Estimate the parameter values (e.g., estimate $\mu$ and $\sigma$ for the
    normal distribution)
 
-One reasonable assumption for the wealth is that each
-that $w_i$ is [log-normally distributed](https://en.wikipedia.org/wiki/Log-normal_distribution),
+One possible assumption for the wealth is that each
+$w_i$ is [log-normally distributed](https://en.wikipedia.org/wiki/Log-normal_distribution),
 with parameters $\mu \in (-\infty,\infty)$ and $\sigma \in (0,\infty)$.
 
-This means that $\ln w_i$ is normally distributed with mean $\mu$ and
-standard deviation $\sigma$.
+(This means that $\ln w_i$ is normally distributed with mean $\mu$ and standard deviation $\sigma$.)
 
-You can see that this is a reasonable assumption because if we histogram log wealth
-instead of wealth the picture starts to look something like a bell-shaped curve.
+You can see that this assumption is not completely unreasonable because, if we
+histogram log wealth instead of wealth, the picture starts to look something
+like a bell-shaped curve.
 
 ```{code-cell} ipython3
 ln_sample = np.log(sample)
@@ -157,59 +163,69 @@ ax.hist(ln_sample, density=True, bins=200, histtype='stepfilled', alpha=0.8)
 plt.show()
 ```
 
-+++ {"user_expressions": []}
-
 Now our job is to obtain the maximum likelihood estimates of $\mu$ and $\sigma$, which
 we denote by $\hat{\mu}$ and $\hat{\sigma}$.
 
 These estimates can be found by maximizing the likelihood function given the
 data.
 
 The pdf of a lognormally distributed random variable $X$ is given by:
-$$
-f(x) = \frac{1}{x}\frac{1}{\sigma \sqrt{2\pi}} exp\left(\frac{-1}{2}\left(\frac{\ln x-\mu}{\sigma}\right)\right)^2
-$$
 
-Since $\ln X$ is normally distributed this is the same as
 $$
-f(x) = \frac{1}{x} \phi(x)
+f(x, \mu, \sigma)
+= \frac{1}{x}\frac{1}{\sigma \sqrt{2\pi}}
+\exp\left( -\frac{1}{2} \left( \frac{\ln x-\mu}{\sigma} \right)^2 \right)
 $$
-where $\phi$ is the pdf of $\ln X$ which is normally distibuted with mean $\mu$ and variance $\sigma ^2$.
 
-For a sample $x = (x_1, x_2, \cdots, x_n)$ the _likelihood function_ is given by:
+For our sample $w_1, w_2, \cdots, w_n$, the [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function) is given by
+
 $$
 \begin{aligned}
-L(\mu, \sigma | x_i) = \prod_{i=1}^{n} f(\mu, \sigma | x_i) \\
-L(\mu, \sigma | x_i) = \prod_{i=1}^{n} \frac{1}{x_i} \phi(\ln x_i)
+L(\mu, \sigma | w_i) = \prod_{i=1}^{n} f(w_i, \mu, \sigma)
 \end{aligned}
 $$
 
-Taking $\log$ on both sides gives us the _log likelihood function_ which is:
+The likelihood function can be viewed as both
+
+* the joint distribution of the sample (which is assumed to be IID) and
+* the "likelihood" of parameters $(\mu, \sigma)$ given the data.
+
+Taking logs on both sides gives us the log likelihood function, which is
+
 $$
 \begin{aligned}
-l(\mu, \sigma | x_i) = -\sum_{i=1}^{n} \ln x_i + \sum_{i=1}^n \phi(\ln x_i) \\
-l(\mu, \sigma | x_i) = -\sum_{i=1}^{n} \ln x_i - \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2}
-\sum_{i=1}^n (\ln x_i - \mu)^2
+\ell(\mu, \sigma | w_i)
+& = \ln \left[ \prod_{i=1}^{n} f(w_i, \mu, \sigma) \right] \\
+& = -\sum_{i=1}^{n} \ln w_i
+- \frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2}
+\sum_{i=1}^n (\ln w_i - \mu)^2
 \end{aligned}
 $$
 
 To find where this function is maximised we find its partial derivatives wrt $\mu$ and $\sigma ^2$ and equate them to $0$.
 
-Let's first find the MLE of $\mu$,
+Let's first find the maximum likelihood estimate (MLE) of $\mu$
+
 $$
 \begin{aligned}
-\frac{\delta l}{\delta \mu} = - \frac{1}{2\sigma^2} \times 2 \sum_{i=1}^n (\ln x_i - \mu) = 0 \\
-\Rightarrow \sum_{i=1}^n \ln x_i - n \mu = 0 \\
-\Rightarrow \hat{\mu} = \frac{\sum_{i=1}^n \ln x_i}{n}
+\frac{\partial \ell}{\partial \mu}
+= \frac{1}{\sigma^2} \sum_{i=1}^n (\ln w_i - \mu) = 0 \\
+\implies \sum_{i=1}^n \ln w_i - n \mu = 0 \\
+\implies \hat{\mu} = \frac{\sum_{i=1}^n \ln w_i}{n}
 \end{aligned}
 $$
 
-Now let's find the MLE of $\sigma$,
+Now let's find the MLE of $\sigma$
+
 $$
 \begin{aligned}
-\frac{\delta l}{\delta \sigma^2} = - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (\ln x_i - \mu)^2 = 0 \\
-\Rightarrow \frac{n}{2\sigma^2} = \frac{1}{2\sigma^4} \sum_{i=1}^n (\ln x_i - \mu)^2 \\
-\Rightarrow \hat{\sigma} = \left( \frac{\sum_{i=1}^{n}(\ln x_i - \hat{\mu})^2}{n} \right)^{1/2}
+\frac{\partial \ell}{\partial \sigma^2}
+= - \frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}
+\sum_{i=1}^n (\ln w_i - \mu)^2 = 0 \\
+\implies \frac{n}{2\sigma^2} =
+\frac{1}{2\sigma^4} \sum_{i=1}^n (\ln w_i - \mu)^2 \\
+\implies \hat{\sigma} =
+\left( \frac{\sum_{i=1}^{n}(\ln w_i - \hat{\mu})^2}{n} \right)^{1/2}
 \end{aligned}
 $$
 
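The closed-form estimators derived in the hunk above translate directly to code. A minimal sketch, with simulated data standing in for the wealth sample (true parameters $\mu = 0.5$, $\sigma = 1.2$):

```python
import numpy as np

# simulated stand-in for the wealth sample
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.5, sigma=1.2, size=10_000)

ln_sample = np.log(sample)

# mu_hat = (1/n) * sum(ln w_i)
mu_hat = np.mean(ln_sample)

# sigma_hat = sqrt( (1/n) * sum( (ln w_i - mu_hat)^2 ) )
sigma_hat = np.sqrt(np.mean((ln_sample - mu_hat) ** 2))
```

With $n = 10{,}000$ observations, both estimates land close to the true parameter values used to generate the data.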
@@ -242,7 +258,7 @@ ax.legend()
 plt.show()
 ```
 
-Our estimated lognormal distribution appears to be a decent fit for the overall data.
+Our estimated lognormal distribution appears to be a reasonable fit for the overall data.
 
 We now use {eq}`eq:est_rev` to calculate total revenue.
 
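The definition of {eq}`eq:est_rev` is outside this diff, but one natural way to estimate total revenue from a fitted distribution is Monte Carlo: draw from the fitted lognormal and average `h`. The parameter values and population size below are illustrative assumptions, not the lecture's estimates.

```python
import numpy as np

def h(w, a=0.05, b=0.1, w_bar=2.5):
    # wealth tax schedule from the lecture
    # (the below-threshold branch is reconstructed; only the other appears above)
    if w <= w_bar:
        return a * w
    return a * w_bar + b * (w - w_bar)

# hypothetical fitted parameters and population size, for illustration only
mu_hat, sigma_hat, N = 0.3, 1.0, 1_000_000

# Monte Carlo estimate of T ~= N * E[h(w)] under the fitted lognormal
rng = np.random.default_rng(42)
draws = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=100_000)
T_hat = N * np.mean([h(w) for w in draws])
```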
@@ -331,8 +347,6 @@ ax.legend()
 plt.show()
 ```
 
-+++ {"user_expressions": []}
-
 We observe that in this case the fit for the Pareto distribution is not very
 good, so we can probably reject it.
 
@@ -417,8 +431,6 @@ ax.plot(x, dist_pareto_tail.pdf(x), 'k-', lw=0.5, label='pareto pdf')
 plt.show()
 ```
 
-+++ {"user_expressions": []}
-
 The Pareto distribution is a better fit for the right hand tail of our dataset.
 
 ### So what is the best distribution?
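The tail-fit step behind `dist_pareto_tail` is not shown in this diff, but fitting a Pareto to the right tail might look like the sketch below. The data, the 90th-percentile threshold, and the decision to pin `loc`/`scale` are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

# stand-in heavy-tailed data: classic Pareto, tail index 3, scale 1
rng = np.random.default_rng(3)
sample = rng.pareto(3.0, size=10_000) + 1.0

# keep only the top 10% of observations as the "right tail"
threshold = np.quantile(sample, 0.9)
tail = sample[sample >= threshold]

# fix location at 0 and scale at the threshold, so only the tail index b is fitted
b, loc, scale = stats.pareto.fit(tail, floc=0, fscale=threshold)
```

A classic Pareto conditioned on exceeding a threshold is again Pareto with the same tail index, so the fitted `b` should recover the index used to generate the data.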
@@ -432,7 +444,8 @@ One test is to plot the data against the fitted distribution, as we did.
 
 There are other more rigorous tests, such as the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
 
-We omit the details.
+We omit such advanced topics (but encourage readers to study them once
+they have completed these lectures).
 
 ## Exercises
 
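The Kolmogorov-Smirnov test mentioned above is available in SciPy. A sketch with simulated data standing in for the wealth sample (note the caveat that estimating parameters from the same data makes the standard KS p-values only approximate):

```python
import numpy as np
from scipy import stats

# stand-in data; in the lecture this would be the wealth sample
rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=2_000)

# fit a lognormal with location pinned at zero, then test the fit
shape, loc, scale = stats.lognorm.fit(sample, floc=0)
result = stats.kstest(sample, 'lognorm', args=(shape, loc, scale))

# result.statistic is the largest gap between empirical and fitted CDFs;
# a small statistic / large p-value gives no evidence against the lognormal
```

Since the stand-in data really is lognormal, the KS statistic comes out small here; on real wealth data a large statistic would be grounds to reject the candidate distribution.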
@@ -469,8 +482,6 @@ tr_expo = total_revenue(dist_exp)
 tr_expo
 ```
 
-+++ {"user_expressions": []}
-
 ```{solution-end}
 ```
 
@@ -498,7 +509,6 @@ ax.legend()
 plt.show()
 ```
 
-+++ {"user_expressions": []}
 
 Clearly, this distribution is not a good fit for our data.
 