
Update Exercises 2.7 and 2.8 to 2nd edition version #14


Open · wants to merge 2 commits into base: master
11 changes: 4 additions & 7 deletions Chapter 2/Exercise 2.6.md
@@ -2,10 +2,10 @@

## Question:
The results shown in Figure 2.3 should be quite reliable because they
are averages over 2000 individual, randomly chosen 10-armed bandit
tasks. Why, then, are there oscillations and spikes in the early part
of the curve for the optimistic method? In other words, what might
make this method perform particularly better or worse, on average,
on particular early steps?

## Answer:
@@ -16,6 +16,3 @@ and has to try each multiple times before it realises they're actually
bad. This will depend on how quickly these bad options drop below the
best option (once that occurs, only occasional exploration continues to
reduce them toward their correct values, at a much slower rate).
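
The decay argument is easy to see numerically. As a rough illustration (not part of the committed answer), the sketch below assumes the optimistic setup used for Figure 2.3 in the book: constant step size _α_ = 0.1, initial estimate _Q₁_ = +5, and a true action value near 0 with unit-variance reward noise.

```python
# Rough illustration (not from the original answer): decay of one optimistic
# estimate under the constant step-size update Q <- Q + alpha * (R - Q),
# with alpha = 0.1, Q1 = +5, and rewards drawn around a true value of 0.
import random

alpha, q, true_value = 0.1, 5.0, 0.0
for n in range(1, 11):
    reward = random.gauss(true_value, 1.0)  # unit-variance reward noise
    q += alpha * (reward - q)               # constant step-size update
    print(f"after selection {n}: Q = {q:.2f}")
```

With _α_ = 0.1 the optimistic offset shrinks by only 10% per selection, so each action has to be tried repeatedly before its estimate approaches its true value, which is consistent with the answer above.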



24 changes: 19 additions & 5 deletions Chapter 2/Exercise 2.7.md
@@ -1,11 +1,25 @@
# Exercise 2.7

## Question:
Show that in the case of two actions, the soft-max distribution is the same as
that given by the logistic, or sigmoid, function often used in statistics and
artificial neural networks.
In most of this chapter we have used sample averages to estimate action values
because sample averages do not produce the initial bias that constant step sizes
do (see the analysis leading to (2.6)). However, sample averages are not a
completely satisfactory solution because they may perform poorly on
nonstationary problems. Is it possible to avoid the bias of constant step sizes
while retaining their advantages on nonstationary problems? One way is to use a
step size of

## Answer:
![Exercise 2.7 beta equation](images/Exercise_2_7_beta.png)

to process the nth reward for a particular action, where _α_ > 0 is a
conventional constant step size, and _ōₙ_ is a trace of one that starts at 0:

![Exercise 2.7 o-bar equation](images/Exercise_2_7_o_bar.png)

![Exercise 2.7 solution](images/Exercise_2_7.png)
Carry out an analysis like that in (2.6) to show that _Qₙ_ is an exponential
recency-weighted average _without initial bias_.

## Answer:

![Exercise 2.7 solution page 1](images/Exercise_2_7-1.png)
![Exercise 2.7 solution page 2](images/Exercise_2_7-2.png)
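
The pages above are purely algebraic. As an informal numerical cross-check (not part of the committed answer), the sketch below runs the incremental update with _βₙ_ = _α_/_ōₙ_ and compares it with the closed form derived in the LaTeX source further down; _α_ = 0.1, the reward sequence, and the deliberately absurd initial estimate are arbitrary choices for illustration.

```python
# Sketch of the unbiased constant-step-size trick: beta_n = alpha / o_bar_n.
# alpha, the rewards, and the initial estimate are arbitrary illustrative values.
alpha = 0.1
rewards = [1.0, 0.0, 2.0, -1.0, 0.5]

q, o_bar = 123.0, 0.0                 # absurd Q1 to show the initial bias vanishes
for r in rewards:
    o_bar += alpha * (1.0 - o_bar)    # trace of one: o_bar_n = o_bar_{n-1} + alpha(1 - o_bar_{n-1})
    q += (alpha / o_bar) * (r - q)    # Q_{n+1} = Q_n + beta_n (R_n - Q_n)

# Closed form from the derivation: Q_{n+1} = (1 / o_bar_n) * sum_i alpha (1-alpha)^(n-i) R_i
n = len(rewards)
closed = sum(alpha * (1 - alpha) ** (n - i) * r
             for i, r in enumerate(rewards, start=1)) / o_bar
print(q, closed)                      # identical up to float error; Q1 never appears
```

Note that _β₁_ = _α_/_ō₁_ = 1, so the very first update overwrites _Q₁_ completely, which is exactly where the absence of initial bias comes from.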
43 changes: 15 additions & 28 deletions Chapter 2/Exercise 2.8.md
@@ -1,33 +1,20 @@
# Exercise 2.8

## Question:
Suppose you face a 2-armed bandit task whose true action values change randomly from time step
to time step. Specifically, suppose that, for any time step, the true values of action 1 and 2
are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5
(case B). If you are not able to tell which case you face at any step, what is the best expectation
of success you can achieve and how should you behave to achieve it? Now suppose that on each step
you are told whether you are facing case A or case B (although you still don't know the true action
values). This is an associative search task. What is the best expectation of success you can
achieve in this task, and how should you behave to achieve it?
In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the
11th step. Why is this? Note that for your answer to be fully satisfactory it
must explain both why the reward increases on the 11th step and why it
decreases on the subsequent steps. Hint: if _c_ = 1, then the spike is less
prominent.

## Answer:

For the first scenario, you cannot hold separate estimates for cases A and B. Therefore, the best
approach is to select the action with the best value estimate averaged over both cases. Here the
expected values of both actions are the same, so the best expectation of success is 0.5, and it can be
achieved by selecting either action (for example at random) at each step.


A<sub>1</sub> = 0.5 \* 0.1 + 0.5 \* 0.9 = 0.5

A<sub>2</sub> = 0.5 \* 0.2 + 0.5 \* 0.8 = 0.5

For the second scenario, you can hold independent estimates for cases A and B, so you can learn the
best action for each one by treating them as two independent bandit problems. The best expectation
of success is 0.55, obtained by selecting A<sub>2</sub> in case A and A<sub>1</sub> in case B.

0.5 \* 0.2 + 0.5 \* 0.9 = 0.55




Based on the definition of the UCB algorithm, the agent is encouraged to
explore. Remember, any previously unexplored action is by definition a
maximizing action, so the agent will have explored all of the 10 (or, in general
_k_) different arms of the testbed by the 11th step. After exploring all of the different options, the agent will have a better initial estimate of the action values compared to the ε-greedy algorithm (which will have done relatively little exploration). The better initial estimates will lead the UCB-based agent to choose a better action on the 11th step.

It is important to note that while the UCB-based agent has relatively good
initial estimates of the action values, each is based on a single sample and is
not (necessarily) accurate. More importantly, after the greedy pick on the 11th
step the chosen action's count increases, which shrinks its uncertainty bonus
term; the upper confidence bounds of the other, less-sampled (and on average
worse) actions then dominate again, pushing the agent back into exploring them.
This forced exploration is what makes the average reward drop on the steps after
the spike, and it fades only as the estimates converge and the counts grow.
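
To make the argument concrete, here is a minimal sketch of the UCB selection rule with untried actions treated as maximizing. It is an illustrative sketch, not the book's (or this repository's) reference implementation; the 10-armed Gaussian testbed, _c_ = 2, and the tie-breaking among untried arms are assumptions chosen to mirror Figure 2.4.

```python
# Sketch of UCB action selection (illustrative, not a reference implementation):
# A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], with untried arms maximizing.
import math
import random

k, c = 10, 2.0
true_values = [random.gauss(0.0, 1.0) for _ in range(k)]
Q = [0.0] * k   # sample-average estimates
N = [0] * k     # selection counts

def ucb_action(t):
    untried = [a for a in range(k) if N[a] == 0]
    if untried:                       # an unexplored arm is maximizing by definition
        return random.choice(untried)
    return max(range(k), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

for t in range(1, 16):
    a = ucb_action(t)
    r = random.gauss(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]         # incremental sample average
    print(f"step {t:2d}: action {a}, reward {r:+.2f}")
```

Running it a few times shows the pattern described above: steps 1–10 cycle through the arms, step 11 is the greedy pick, and step 12 usually switches away again. Shrinking _c_ weakens the bonus term, which is consistent with the hint that the spike is less prominent when _c_ = 1.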
Binary file added Chapter 2/images/Exercise_2_7-1.png
Binary file added Chapter 2/images/Exercise_2_7-2.png
Binary file removed Chapter 2/images/Exercise_2_7.png
Binary file added Chapter 2/images/Exercise_2_7_beta.png
Binary file added Chapter 2/images/Exercise_2_7_o_bar.png
79 changes: 56 additions & 23 deletions Chapter 2/tex_files/exercise2.7.tex
@@ -1,29 +1,62 @@
\documentclass[12pt]{article}

\usepackage[margin=1in]{geometry}
\usepackage{amsmath}

\begin{document}
\thispagestyle{empty}

\noindent Using the definition of the \textbf{sigmoid function} we would have:

$$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1}$$
$$Pr\{A_t = b\} = 1 - Pr\{A_t = a\} = \frac{1}{e^{H_t(a)} + 1}$$

\noindent Extending the definition of the \textbf{soft-max distribution} for the case of two actions (k=2) we get the following:
\pagenumbering{gobble}

$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}}$$
$$Pr\{A_t=b\} = \frac{e^{H_t(b)}}{e^{H_t(a)} + e^{H_t(b)}}$$

\noindent According to the definition of the \textit{numerical preference}, subtracting the same amount from each of the preferences
does not affect the action probabilities. So we can redefine $H_t(a)$ and $H_t(b)$ as:

$$H_t(b) \leftarrow H_t(b) - H_t(b) = 0$$
$$H_t(a) \leftarrow H_t(a) - H_t(b) = H_t(a)$$

\noindent and we get:

$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^0} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1} $$
$$Pr\{A_t=b\} = \frac{e^{0}}{e^{H_t(a)} + e^{0}} = \frac{1}{e^{H_t(a)} + 1}$$
\begin{document}
\noindent
Before diving into the solution, let us first recall the expression that results from the analysis leading to \eqref{eq:sample_average}.
\begin{equation}
\tag{2.6}
\label{eq:sample_average}
Q_{n + 1} = \underbrace{\left(1 - \alpha\right)^nQ_1}_\text{initial bias} + \underbrace{\sum^n_{i = 1} \overbrace{\alpha\left(1 - \alpha\right)^{n - i}}^\text{exponential decay}R_i}_\text{weighted average}
\quad\text{.}
\end{equation}
From the equation, we can see that we need to find an expression for $Q_n$ that contains a recency-weighted average similar to $\sum^n_{i = 1}\alpha\left(1 - \alpha\right)^{n - i}R_i$ but that does not contain an initial bias.
\par\bigskip

\noindent
Let us start by rewriting the incremental average expression for $Q_{n + 1}$.
To keep the notation less cluttered, we will use the following notation: $Q_n \equiv Q_n(a)$ and $R_n \equiv R_n(a)$.
\begin{equation*}
Q_{n + 1} = Q_n + \beta_n\left[R_n - Q_n\right]
\end{equation*}
Now let's substitute $\beta_n$ into the equation and rearrange a little.
\begin{alignat*}{3}
& Q_{n + 1} && = Q_n + \beta_n\left[R_n - Q_n\right] \\
& && = \beta_n R_n + \left(1 - \beta_n\right) Q_n \\
\implies\quad & Q_{n + 1} && = \left(\frac{\alpha}{\bar{o}_n}\right) R_n + \left(1 - \frac{\alpha}{\bar{o}_n}\right) Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \bar{o}_n Q_n - \alpha Q_n \\
& && = \alpha R_n + \left(\bar{o}_n - \alpha\right) Q_n
\end{alignat*}
Now we can substitute in the expression for $\bar{o}_n$ on the right side of our equation and rearrange some more.
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(\bar{o}_n - \alpha\right) Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \left[\bar{o}_{n-1} + \alpha \left(1 - \bar{o}_{n - 1}\right) - \alpha\right] Q_n \\
& && = \alpha R_n + \left(\bar{o}_{n-1} - \alpha \bar{o}_{n - 1}\right) Q_n \\
& && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} Q_n
\end{alignat*}
Now we can substitute in the expression for $Q_n$ and follow the same steps as before.
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} \left[Q_{n - 1} + \beta_{n - 1}\left(R_{n - 1} - Q_{n - 1}\right)\right] \\
& && \vdots \\
& && = \alpha R_n + \left(1 - \alpha\right) \alpha R_{n - 1} + \left(1 - \alpha\right)^2\bar{o}_{n - 2}Q_{n - 1}
\end{alignat*}
We can repeat this whole process recursively until we reach $Q_1$ (our initial action value estimate).
Additionally, we can restructure our equation into a form resembling \eqref{eq:sample_average}:
\begin{equation*}
\bar{o}_n Q_{n + 1} = \left(1 - \alpha\right)^n\bar{o}_0Q_1 + \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i
\quad\text{.}
\end{equation*}
Remember, $\bar{o}_0 \doteq 0$, so we can simplify some more:
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \left(1 - \alpha\right)^n\bar{o}_0Q_1 + \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i \\
& && = \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i \\
\implies\quad & Q_{n + 1} && = \frac{1}{\bar{o}_n} \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i
\quad\text{.}
\end{alignat*}
Finally, we have shown that $Q_{n + 1}$ is an exponential recency-weighted average without initial bias.
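\par\bigskip
\noindent
As an optional final check (not required by the exercise), the weights really do form an average: unrolling the recursion $\bar{o}_n = \bar{o}_{n - 1} + \alpha\left(1 - \bar{o}_{n - 1}\right)$ with $\bar{o}_0 \doteq 0$ gives
\begin{equation*}
\bar{o}_n = \sum^n_{i = 1}\alpha\left(1 - \alpha\right)^{n - i}
\quad\text{,}
\end{equation*}
so the weights $\alpha\left(1 - \alpha\right)^{n - i} / \bar{o}_n$ in the final expression sum to one, and $Q_{n + 1}$ is a genuine weighted average of $R_1, \ldots, R_n$ in which $Q_1$ plays no role.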

\end{document}