
Update Exercises 2.7 and 2.8 to 2nd edition version #14


Open · wants to merge 2 commits into base: master
11 changes: 4 additions & 7 deletions Chapter 2/Exercise 2.6.md
@@ -2,10 +2,10 @@

## Question:
The results shown in Figure 2.3 should be quite reliable because they
are averages over 2000 individual, randomly chosen 10-armed bandit
tasks. Why, then, are there oscillations and spikes in the early part
of the curve for the optimistic method? In other words, what might
make this method perform particularly better or worse, on average,
on particular early steps?

## Answer:
@@ -16,6 +16,3 @@ and has to try each multiple times before it realises they're actually
bad. This will depend on how quickly these bad options drop below the
best option (once that occurs, only occasional exploration continues to
reduce them toward their correct values, at a much slower rate).
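
The decay argument is easy to see numerically. As a rough illustration (not part of the committed answer), the sketch below assumes the optimistic setup used for Figure 2.3 in the book: constant step size _α_ = 0.1, initial estimate _Q₁_ = +5, and a true action value near 0 with unit-variance reward noise.

```python
# Rough illustration (not from the original answer): decay of one optimistic
# estimate under the constant step-size update Q <- Q + alpha * (R - Q),
# with alpha = 0.1, Q1 = +5, and rewards drawn around a true value of 0.
import random

alpha, q, true_value = 0.1, 5.0, 0.0
for n in range(1, 11):
    reward = random.gauss(true_value, 1.0)  # unit-variance reward noise
    q += alpha * (reward - q)               # constant step-size update
    print(f"after selection {n}: Q = {q:.2f}")
```

With _α_ = 0.1 the optimistic offset shrinks by only 10% per selection, so each action has to be tried repeatedly before its estimate approaches its true value, which is consistent with the answer above.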



24 changes: 19 additions & 5 deletions Chapter 2/Exercise 2.7.md
@@ -1,11 +1,25 @@
# Exercise 2.7

## Question:
Show that in the case of two actions, the soft-max distribution is the same as
that given by the logistic, or sigmoid, function often used in statistics and
artificial neural networks.
In most of this chapter we have used sample averages to estimate action values
because sample averages do not produce the initial bias that constant step sizes
do (see the analysis leading to (2.6)). However, sample averages are not a
completely satisfactory solution because they may perform poorly on
nonstationary problems. Is it possible to avoid the bias of constant step sizes
while retaining their advantages on nonstationary problems? One way is to use a
step size of

## Answer:
![Exercise 2.7 beta equation](images/Exercise_2_7_beta.png)

to process the nth reward for a particular action, where _α_ > 0 is a
conventional constant step size, and _ōₙ_ is a trace of one that starts at 0:

![Exercise 2.7 o-bar equation](images/Exercise_2_7_o_bar.png)

![Exercise 2.7 solution](images/Exercise_2_7.png)
Carry out an analysis like that in (2.6) to show that _Qₙ_ is an exponential
recency-weighted average _without initial bias_.

## Answer:

![Exercise 2.7 solution page 1](images/Exercise_2_7-1.png)
![Exercise 2.7 solution page 2](images/Exercise_2_7-2.png)
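
The pages above are purely algebraic. As an informal numerical cross-check (not part of the committed answer), the sketch below runs the incremental update with _βₙ_ = _α_/_ōₙ_ and compares it with the closed form derived in the LaTeX source further down; _α_ = 0.1, the reward sequence, and the deliberately absurd initial estimate are arbitrary choices for illustration.

```python
# Sketch of the unbiased constant-step-size trick: beta_n = alpha / o_bar_n.
# alpha, the rewards, and the initial estimate are arbitrary illustrative values.
alpha = 0.1
rewards = [1.0, 0.0, 2.0, -1.0, 0.5]

q, o_bar = 123.0, 0.0                 # absurd Q1 to show the initial bias vanishes
for r in rewards:
    o_bar += alpha * (1.0 - o_bar)    # trace of one: o_bar_n = o_bar_{n-1} + alpha(1 - o_bar_{n-1})
    q += (alpha / o_bar) * (r - q)    # Q_{n+1} = Q_n + beta_n (R_n - Q_n)

# Closed form from the derivation: Q_{n+1} = (1 / o_bar_n) * sum_i alpha (1-alpha)^(n-i) R_i
n = len(rewards)
closed = sum(alpha * (1 - alpha) ** (n - i) * r
             for i, r in enumerate(rewards, start=1)) / o_bar
print(q, closed)                      # identical up to float error; Q1 never appears
```

Note that _β₁_ = _α_/_ō₁_ = 1, so the very first update overwrites _Q₁_ completely, which is exactly where the absence of initial bias comes from.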
43 changes: 15 additions & 28 deletions Chapter 2/Exercise 2.8.md
@@ -1,33 +1,20 @@
# Exercise 2.8

## Question:
Suppose you face a 2-armed bandit task whose true action values change randomly from time step
to time step. Specifically, suppose that, for any time step, the true values of action 1 and 2
are respectively 0.1 and 0.2 with probability 0.5 (case A), and 0.9 and 0.8 with probability 0.5
(case B). If you are not able to tell which case you face at any step, what is the best expectation
of success you can achieve and how should you behave to achieve it? Now suppose that on each step
you are told whether you are facing case A or case B (although you still don't know the true action
values). This is an associative search task. What is the best expectation of success you can
achieve in this task, and how should you behave to achieve it?
In Figure 2.4 the UCB algorithm shows a distinct spike in performance on the
11th step. Why is this? Note that for your answer to be fully satisfactory it
must explain both why the reward increases on the 11th step and why it
decreases on the subsequent steps. Hint: if _c_ = 1, then the spike is less
prominent.

## Answer:

For the first scenario, you cannot hold separate estimates for cases A and B. Therefore, the best
approach is to select the action with the best value estimate averaged over both cases. Here the
expected values of both actions are the same, so the best expectation of success is 0.5, and it can be
achieved by selecting either action (for example at random) at each step.


A<sub>1</sub> = 0.5 \* 0.1 + 0.5 \* 0.9 = 0.5

A<sub>2</sub> = 0.5 \* 0.2 + 0.5 \* 0.8 = 0.5

For the second scenario, you can hold independent estimates for cases A and B, so you can learn the
best action for each one by treating them as two independent bandit problems. The best expectation
of success is 0.55, obtained by selecting A<sub>2</sub> in case A and A<sub>1</sub> in case B.

0.5 \* 0.2 + 0.5 \* 0.9 = 0.55




Based on the definition of the UCB algorithm, the agent is encouraged to
explore. Remember, any previously unexplored action is by definition a
maximizing action, so the agent will have explored all of the 10 (or, in general
_k_) different arms of the testbed by the 11th step. After exploring all of the different options, the agent will have a better initial estimate of the action values compared to the ε-greedy algorithm (which will have done relatively little exploration). The better initial estimates will lead the UCB-based agent to choose a better action on the 11th step.

It is important to note that while the UCB-based agent has relatively good
initial estimates of the action values, each is based on a single sample and is
not (necessarily) accurate. More importantly, after the greedy pick on the 11th
step the chosen action's count increases, which shrinks its uncertainty bonus
term; the upper confidence bounds of the other, less-sampled (and on average
worse) actions then dominate again, pushing the agent back into exploring them.
This forced exploration is what makes the average reward drop on the steps after
the spike, and it fades only as the estimates converge and the counts grow.
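
To make the argument concrete, here is a minimal sketch of the UCB selection rule with untried actions treated as maximizing. It is an illustrative sketch, not the book's (or this repository's) reference implementation; the 10-armed Gaussian testbed, _c_ = 2, and the tie-breaking among untried arms are assumptions chosen to mirror Figure 2.4.

```python
# Sketch of UCB action selection (illustrative, not a reference implementation):
# A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ], with untried arms maximizing.
import math
import random

k, c = 10, 2.0
true_values = [random.gauss(0.0, 1.0) for _ in range(k)]
Q = [0.0] * k   # sample-average estimates
N = [0] * k     # selection counts

def ucb_action(t):
    untried = [a for a in range(k) if N[a] == 0]
    if untried:                       # an unexplored arm is maximizing by definition
        return random.choice(untried)
    return max(range(k), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))

for t in range(1, 16):
    a = ucb_action(t)
    r = random.gauss(true_values[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]         # incremental sample average
    print(f"step {t:2d}: action {a}, reward {r:+.2f}")
```

Running it a few times shows the pattern described above: steps 1–10 cycle through the arms, step 11 is the greedy pick, and step 12 usually switches away again. Shrinking _c_ weakens the bonus term, which is consistent with the hint that the spike is less prominent when _c_ = 1.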
Binary file added Chapter 2/images/Exercise_2_7-1.png
Binary file added Chapter 2/images/Exercise_2_7-2.png
Binary file removed Chapter 2/images/Exercise_2_7.png
Binary file added Chapter 2/images/Exercise_2_7_beta.png
Binary file added Chapter 2/images/Exercise_2_7_o_bar.png
79 changes: 56 additions & 23 deletions Chapter 2/tex_files/exercise2.7.tex
@@ -1,29 +1,62 @@
\documentclass[12pt]{article}

\usepackage[margin=1in]{geometry}
\usepackage{amsmath}

\begin{document}
\thispagestyle{empty}

\noindent Using the definition of the \textbf{sigmoid function} we would have:

$$Pr\{A_t = a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1}$$
$$Pr\{A_t = b\} = 1 - Pr\{A_t = a\} = \frac{1}{e^{H_t(a)} + 1}$$

\noindent Extending the definition of the \textbf{soft-max distribution} for the case of two actions (k=2) we get the following:
\pagenumbering{gobble}

$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}}$$
$$Pr\{A_t=b\} = \frac{e^{H_t(b)}}{e^{H_t(a)} + e^{H_t(b)}}$$

\noindent According to the definition of the \textit{numerical preference}, subtracting the same amount from each of the preferences
does not affect the action probabilities. So we can redefine $H_t(a)$ and $H_t(b)$ as:

$$H_t(b) \leftarrow H_t(b) - H_t(b) = 0$$
$$H_t(a) \leftarrow H_t(a) - H_t(b) = H_t(a)$$

\noindent and we get:

$$Pr\{A_t=a\} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^0} = \frac{e^{H_t(a)}}{e^{H_t(a)} + 1} $$
$$Pr\{A_t=b\} = \frac{e^{0}}{e^{H_t(a)} + e^{0}} = \frac{1}{e^{H_t(a)} + 1}$$
\begin{document}
\noindent
Before diving into the solution, let us first recall the expression that results from the analysis leading to \eqref{eq:sample_average}.
\begin{equation}
\tag{2.6}
\label{eq:sample_average}
Q_{n + 1} = \underbrace{\left(1 - \alpha\right)^nQ_1}_\text{initial bias} + \underbrace{\sum^n_{i = 1} \overbrace{\alpha\left(1 - \alpha\right)^{n - i}}^\text{exponential decay}R_i}_\text{weighted average}
\quad\text{.}
\end{equation}
From the equation, we can see that we need to find an expression for $Q_n$ that contains a recency-weighted average similar to $\sum^n_{i = 1}\alpha\left(1 - \alpha\right)^{n - i}R_i$ but that does not contain an initial bias.
\par\bigskip

\noindent
Let us start by rewriting the incremental average expression for $Q_{n + 1}$.
To keep the notation less cluttered, we will use the following notation: $Q_n \equiv Q_n(a)$ and $R_n \equiv R_n(a)$.
\begin{equation*}
Q_{n + 1} = Q_n + \beta_n\left[R_n - Q_n\right]
\end{equation*}
Now let's substitute $\beta_n$ into the equation and rearrange a little.
\begin{alignat*}{3}
& Q_{n + 1} && = Q_n + \beta_n\left[R_n - Q_n\right] \\
& && = \beta_n R_n + \left(1 - \beta_n\right) Q_n \\
\implies\quad & Q_{n + 1} && = \left(\frac{\alpha}{\bar{o}_n}\right) R_n + \left(1 - \frac{\alpha}{\bar{o}_n}\right) Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \bar{o}_n Q_n - \alpha Q_n \\
& && = \alpha R_n + \left(\bar{o}_n - \alpha\right) Q_n
\end{alignat*}
Now we can substitute in the expression for $\bar{o}_n$ on the right side of our equation and rearrange some more.
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(\bar{o}_n - \alpha\right) Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \left[\bar{o}_{n-1} + \alpha \left(1 - \bar{o}_{n - 1}\right) - \alpha\right] Q_n \\
& && = \alpha R_n + \left(\bar{o}_{n-1} - \alpha \bar{o}_{n - 1}\right) Q_n \\
& && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} Q_n
\end{alignat*}
Now we can substitute in the expression for $Q_n$ and follow the same steps as before.
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} Q_n \\
\implies\quad & \bar{o}_n Q_{n + 1} && = \alpha R_n + \left(1 - \alpha \right) \bar{o}_{n - 1} \left[Q_{n - 1} + \beta_{n - 1}\left(R_{n - 1} - Q_{n - 1}\right)\right] \\
& && \vdots \\
& && = \alpha R_n + \left(1 - \alpha\right) \alpha R_{n - 1} + \left(1 - \alpha\right)^2\bar{o}_{n - 2}Q_{n - 1}
\end{alignat*}
We can repeat this whole process recursively until we reach $Q_1$ (our initial action value estimate).
Additionally, we can restructure our equation into a form resembling \eqref{eq:sample_average}:
\begin{equation*}
\bar{o}_n Q_{n + 1} = \left(1 - \alpha\right)^n\bar{o}_0Q_1 + \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i
\quad\text{.}
\end{equation*}
Remember, $\bar{o}_0 \doteq 0$, so we can simplify some more:
\begin{alignat*}{2}
& \bar{o}_n Q_{n + 1} && = \left(1 - \alpha\right)^n\bar{o}_0Q_1 + \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i \\
& && = \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i \\
\implies\quad & Q_{n + 1} && = \frac{1}{\bar{o}_n} \sum^n_{i = 1}\alpha \left(1 - \alpha\right)^{n - i}R_i
\quad\text{.}
\end{alignat*}
Finally, we have shown that $Q_{n + 1}$ is an exponential recency-weighted average without initial bias.
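\par\bigskip
\noindent
As an optional final check (not required by the exercise), the weights really do form an average: unrolling the recursion $\bar{o}_n = \bar{o}_{n - 1} + \alpha\left(1 - \bar{o}_{n - 1}\right)$ with $\bar{o}_0 \doteq 0$ gives
\begin{equation*}
\bar{o}_n = \sum^n_{i = 1}\alpha\left(1 - \alpha\right)^{n - i}
\quad\text{,}
\end{equation*}
so the weights $\alpha\left(1 - \alpha\right)^{n - i} / \bar{o}_n$ in the final expression sum to one, and $Q_{n + 1}$ is a genuine weighted average of $R_1, \ldots, R_n$ in which $Q_1$ plays no role.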

\end{document}