diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index 2937a1e0..0eb8ecd8 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -1,5 +1,5 @@
### A Pluto.jl notebook ###
-# v0.12.18
+# v0.12.21

using Markdown
using InteractiveUtils

@@ -27,53 +27,64 @@ end
# ╔═╡ 124ab9da-1ae4-11eb-1c1a-a3b9ac960130
using Turing

-# ╔═╡ 9e8252ac-1af6-11eb-3e19-ddd43d9cd1a9
-begin
-	using StatsPlots
-
-	plot(Beta(2,2), fill=(0, .5,:dodgerblue), ylim=(0,2), legend=false, size=(450, 300))
-	xlabel!("p")
-	ylabel!("Probability")
-	title!("Prior distribution for p")
-end
+# ╔═╡ 3a8615d2-8be3-11eb-3913-b97d04bb053a
+md"### To do list
+
+We are currently working on:
+
+";

# ╔═╡ 9608da48-1acd-11eb-102a-27fbac1ec88c
md" # Probabilistic Programming"

# ╔═╡ 14b8e170-1ad9-11eb-2111-3784e7029ba0
md"
-In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.
+In the previous chapters we introduced some of the basic mathematical tools we are going to make use of throughout the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.
+
+We will start this chapter by discussing the fundamentals of another useful tool, that is, probabilistic programming, and more specifically, how to apply it using probabilistic programming languages or PPLs.
+These are systems, usually embedded inside a programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solve them automatically.

-We will start this chapter discussing the fundamentals of another useful tool, that is, *Probabilisti Programming*, and more specifically, *Probabilistic Programming Languages* or PPL's. These are systems, usually embedded inside some programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solving them automatically.
+In Julia, there are a few PPLs being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing on some examples to explain the general approach when using these tools.

-In Julia, there are a few PPL's being developed, and we will be using two of them, *Turing.jl* and *Soss.jl*. We will be focusing in some examples to explain the general approach when using this tools.
"

# ╔═╡ e8557342-1adc-11eb-3978-33880f4a97a2
md" #### Coin flipping example"

# ╔═╡ babb3d32-1adb-11eb-2f52-7799a2031c2b
-md" We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.
+md" Let's revisit the old example of flipping a coin, but from a Bayesian perspective, as a way to lay down some ideas.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:
- Is getting heads as likely as getting tails?
- Is our coin biased, preferring one output over the other?

-To answer these questions we are going to build a simple model, with the help of our probabilistic programming languages in Julia.
+To answer these questions we are going to build a simple model, with the help of Julia libraries that add PPL capabilities.

-Let's start thinking in a Bayesian way.
The first thing we should ask ourselves is: Do we have any *prior* information about the problem? Since the plausibility of getting heads is formally a probability (let's call it *$p$*), we know it must lay between 0 and 1.
+Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it $p$), we know it must lie between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about $p$. This total uncertainty is also some kind of information we can incorporate in our model. How?

-Because we can assign equal probability for each value of $p$ between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a *Uniform* prior distribution for $p$, and the domain of this function will be between 0 and 1.
+Because we can assign equal probability to each value of $p$ between 0 and 1, while assigning 0 probability to the remaining values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for $p$, and the domain of this function will be between 0 and 1.
"
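To make the total-uncertainty prior concrete, here is a minimal sketch using Distributions.jl; this is an illustration added for this text, not a cell of the notebook above:

```julia
using Distributions

# A Uniform(0, 1) prior assigns the same density to every value of p
# in [0, 1], and zero density outside that interval.
prior = Uniform(0, 1)

pdf(prior, 0.2)  # 1.0: as plausible as any other value in [0, 1]
pdf(prior, 0.8)  # 1.0
pdf(prior, 1.5)  # 0.0: p is a probability, so it cannot exceed 1
```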
# ╔═╡ 10705938-1af3-11eb-27f4-a96a49a43a67
md" So how do we model the outcomes of flipping a coin?

-Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability *p* of success –probability of heads–, these are called *Bernoulli* trials. The experiment of performing a number $N$ of Bernoulli trials, gives us the so called *Binomial distribution*. For a fixed value of $N$ and $p$, the Bernoulli distribution gives us the probability of obtaining different number of heads (and tails too, if we know the total number of trials and the number of times we got heads, we know that the remaining number of times we got tails). Here, $N$ and $p$ are the *parameters* of our distribution.
+Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability *p* of success –probability of heads–, are called Bernoulli trials. The experiment of performing a number $N$ of Bernoulli trials gives us the so-called binomial distribution. For fixed values of $N$ and $p$, the binomial distribution gives us the probability of obtaining each possible number of heads (and of tails too: if we know the total number of trials and the number of heads, we know the rest were tails). Here, $N$ and $p$ are the parameters of our distribution.
"

# ╔═╡ 1ec4a39e-1af3-11eb-29e3-5d78e4c4613a

@@ -83,17 +94,20 @@
@bind p html""

# ╔═╡ 69ad2e2e-1af3-11eb-09e2-bd8e04190c9d
-bar(Binomial(N,p), xlim=(0, 90), label=false, xlabel="Succeses", ylabel="Probability", title="Binomial distribution", color="green", alpha=0.8, size=(500, 350))
+begin
+	using StatsPlots
+	bar(Binomial(N,p), xlim=(0, 90), label=false, xlabel="Successes", ylabel="Probability", title="Binomial distribution", color="green", alpha=0.8, size=(500, 350))
+end
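As a quick check of what the parameters mean, the binomial pmf can be evaluated directly with Distributions.jl (a sketch for illustration, not a notebook cell):

```julia
using Distributions

# Number of heads in N = 10 flips of a coin with success probability p = 0.5.
coin = Binomial(10, 0.5)

pdf(coin, 5)  # ≈ 0.246: five heads is the most likely count
pdf(coin, 0)  # ≈ 0.001: ten tails in a row is rare
mean(coin)    # 5.0, which equals N * p
```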
# ╔═╡ 889c04e2-1ae2-11eb-34c8-7b5a98d73676
md"
-So, as we said, we are going to assume that our data (the count number of heads) is generated by a Binomial distribution. Here, $N$ will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with $p$? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply.
+So, as we said, we are going to assume that our data (the number of heads counted) is generated by a binomial distribution. Here, $N$ will be something we know. We control how we are going to run our experiment, and because of this, we fix this parameter. The question now is: what happens with $p$? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply.

-When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our *Likelihood* function. This function just tells us, given some chosen value of $p$, how likely it is that our data is generated by the Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution, gives us the posterior distribution of $p$ given our data. This is called **sampling**, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov Chain Monte Carlo (MCMC) algorithms. The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner.
+When we perform our experiment, the outcomes will be registered and, in conjunction with the Bernoulli distribution we proposed as the generator of our data, will give us our likelihood function. This function just tells us, given some chosen value of $p$, how likely it is that our data was generated by a Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually, we don't. We just let randomness make the choice.
But there exist multiple types of randomness, and this is where our prior distribution comes into play: we let it make the decision, according to its particular flavor of randomness. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution gives us the posterior distribution of $p$ given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that make all the magic under the hood possible. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov chain Monte Carlo (MCMC) algorithms. The computing complexity of these algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner.
"

# ╔═╡ 102b4be2-1ae4-11eb-049d-470a33703b49
-md"The model *coinflip* is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro *@model* previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way.
+md"The model coinflip is shown below. It is implemented using the Turing.jl library, which will handle all the details about the relationship between the variables of our model, our data, and the sampling and computing. To define a model we use the macro `@model` before a function definition, as we have already done. The argument that this function will receive is the data from our experiment. Inside this function, we must write the explicit relationships of all the variables involved in a logical way.

Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol.
"
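The coinflip cell itself is unchanged by this diff, so it is not shown above; a sketch consistent with the description (a uniform prior on p, one Bernoulli term per outcome) would look like this. The exact cell contents are an assumption here:

```julia
using Turing

@model coinflip(y) = begin
    # Stochastic variable (~): our prior belief about the probability of heads.
    p ~ Uniform(0, 1)

    # Deterministic variable (=): the number of flips is fixed by the data.
    N = length(y)

    # Each outcome is a Bernoulli trial sharing the same success probability p.
    for n in 1:N
        y[n] ~ Bernoulli(p)
    end
end
```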
@@ -115,8 +129,8 @@ end
# ╔═╡ cbe1d1f2-1af4-11eb-0be3-b1a02280acf9
md"
-*coinflip* receives the *N* outcomes of our flips, an array of lenght *N* with 0 or 1 values, 0 values indicating tails and 1 indicating heads.
-The idea is that with each new value of outcome, the model will be updating its believes about the parameter *p* and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter *p*, a success probability, shared for all the outcomes.
+coinflip receives the N outcomes of our flips, an array of length N with 0 or 1 values, 0 values indicating tails and 1 indicating heads.
+The idea is that with each new outcome value, the model updates its beliefs about the parameter *p*, and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter *p*, a success probability, shared by all the outcomes.

Suppose we have run the experiment 10 times and had the outcomes:
"

@@ -142,17 +156,17 @@ end;
# ╔═╡ fa37b106-1af5-11eb-11df-97aada833767
md"
-So now we plot below the posterior distribution of p after our model updated, seeing just the first outcome, a 0 value or a tail.
+So now we plot below the posterior distribution of *p* after our model has updated, having seen just the first outcome: a 0 value, or tails.

-How this single outcome have affected our beliefs about p?
+How has this single outcome affected our beliefs about *p*?

-We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a faliure, so it lowers the probability for values of p that suggest high rates of success.
+We can see in the plot below, showing the posterior or updated distribution of *p*, that values of *p* near 0 now have more probability than before (recall that all values started out equally probable). This makes sense: all our model has seen is a failure, so it lowers the probability of values of *p* that suggest high rates of success.
"

# ╔═╡ 0c570210-1af6-11eb-1d5d-5f78f2a000fd
begin
	p_summary = chain[:p]
-	plot(p_summary, seriestype = :histogram, normed=true, legend=false, size=(400, 250), color="purple", alpha=0.7)
+	plot(p_summary, seriestype = :histogram, normed=true, legend=false, size=(500, 250), color="purple", alpha=0.7)
	title!("Posterior distribution of p after getting tails")
	ylabel!("Probability")
	xlabel!("p")
end

@@ -160,7 +174,7 @@ end
# ╔═╡ 44037220-1af6-11eb-0fee-bbc3f71f6c08
md"
-Let's continue now including the remainig outcomes and see how the model is updated. We have plotted below the posterior probability of *p* adding outcomes to our model updating its beliefs.
+Let's now continue, including the remaining outcomes, and see how the model is updated. Below we have plotted the posterior probability of *p* as we add outcomes and the model updates its beliefs.
"

# ╔═╡ 5280080e-1af6-11eb-3137-75116ca79102
begin

@@ -180,12 +194,23 @@ begin
end

# ╔═╡ 849c78fe-1af6-11eb-20f0-df587758e966
-md" We see that with each new value the model believes more and more that the value of *p* is far from 0 or 1, because if it was the case we would have only heads or tails. The model prefers values the of *p* in between, being the values near 0.5 more plausible with each update."
+md" We see that with each new value, the model believes more and more that the value of *p* is far from 0 and 1, because if that were the case we would get only heads or only tails. The model prefers values of *p* in between, with values near 0.5 becoming more plausible with each update."

# ╔═╡ 9150c71e-1af6-11eb-1036-8b1b45ed95c4
md"What if we wanted to include more previous knowledge about the success rate *p*?

-Let's say we know that the value of *p* is near 0.5 but we are not so sure about the exact value, and we want the model to find the plausibility for the values of *p*. Then including this knowledge, our prior distribution for *p* will have higher probability for values near 0.5, and low probability for values near 0 or 1. Seaching again in our repertoire of distributions, one that fulfill our wishes is a Beta distribution with parameters α=2 and β=2. It is ploted below."
+Let's say we know that the value of *p* is near 0.5, but we are not so sure about the exact value, and we want the model to find the plausibility of each value of *p*. Including this knowledge, our prior distribution for *p* will have higher probability for values near 0.5, and low probability for values near 0 or 1.
+Searching again in our repertoire of distributions, one that fulfills our wishes is a beta distribution with parameters α=2 and β=2. It is plotted below."

# ╔═╡ 9e8252ac-1af6-11eb-3e19-ddd43d9cd1a9
+begin
+	plot(Beta(2,2), fill=(0, .5,:dodgerblue), ylim=(0,2), legend=false, size=(450, 300))
+	xlabel!("p")
+	ylabel!("Probability")
+	title!("Prior distribution for p")
+end
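Two quick checks show why Beta(2,2) encodes "probably near 0.5, but uncertain" (again an illustrative sketch, not notebook code):

```julia
using Distributions

prior = Beta(2, 2)

mean(prior)      # 0.5: centered on a fair coin
pdf(prior, 0.5)  # 1.5: the density peaks at p = 0.5
pdf(prior, 0.0)  # 0.0: a coin that never lands heads is ruled out
```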
# ╔═╡ d0f945f6-1af6-11eb-1f99-e79e7de8af80
md"Now we define again our model just changing the distribution for *p*, as shown:"
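The model cell is again unchanged context that the diff omits; the notebook names it coinflip_beta_prior. A sketch consistent with the text, where only the prior changes, would be:

```julia
@model coinflip_beta_prior(y) = begin
    # Prior knowledge: p is probably near 0.5, with some uncertainty left.
    p ~ Beta(2, 2)

    # The likelihood is unchanged: one Bernoulli trial per outcome.
    for n in 1:length(y)
        y[n] ~ Bernoulli(p)
    end
end
```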
@@ -207,7 +232,7 @@ begin
end

# ╔═╡ dc35ef50-1af6-11eb-2f77-d10aee2744dd
-md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximations for the value of p."
+md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we get a better approximation of *p*."

# ╔═╡ b924b40a-1af7-11eb-13a9-0368b23ab0a4
begin

@@ -226,16 +251,30 @@ begin
end

# ╔═╡ e407fa56-1af7-11eb-18c2-79423a9e4135
-md" To illustrate the affirmation made before, we can compare for example the posterior distributions obtained only with the first 4 outcomes for both models, the one with a uniform prior and the other with the beta prior. The plots are shown below. We see that some values near 0 and 1 have still high probability for the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that *p* near 0 and 1 have less probability, it catchs up faster that probabilities near 0.5 are higher."
+md" To illustrate the claim made above, we can compare, for example, the posterior distributions obtained with only the first 4 outcomes for both models, the one with a uniform prior and the other with the beta prior. The plots are shown below. We see that some values near 0 and 1 still have high probability for the model with a uniform prior for *p*, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that values of *p* near 0 and 1 have less probability, it catches on faster that probabilities near 0.5 are higher."

# ╔═╡ efa0b506-1af7-11eb-2a9a-cb08f7f2d715
-plot(plots[3], plots_[3], title = ["Posterior for Uniform prior and 4 outcomes" "Posterior for Beta prior and 4 outcomes"], titleloc = :center, titlefont = font(8), layout=2, size=(450, 300))
+plot(plots[3], plots_[3], title = ["Posterior for uniform prior and 4 outcomes" "Posterior for beta prior and 4 outcomes"], titleloc = :center, titlefont = font(8), layout=2, size=(450, 300))

# ╔═╡ f719af54-1af7-11eb-05d3-ff9aef8fb6ed
-md"So in this case, incorporating our beliefs in the prior distribution we saw the model reached faster the more plausible values for *p*, needing less outcomes to reach a very similar posterior distribution. When we used an uniform prior, we were conservative, meaning that we said we didn't know anything about *p* so we assign equal probability for all values. Sometimes these kind of distribution (uniform distributions), called a non-informative prior, can be maybe too conservative, being in some cases not helpful at all. They even can slow the convergence of the model to the more plausible values for our posterior, as shown."
+md"So in this case, by incorporating our beliefs in the prior distribution, we saw that the model reached the more plausible values for *p* faster, needing fewer outcomes to arrive at a very similar posterior distribution.
+When we used a uniform prior, we were conservative: we said we didn't know anything about *p*, so we assigned equal probability to all values.
+Sometimes this kind of distribution, called a non-informative prior, can be too conservative, being in some cases not helpful at all.
+It can even slow the convergence of the model to the more plausible values for our posterior, as shown."

# ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
-md" ### Bayesian Bandits "
+md" ### Summary
+
+In this chapter, we gave an introduction to probabilistic programming languages and explored the classic coin flipping example in a Bayesian way.
+
+First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes, 0 or 1, it is a good idea to set our likelihood to a binomial distribution.
+We also learned what sampling is and saw why we use it to update our beliefs.
+Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one.
+We sampled our model with a Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
+
+Finally, we created a new model with the prior probability set to a beta distribution centered on *p* = 0.5, which gave us more accurate results.
+"

# ╔═╡ 95efe302-35a4-11eb-17ac-3f0ad66fb164
md"

@@ -244,7 +283,26 @@ md"
* [Not a monad tutorial article about Soss.jl](https://notamonadtutorial.com/soss-probabilistic-programming-with-julia-6acc5add5549)
"

+# ╔═╡ a77d01ca-8b78-11eb-34dc-b18bf984bf22
+md" ### Give us feedback
+
+This book is currently in a beta version. We are looking forward to getting feedback and criticism:
+ * Submit a GitHub issue **[here](https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues)**.
+ * Email us at **martina.cantaro@lambdaclass.com**.
+
+Thank you!
+"

+# ╔═╡ a8fda810-8b78-11eb-21af-d18ab85709b7
+md"
+[Next chapter](https://datasciencejuliahackers.com/06_gravity.jl.html)
+"

# ╔═╡ Cell order:
+# ╟─3a8615d2-8be3-11eb-3913-b97d04bb053a
# ╟─9608da48-1acd-11eb-102a-27fbac1ec88c
# ╟─14b8e170-1ad9-11eb-2111-3784e7029ba0
# ╟─e8557342-1adc-11eb-3978-33880f4a97a2
@@ -278,5 +336,7 @@ md"
# ╟─e407fa56-1af7-11eb-18c2-79423a9e4135
# ╠═efa0b506-1af7-11eb-2a9a-cb08f7f2d715
# ╟─f719af54-1af7-11eb-05d3-ff9aef8fb6ed
-# ╠═92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
+# ╟─92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
# ╟─95efe302-35a4-11eb-17ac-3f0ad66fb164
+# ╟─a77d01ca-8b78-11eb-34dc-b18bf984bf22
+# ╟─a8fda810-8b78-11eb-21af-d18ab85709b7

diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html
deleted file mode 100644
index 87de8d03..00000000
--- a/docs/05_prob_prog_intro.jl.html
+++ /dev/null
@@ -1,4084 +0,0 @@

-(rendered HTML export of the notebook; its extracted text duplicates the notebook cells shown above, interleaved with cell timings and plot output)
-### Bayesian Bandits
-
-Now we are going to tackle another well-known problem: the bandit, or multi-armed bandit, problem. Although it is conceived as a strategy for a casino situation, there exist a lot of different settings where the same strategy can be applied.
-
-The situation, in its simplest form, goes like this: you are in a casino, with a limited amount of casino chips. In front of you there are some slot machines (say, three of them for simplicity). Each machine has some probability p_m of giving you $1 associated with it, but every machine has a different probability. There are two main problems. First, we don't know these probabilities beforehand, so we will have to develop some explorative process to gather information about the machines. The second problem is that our chips –and thus our possible trials– are limited, and we want to take the most profit we can out of the machines. How do we do this? By finding the machine with the highest success probability and then keeping on playing it. This tradeoff is commonly known as explore vs. exploit. If we had one million chips, we could simply play a lot of times on each machine and thus make a good estimate of their probabilities, but our reward may not be very good, because we would have played so many chips on machines that were not our best option. Conversely, we may have found a machine which we know has a good success probability, but if we don't also explore the other machines, we won't know if it is the best of our options.
-
-This is a kind of problem that is very well suited to the Bayesian way of thinking. We start with some information about the slot machines (in the worst case, we know nothing), and we update our beliefs with the results of our trials. A methodology exists for these explore vs. exploit dilemmas, called Thompson sampling. The algorithm can be thought of as these successive steps, sketched in code below:
-
-  • First, assign some probability distribution to your knowledge of the success probability of each slot machine.
-  • Sample randomly from each of these distributions and check which is the maximum sampled probability.
-  • Pull the arm of the machine corresponding to that maximum value.
-  • Update the corresponding probability distribution with the result of the experiment.
-  • Repeat from step 2.
-
-Here we will take advantage of some math that can be used to model our situation. To model the generation of our data, we can use a distribution we have already introduced: the binomial distribution. We choose it because every trial can give us a success (we win $1) or a failure (we don't win anything), and this distribution has some parameter p that we don't know with certainty. So we use a prior to set our knowledge before making a trial on the slot machine. The thing is, there exists a mathematical hack called conjugate priors. When a likelihood distribution is multiplied by its conjugate prior, the posterior distribution is the same as the prior with its corresponding parameters updated. This trick frees us from the need for more computation-expensive techniques such as MCMC. In the particular case of the binomial distribution, the conjugate prior is the beta distribution. This is a very flexible distribution, as we can obtain a lot of other distributions as particular cases of the beta with specific combinations of its parameters.
-
-pull_arm (generic function with 1 method)
-sample_bandit (generic function with 1 method)
-update_bandit (generic function with 1 method)
-N_TRIALS = 100
-BANDIT_PROBABILITIES
-beta_bandit_experiment (generic function with 1 method)
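Only the cell-output names survive from the deleted page's code (pull_arm, sample_bandit, update_bandit, beta_bandit_experiment, N_TRIALS = 100, BANDIT_PROBABILITIES). The following Beta-Bernoulli Thompson sampling loop is therefore a reconstruction of the idea under those names, not the original cells, and the arm probabilities are made up:

```julia
using Distributions

# True success probabilities of the three machines. These values are
# hypothetical: the original BANDIT_PROBABILITIES are not recoverable.
const BANDIT_PROBABILITIES = [0.15, 0.5, 0.6]
const N_TRIALS = 100

# One Bernoulli trial on a machine with success probability p.
pull_arm(p) = rand() < p ? 1 : 0

function beta_bandit_experiment(probs, n_trials)
    # Step 1: a Beta(1, 1) (i.e., uniform) belief per machine's success rate.
    beliefs = [Beta(1, 1) for _ in probs]
    total_reward = 0
    for _ in 1:n_trials
        # Step 2: sample each belief and find the maximum sampled probability.
        arm = argmax([rand(b) for b in beliefs])
        # Step 3: pull the corresponding arm.
        reward = pull_arm(probs[arm])
        total_reward += reward
        # Step 4: conjugate update, so no MCMC is needed:
        # a success adds 1 to α, a failure adds 1 to β.
        α, β = params(beliefs[arm])
        beliefs[arm] = Beta(α + reward, β + 1 - reward)
        # Step 5: repeat.
    end
    return beliefs, total_reward
end

beliefs, total_reward = beta_bandit_experiment(BANDIT_PROBABILITIES, N_TRIALS)
```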