From 313be7cbddd8531baf046d2adcb91ecf3defcfa1 Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Thu, 4 Mar 2021 11:07:13 -0300
Subject: [PATCH 01/13] Chapter 5: Summary added

---
 05_prob_prog_intro/05_prob_prog_intro.jl |   12 +-
 docs/05_prob_prog_intro.jl.html          | 4024 ++++++++++------------
 2 files changed, 1798 insertions(+), 2238 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index 2937a1e0..00d6b533 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -1,5 +1,5 @@
 ### A Pluto.jl notebook ###
-# v0.12.18
+# v0.12.21
 
 using Markdown
 using InteractiveUtils
@@ -235,7 +235,13 @@ plot(plots[3], plots_[3], title = ["Posterior for Uniform prior and 4 outcomes"
 md"So in this case, incorporating our beliefs in the prior distribution we saw the model reached faster the more plausible values for *p*, needing less outcomes to reach a very similar posterior distribution. When we used an uniform prior, we were conservative, meaning that we said we didn't know anything about *p* so we assign equal probability for all values. Sometimes these kind of distribution (uniform distributions), called a non-informative prior, can be maybe too conservative, being in some cases not helpful at all. They even can slow the convergence of the model to the more plausible values for our posterior, as shown."
 
 # ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
-md" ### Bayesian Bandits "
+md" ### Summary
+
+In this chapter we gave an introduction to Probabilistic Programming Languages exploring the classic coin flipping example in a Bayesian way.
+
+We use the Julia library Turing.jl to instantiate a model where we set the prior probability and the distribution of the outcomes of our experiment. Then we use the Markov Chain Monte Carlo algorithm for sampling and saw how our posterior distribution updates with the input of new outcomes.
+
+Finally, we experiment on how changes in our prior distributions affect the results we obtain"
 
 # ╔═╡ 95efe302-35a4-11eb-17ac-3f0ad66fb164
 md"
@@ -278,5 +284,5 @@ md"
 # ╟─e407fa56-1af7-11eb-18c2-79423a9e4135
 # ╠═efa0b506-1af7-11eb-2a9a-cb08f7f2d715
 # ╟─f719af54-1af7-11eb-05d3-ff9aef8fb6ed
-# ╠═92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
+# ╟─92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
 # ╟─95efe302-35a4-11eb-17ac-3f0ad66fb164

diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html
index 87de8d03..4b4d9039 100644
--- a/docs/05_prob_prog_intro.jl.html
+++ b/docs/05_prob_prog_intro.jl.html
@@ -5,13 +5,11 @@
[regenerated HTML of the rendered notebook; its extracted text content follows]

Probabilistic Programming


In the previous chapter we introduced some of the basic mathematical tools we will make use of throughout the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.


We will start this chapter discussing the fundamentals of another useful tool: Probabilistic Programming, and more specifically, Probabilistic Programming Languages or PPLs. These are systems, usually embedded inside some programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solve them automatically.

In Julia, there are a few PPLs under development, and we will be using two of them: Turing.jl and Soss.jl. We will focus on some examples to explain the general approach when using these tools.


Coin flipping example

We are now going to tackle a well-known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:

  • Is getting heads as likely as getting tails?


To answer these questions we are going to build a simple model, with the help of our probabilistic programming languages in Julia.

Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lie between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about p. This total uncertainty is also a kind of information we can incorporate into our model. How? Because we can assign equal probability to each value of p between 0 and 1, while assigning 0 probability to the remaining values. This just means we don't know anything and that every outcome is equally possible. Translated into a probability distribution, it means that we are going to use a Uniform prior distribution for p, and the domain of this function will be between 0 and 1.
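To make this concrete, here is a minimal sketch of that prior, using Distributions.jl and Plots.jl in the style of the notebook's other cells (the original cell's code is not preserved in this diff):

using Distributions, Plots

# Total uncertainty about p: a Uniform(0, 1) prior
prior = Uniform(0, 1)

plot(x -> pdf(prior, x), 0, 1, legend=false)
title!("Uniform prior distribution for p")
ylabel!("Probability density")
xlabel!("p")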

[Plot: the Uniform(0, 1) prior density for p]

So how do we model the outcomes of flipping a coin?


Well, if we search for a similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case– and some probability p of success –the probability of heads– are called Bernoulli trials. The experiment of performing a number N of Bernoulli trials gives us the so-called Binomial distribution. For a fixed value of N and p, the Binomial distribution gives us the probability of obtaining each possible number of heads (and of tails too: if we know the total number of trials and the number of heads, we know the remaining flips were tails). Here, N and p are the parameters of our distribution.
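As a quick illustration (an assumed example, not a cell recovered from the notebook), these probabilities can be computed directly with Distributions.jl:

using Distributions, Plots

# Probability of k heads in N = 10 Bernoulli trials, for two values of p
N = 10
plot(0:N, pdf.(Binomial(N, 0.5), 0:N), seriestype=:bar, alpha=0.5, label="p = 0.5")
plot!(0:N, pdf.(Binomial(N, 0.7), 0:N), seriestype=:bar, alpha=0.5, label="p = 0.7")
xlabel!("Number of heads")
ylabel!("Probability")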

[Plot: Binomial distributions for different values of N and p]

So, as we said, we are going to assume that our data (the number of heads) is generated by a Binomial distribution. Here, N is something we know: we control how the experiment is run, so we fix this parameter. The question now is: what happens with p? Well, that is exactly what we want to estimate! The only thing we know so far is that it is some value from 0 to 1, and every value in that range is equally likely to apply.

When we perform our experiment, the outcomes will be registered, and in conjunction with the Bernoulli distribution we proposed as the generator of our data, we will get our likelihood function. This function just tells us, for some chosen value of p, how likely it is that our data was generated by a Bernoulli distribution with that value of p. How do we choose p? Well, actually we don't. We just let randomness make the choice. But there exist multiple types of randomness, and this is where our prior distribution comes into play: we let it make the decision, according to its particular flavor of randomness. Computing the value of this likelihood function for a large number of p samples taken from our prior distribution gives us the posterior distribution of p given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that make all the magic under the hood work. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are called Markov Chain Monte Carlo (MCMC) algorithms. The computational cost of these algorithms can get very high as the complexity of the model increases, so there is a lot of research being done on intelligent ways of sampling to compute posterior distributions more efficiently.


The model coinflip is shown below. It is implemented using the Turing.jl library, which handles all the details about the relationship between the variables of our model, our data, and the sampling and computation. To define a model we use the macro @model before a function definition, as we have already done. The argument this function receives is the data from our experiment. Inside the function, we must write the explicit relationships between all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution– are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables– are defined with a '=' symbol.
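The code of the cell itself is not preserved in this diff; a minimal sketch consistent with the description above would be:

using Turing

@model function coinflip(outcomes)
    # Stochastic variable: the prior for p, uniform over [0, 1]
    p ~ Uniform(0, 1)
    # Deterministic variable: the number of flips
    N = length(outcomes)
    # Each flip is a Bernoulli trial with shared success probability p
    for i in 1:N
        outcomes[i] ~ Bernoulli(p)
    end
end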

coinflip (generic function with 1 method)

coinflip receives the N outcomes of our flips: an array of length N with 0 or 1 values, 0 indicating tails and 1 indicating heads. The idea is that with each new value of outcome, the model updates its beliefs about the parameter p, and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with parameter p, a success probability shared by all the outcomes.


Suppose we have run the experiment 10 times and had the outcomes:

[Code cell: the outcome vector of the 10 flips]

So, we got 6 heads and 4 tails.


Now we are going to see how the model for our unknown parameter p is updated. We will start by giving the model only the first input value, adding one input at a time. Finally, we will give the model all the outcome values as input.
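A sketch of the sampling step (the sampler and its settings here are assumptions; the notebook's exact choices are not visible in this diff):

# Sample the posterior of p after seeing only the first outcome
chain = sample(coinflip(outcome[1:1]), HMC(0.05, 10), 1000)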


So now we plot below the posterior distribution of p after our model has updated, having seen just the first outcome: a 0 value, or tails.


How has this single outcome affected our beliefs about p?

We can see in the plot below, showing the posterior or updated distribution of p, that values of p near 0 now have more probability than before (recall that all values started with equal probability). This makes sense: all our model has seen is a failure, so it lowers the probability of values of p that suggest high rates of success.
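The plotting cell for this figure does survive in the notebook source (it is the cell touched by [PATCH 02/13] below), with chain holding the samples drawn above:

begin
    p_summary = chain[:p]
    plot(p_summary, seriestype = :histogram, normed=true, legend=false, size=(400, 250), color="purple", alpha=0.7)
    title!("Posterior distribution of p after getting tails")
    ylabel!("Probability")
    xlabel!("p")
end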

[Plot: posterior distribution of p after observing one tails]

Let's continue now, including the remaining outcomes and seeing how the model is updated. Below we have plotted the posterior probability of p as outcomes are added and the model updates its beliefs.

[Plots: posterior distributions of p as each of the 10 outcomes is added]

We see that with each new value the model becomes more and more convinced that the value of p is far from 0 and 1, because if that were the case we would get only heads or only tails. The model prefers values of p in between, with values near 0.5 becoming more plausible with each update.


What if we wanted to include more previous knowledge about the success rate p?


Let's say we know that the value of p is near 0.5, but we are not so sure about the exact value, and we want the model to find the plausibility of the values of p. Including this knowledge, our prior distribution for p will have higher probability for values near 0.5, and low probability for values near 0 or 1. Searching again in our repertoire of distributions, one that fulfills our wishes is a Beta distribution with parameters α=2 and β=2. It is plotted below.
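A sketch of that prior (assumed code, matching the plot described):

using Distributions, Plots

# Beta(2, 2): mass concentrated around 0.5, vanishing at 0 and 1
plot(x -> pdf(Beta(2, 2), x), 0, 1, legend=false)
title!("Beta(2, 2) prior distribution for p")
ylabel!("Probability density")
xlabel!("p")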

[Plot: the Beta(2, 2) prior density for p]

Now we define again our model just changing the distribution for p, as shown:
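Again the cell's code is not preserved here; a sketch of the modified model:

@model function coinflip_beta_prior(outcomes)
    # Informative prior: p is most likely near 0.5
    p ~ Beta(2, 2)
    N = length(outcomes)
    for i in 1:N
        outcomes[i] ~ Bernoulli(p)
    end
end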

coinflip_beta_prior (generic function with 1 method)

Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we get a better approximation of the value of p.
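For instance (same assumed sampler settings as before):

chain_beta = sample(coinflip_beta_prior(outcome), HMC(0.05, 10), 1000)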

[Plots: posterior distributions of p with the Beta(2, 2) prior, adding one outcome at a time]

To illustrate the claim made before, we can compare the posterior distributions obtained with only the first 4 outcomes for both models: the one with the uniform prior and the one with the beta prior. The plots are shown below. We see that some values near 0 and 1 still have high probability for the model with a uniform prior on p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that p values near 0 and 1 have low probability, it catches on faster that probabilities near 0.5 are higher.

[Plot: posterior after 4 outcomes, Uniform prior vs. Beta prior]

So in this case, by incorporating our beliefs into the prior distribution, we saw the model reach the more plausible values of p faster, needing fewer outcomes to arrive at a very similar posterior distribution. When we used a uniform prior, we were being conservative: we said we didn't know anything about p, so we assigned equal probability to all values. This kind of distribution (a uniform one), called a non-informative prior, can sometimes be too conservative, and in some cases not helpful at all. It can even slow the convergence of the model toward the more plausible values of our posterior, as shown.


Bayesian Bandits


Now we are going to tackle another well-known problem: the bandit, or multi-armed bandit, problem. Although it is usually framed in terms of a casino, there are many different settings where the same strategy applies.


The situation, in its simplest form, goes like this: you are in a casino with a limited number of casino chips. In front of you are some slot machines (say, three of them, for simplicity). Each machine has some probability pm of giving you $1, and every machine's probability is different. There are two main problems. First, we don't know these probabilities beforehand, so we will have to develop some exploratory process to gather information about the machines. Second, our chips –and thus our possible trials– are limited, and we want to extract the most profit we can from the machines. How do we do this? By finding the machine with the highest success probability and continuing to play on it. This tradeoff is commonly known as explore vs. exploit. If we had a million chips, we could simply play many times on each machine and get good estimates of their probabilities, but our reward might not be very good, because we would have spent many chips on machines that were not our best option. Conversely, we may have found a machine with a good success probability, but if we don't also explore the other machines, we won't know whether it is the best of our options.


This is a kind of problem that is very well suited to the Bayesian way of thinking. We start with some information about the slot machines (in the worst case, we know nothing), and we update our beliefs with the results of our trials. A methodology exists for these explore vs. exploit dilemmas, called Thompson sampling. The algorithm can be thought of as the following successive steps (a code sketch is given after the explanation below):


  • First, assign some probability distribution for your knowledge of the success probability of each slot machine.

  • Sample randomly from each of these distributions and check which is the maximum sampled probability.

  • Pull the arm of the machine corresponding to that maximum value.

  • Update the probability with the result of the experiment.

  • Repeat from step 2.

Here we will take advantage of the math that can be used to model our situation. To model the generation of our data, we can use a distribution we have already introduced, the Binomial distribution. We choose it because every trial can give us a success (we win $1) or a failure (we win nothing), and this distribution has some parameter p that we don't know with certainty, so we use a prior to set our knowledge before making a trial on the slot machine. The thing is, there exists a mathematical hack called conjugate priors: when a likelihood distribution is multiplied by its conjugate prior, the posterior distribution is the same as the prior with its corresponding parameters updated. This trick frees us from needing more computationally expensive techniques such as MCMC. In the particular case of the Binomial distribution, the conjugate prior is the Beta distribution. This is a very flexible distribution, as we can obtain a lot of other distributions as particular cases of the Beta with specific combinations of its parameters; below you can see some of the fancy shapes it can take.
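Putting the steps and the conjugate Beta update together, here is a sketch of the whole procedure. The function names follow the outputs visible below (pull_arm, update_bandit, N_TRIALS, BANDIT_PROBABILITIES); their bodies, and the machine probabilities, are assumptions, since the original cells are not preserved in this diff.

using Distributions

# Simulate one play: $1 with probability p, nothing otherwise
pull_arm(p) = rand() < p ? 1 : 0

# Conjugate update: a success increments α, a failure increments β
update_bandit(prior::Beta, reward) = Beta(prior.α + reward, prior.β + 1 - reward)

function thompson_sampling(bandit_probabilities, n_trials)
    beliefs = [Beta(1, 1) for _ in bandit_probabilities]  # uniform priors
    total_reward = 0
    for _ in 1:n_trials
        samples = rand.(beliefs)        # one draw from each belief
        i = argmax(samples)             # pick the most promising machine
        reward = pull_arm(bandit_probabilities[i])
        beliefs[i] = update_bandit(beliefs[i], reward)
        total_reward += reward
    end
    return beliefs, total_reward
end

N_TRIALS = 100
BANDIT_PROBABILITIES = [0.3, 0.5, 0.7]  # assumed values; not shown in this diff
beliefs, reward = thompson_sampling(BANDIT_PROBABILITIES, N_TRIALS)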
[Plot: Beta distributions for several combinations of its α and β parameters]

pull_arm (generic function with 1 method)
sample_bandit (generic function with 1 method)
update_bandit (generic function with 1 method)
N_TRIALS = 100
BANDIT_PROBABILITIES
beta_bandit_experiment (generic function with 1 method)

[Plot: output of beta_bandit_experiment after the 100 trials]

Summary

In this chapter we gave an introduction to Probabilistic Programming Languages, exploring the classic coin flipping example in a Bayesian way.

We used the Julia library Turing.jl to instantiate a model where we set the prior probability and the distribution of the outcomes of our experiment. We then used a Markov Chain Monte Carlo algorithm for sampling, and saw how our posterior distribution updates with the input of new outcomes.

Finally, we experimented with how changes in our prior distributions affect the results we obtain.

References

  • Turing.jl website
  • Not a monad tutorial article about Soss.jl
From 8bc0cd3487696942752deb909f4bf0789b84dfc1 Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Tue, 9 Mar 2021 16:38:58 -0300
Subject: [PATCH 02/13] corrections

---
 05_prob_prog_intro/05_prob_prog_intro.jl |    7 +-
 docs/05_prob_prog_intro.jl.html          | 3590 +++++++++++-----------
 2 files changed, 1802 insertions(+), 1795 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index 00d6b533..b7912a3b 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -152,7 +152,7 @@ We can see in the plot below, showing the posterior or updated distribution of *
 # ╔═╡ 0c570210-1af6-11eb-1d5d-5f78f2a000fd
 begin
 	p_summary = chain[:p]
-	plot(p_summary, seriestype = :histogram, normed=true, legend=false, size=(400, 250), color="purple", alpha=0.7)
+	plot(p_summary, seriestype = :histogram, normed=true, legend=false, size=(500, 250), color="purple", alpha=0.7)
 	title!("Posterior distribution of p after getting tails")
 	ylabel!("Probability")
 	xlabel!("p")
@@ -239,9 +239,10 @@ md" ### Summary
 
 In this chapter we gave an introduction to Probabilistic Programming Languages exploring the classic coin flipping example in a Bayesian way.
 
-We use the Julia library Turing.jl to instantiate a model where we set the prior probability and the distribution of the outcomes of our experiment. Then we use the Markov Chain Monte Carlo algorithm for sampling and saw how our posterior distribution updates with the input of new outcomes.
+First we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution. Later we explained the concept of sampling and why we used it to make an update on our beliefs.Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be an uniform distribution and the likelihood to have a binomial one. So we sampled our model with the Markov Chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
 
-Finally, we experiment on how changes in our prior distributions affect the results we obtain"
+Finally we repeated the experiment but this time we set our prior probability to have a beta distribution centered around 0.5, and saw how this affected the results of the model.
+"

diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html
index 4b4d9039..404d194c 100644
--- a/docs/05_prob_prog_intro.jl.html
+++ b/docs/05_prob_prog_intro.jl.html
@@ -149,12 +149,12 @@
[regenerated HTML of the rendered notebook; its text mirrors the notebook content shown above]

From 9b338115e006e576f554dc4e00c0f60335f22563 Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Wed, 10 Mar 2021 17:20:06 -0300
Subject: [PATCH 03/13] Added Enter after each Point

---
 05_prob_prog_intro/05_prob_prog_intro.jl |   12 +-
 docs/05_prob_prog_intro.jl.html          | 3672 +++++++++++-----------
 2 files changed, 1844 insertions(+), 1840 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index b7912a3b..adbe7561 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -237,11 +237,15 @@ md"So in this case, incorporating our beliefs in the prior distribution we saw t
 # ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
 md" ### Summary
 
-In this chapter we gave an introduction to Probabilistic Programming Languages exploring the classic coin flipping example in a Bayesian way.
+In this chapter we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way.
+
+First we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution.
+Later we explained the concept of sampling and why we used it to make an update on our beliefs.
+Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be a uniform distribution and the likelihood to have a binomial one.
+So we sampled our model with the Markov Chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
+
+Finally, we repeated the experiment but this time we set our prior probability to have a beta distribution centered around 0.5, and saw how this affected the results of the model.
 
-First we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution. Later we explained the concept of sampling and why we used it to make an update on our beliefs.Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be an uniform distribution and the likelihood to have a binomial one. So we sampled our model with the Markov Chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
-
-Finally we repeated the experiment but this time we set our prior probability to have a beta distribution centered around 0.5, and saw how this affected the results of the model.
 "

 # ╔═╡ 95efe302-35a4-11eb-17ac-3f0ad66fb164
 md"
diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html
index 404d194c..4969d010 100644
--- a/docs/05_prob_prog_intro.jl.html
+++ b/docs/05_prob_prog_intro.jl.html
@@ -149,12 +149,12 @@
[regenerated HTML of the rendered notebook; its text mirrors the notebook content shown above]

Probabilistic Programming

-
6.3 μs

In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.

+

Probabilistic Programming

+
9.6 μs

In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.

We will start this chapter discussing the fundamentals of another useful tool, that is, Probabilisti Programming, and more specifically, Probabilistic Programming Languages or PPL's. These are systems, usually embedded inside some programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solving them automatically.

In Julia, there are a few PPL's being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing in some examples to explain the general approach when using this tools.

-
1.8 ms

Coin flipping example

-
9.1 μs

We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.

+
1.4 ms

Coin flipping example

+
6.7 μs

We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:

  • Is getting heads as likely as getting tails?

    @@ -164,2006 +164,2030 @@

To answer these questions we are going to build a simple model, with the help of our probabilistic programming languages in Julia.

Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lay between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about p. This total uncertainty is also some kind of information we can incorporate in our model. How? Because we can assign equal probability for each value of p between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a Uniform prior distribution for p, and the domain of this function will be between 0 and 1.

-
2.8 ms
+
2.7 ms
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
18.7 s

So how do we model the outcomes of flipping a coin?

+
18.9 s

So how do we model the outcomes of flipping a coin?

Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability p of success –probability of heads–, these are called Bernoulli trials. The experiment of performing a number N of Bernoulli trials, gives us the so called Binomial distribution. For a fixed value of N and p, the Bernoulli distribution gives us the probability of obtaining different number of heads (and tails too, if we know the total number of trials and the number of times we got heads, we know that the remaining number of times we got tails). Here, N and p are the parameters of our distribution.

-
12.7 μs
10.5 ms
51.4 μs
+
13.3 μs
11.7 ms
47.4 μs
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
597 ms

So, as we said, we are going to assume that our data (the count number of heads) is generated by a Binomial distribution. Here, N will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with p? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply.

+
604 ms

So, as we said, we are going to assume that our data (the count number of heads) is generated by a Binomial distribution. Here, N will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with p? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply.

When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function. This function just tells us, given some chosen value of p, how likely it is that our data is generated by the Bernoulli distribution with that value of p. How do we choose p? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of p samples taken from our prior distribution, gives us the posterior distribution of p given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov Chain Monte Carlo (MCMC) algorithms. The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner.

-
12.1 μs

The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro @model previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol.

-
10.3 μs
20.0 s
coinflip (generic function with 1 method)
73.4 μs

coinflip receives the N outcomes of our flips, an array of lenght N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. The idea is that with each new value of outcome, the model will be updating its believes about the parameter p and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter p, a success probability, shared for all the outcomes.

+
6.7 μs

The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro @model previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol.

+
10.6 μs
20.5 s
coinflip (generic function with 1 method)
73.8 μs

coinflip receives the N outcomes of our flips, an array of lenght N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. The idea is that with each new value of outcome, the model will be updating its believes about the parameter p and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter p, a success probability, shared for all the outcomes.

Suppose we have run the experiment 10 times and had the outcomes:

So, we got 6 heads and 4 tails.

Now we are going to see how the model for our unknown parameter p is updated. We will start by giving the model just one input value, adding one input at a time. Finally, we will give the model all the outcome values as input.
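
A sketch of that first update (the sampler and its settings here are our assumption, not necessarily the notebook's):

using StatsPlots

# Posterior of p given only the first outcome
chain = sample(coinflip(outcome[1:1]), HMC(0.05, 10), 1000)

# The sampled values of p approximate the updated distribution
histogram(vec(chain[:p]), normalize=true, legend=false, xlabel="p")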

So below we plot the posterior distribution of p after our model updated, having seen just the first outcome: a 0 value, or tails.

How has this single outcome affected our beliefs about p?

We can see in the plot below, showing the posterior or updated distribution of p, that values of p near 0 now have more probability than before (recall that initially all values had the same probability). This makes sense: all our model has seen is a failure, so it lowers the probability of values of p that suggest high rates of success.

[Plot: posterior distribution of p after observing a single tails outcome]

Let's continue now, including the remaining outcomes, and see how the model is updated. Below we have plotted the posterior probability of p as outcomes are added and the model updates its beliefs.
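
Sticking with the same assumed sampler, one chain per growing prefix of the data does the job:

# One chain for each growing prefix of the outcomes
chains = [sample(coinflip(outcome[1:i]), HMC(0.05, 10), 1000) for i in 1:length(outcome)]

# One histogram of p per chain, then a grid of all of them
plots = [histogram(vec(c[:p]), normalize=true, legend=false, title="$i outcomes")
         for (i, c) in enumerate(chains)]
plot(plots...)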

[Plots: posterior distributions of p after each of the 10 outcomes]

We see that with each new value the model believes more and more that the value of p is far from 0 and 1, because if that were the case we would get only heads or only tails. The model prefers values of p in between, with values near 0.5 becoming more plausible with each update.

What if we wanted to include more previous knowledge about the success rate p?

Let's say we know that the value of p is near 0.5, but we are not so sure about the exact value, and we want the model to find the plausibility of the values of p. Including this knowledge, our prior distribution for p will have higher probability for values near 0.5, and low probability for values near 0 or 1. Searching again in our repertoire of distributions, one that fulfills our wishes is a Beta distribution with parameters α=2 and β=2. It is plotted below.
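
With StatsPlots the prior can be plotted straight from the distribution object (a sketch, not the original cell):

plot(Beta(2, 2), xlabel="p", ylabel="density", legend=false)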

[Plot: Beta(2,2) prior distribution for p]

Now we define again our model just changing the distribution for p, as shown:
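
Again, a minimal sketch consistent with that description:

@model function coinflip_beta_prior(y)
    # Informative prior: values of p near 0.5 are more plausible a priori
    p ~ Beta(2, 2)

    # Likelihood: unchanged from the first model
    N = length(y)
    for i in 1:N
        y[i] ~ Bernoulli(p)
    end
end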

coinflip_beta_prior (generic function with 1 method)

Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we get a better approximation of the value of p.
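
Mirroring the earlier loop (same assumed sampler settings):

chains_beta = [sample(coinflip_beta_prior(outcome[1:i]), HMC(0.05, 10), 1000)
               for i in 1:length(outcome)]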

[Plots: posterior distributions of p with the Beta(2,2) prior, adding one outcome at a time]

To illustrate the claim made before, we can compare, for example, the posterior distributions obtained with only the first 4 outcomes for both models: the one with the uniform prior and the one with the beta prior. The plots are shown below. We see that values near 0 and 1 still have high probability in the model with the uniform prior for p, while in the model with the beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that p near 0 and 1 is less probable, it catches on faster that probabilities near 0.5 are higher.
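
Using the chains built above, the side-by-side comparison could look like this (again a sketch; the index 4 picks the posteriors that have seen the first 4 outcomes):

plot(histogram(vec(chains[4][:p]), normalize=true, legend=false),
     histogram(vec(chains_beta[4][:p]), normalize=true, legend=false),
     title = ["Uniform prior, 4 outcomes" "Beta(2,2) prior, 4 outcomes"])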

[Plots: posterior of p after 4 outcomes, uniform prior vs. Beta(2,2) prior]

So in this case, by incorporating our beliefs into the prior distribution, we saw the model reach the more plausible values of p faster, needing fewer outcomes to arrive at a very similar posterior distribution. When we used a uniform prior we were being conservative, saying that we didn't know anything about p, so we assigned equal probability to all values. Sometimes this kind of distribution (a uniform distribution), called a non-informative prior, can be too conservative, and in some cases not helpful at all. It can even slow the convergence of the model toward the more plausible values of our posterior, as shown.

Summary


In this chapter, we gave an introduction to probabilistic programming languages, exploring the classic coin flipping example in a Bayesian way.

First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes, 0 or 1, it is a good idea to set our likelihood to a binomial distribution. We also learned what sampling is and saw why we use it to update our beliefs. Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one. We then sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.

Finally, we repeated the experiment but this time we set our prior probability to have a beta distribution centered around 0.5, and saw how this affected the results of the model.

References
From a9050c3cbe191c279d6be69566843318c91c1f5a Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Fri, 12 Mar 2021 11:46:29 -0300
Subject: [PATCH 04/13] corrections

---
 05_prob_prog_intro/05_prob_prog_intro.jl |    6 +-
 docs/05_prob_prog_intro.jl.html          | 3632 +++++++++++-----------
 2 files changed, 1828 insertions(+), 1810 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index adbe7561..e0d5705a 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -240,11 +240,11 @@ md" ### Summary
 In this chapter we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way.
 
 First we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution.
-Later we explained the concept of sampling and why we used it to make an update on our beliefs.
+We also learned what sampling is and saw why we use it to make an update on our beliefs.
 Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be a uniform distribution and the likelihood to have a binomial one.
-So we sampled our model with the Markov Chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
+So we sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
 
-Finally, we repeated the experiment but this time we set our prior probability to have a beta distribution centered around 0.5, and saw how this affected the results of the model.
+Finally, we created a new model with the prior probability set to a beta distribution centered on *p* equals 0.5 which gave us more accurate results.
 "

[diff of docs/05_prob_prog_intro.jl.html omitted: the regenerated notebook export repeats the text rendered above]
From 6b3018185aee2eedf616eefa9595819157549de6 Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Fri, 12 Mar 2021 12:24:57 -0300
Subject: [PATCH 05/13] added comas

---
 05_prob_prog_intro/05_prob_prog_intro.jl |    4 +-
 docs/05_prob_prog_intro.jl.html          | 3676 +++++++++++-----------
 2 files changed, 1807 insertions(+), 1873 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index e0d5705a..17860d8a 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -237,9 +237,9 @@ md"So in this case, incorporating our beliefs in the prior distribution we saw t
 # ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0
 md" ### Summary
 
-In this chapter we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way.
+In this chapter, we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way.
 
-First we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution.
+First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution.
 We also learned what sampling is and saw why we use it to make an update on our beliefs.
 Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be a uniform distribution and the likelihood to have a binomial one.
 So we sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.

[diff of docs/05_prob_prog_intro.jl.html omitted: the regenerated notebook export repeats the text rendered above]
From de15bc8777bdd25daf18a3b81be887f5d117fc4a Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Fri, 12 Mar 2021 14:18:33 -0300
Subject: [PATCH 06/13] corrections

---
 05_prob_prog_intro/05_prob_prog_intro.jl |    8 +-
 docs/05_prob_prog_intro.jl.html          | 3694 +++++++++++-----------
 2 files changed, 1902 insertions(+), 1800 deletions(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index 17860d8a..7df34eff 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -239,12 +239,12 @@ md" ### Summary
 
 In this chapter, we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way.
 
-First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to have a binomial distribution.
-We also learned what sampling is and saw why we use it to make an update on our beliefs.
-Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to be a uniform distribution and the likelihood to have a binomial one.
+First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to a binomial distribution.
+We also learned what sampling is and saw why we use it to update our beliefs.
+Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to a uniform distribution and the likelihood to a binomial one.
 So we sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.
 
-Finally, we created a new model with the prior probability set to a beta distribution centered on *p* equals 0.5 which gave us more accurate results.
+Finally, we created a new model with the prior probability set to a beta distribution centered on *p* = 0.5 which gave us more accurate results.
 "

[diff of docs/05_prob_prog_intro.jl.html omitted: the regenerated notebook export repeats the text rendered above]

Probabilistic Programming

-
7.7 μs

In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.

+

Probabilistic Programming

+
8.0 μs

In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.

We will start this chapter discussing the fundamentals of another useful tool, that is, Probabilisti Programming, and more specifically, Probabilistic Programming Languages or PPL's. These are systems, usually embedded inside some programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solving them automatically.

In Julia, there are a few PPL's being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing in some examples to explain the general approach when using this tools.

-
1.4 ms

Coin flipping example

-
6.1 μs

We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.

+
3.1 ms

Coin flipping example

+
4.8 μs

We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:

  • Is getting heads as likely as getting tails?

    @@ -164,1994 +164,2030 @@

To answer these questions we are going to build a simple model, with the help of our probabilistic programming languages in Julia.

Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lay between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about p. This total uncertainty is also some kind of information we can incorporate in our model. How? Because we can assign equal probability for each value of p between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a Uniform prior distribution for p, and the domain of this function will be between 0 and 1.

-
2.4 ms
+
2.2 ms
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
19.8 s

So how do we model the outcomes of flipping a coin?

+
20.0 s

So how do we model the outcomes of flipping a coin?

Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability p of success –probability of heads–, these are called Bernoulli trials. The experiment of performing a number N of Bernoulli trials, gives us the so called Binomial distribution. For a fixed value of N and p, the Bernoulli distribution gives us the probability of obtaining different number of heads (and tails too, if we know the total number of trials and the number of times we got heads, we know that the remaining number of times we got tails). Here, N and p are the parameters of our distribution.

[Plot: Binomial distribution]

So, as we said, we are going to assume that our data (the number of heads counted) is generated by a Binomial distribution. Here, N will be something we know: we control how we run our experiment, so we fix this parameter. The question now is: what happens with p? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely.

When we perform our experiment, the outcomes will be registered and, in conjunction with the Binomial distribution we proposed as the generator of our data, we will get our Likelihood function. This function just tells us, for some chosen value of p, how likely it is that our data was generated by a Binomial distribution with that value of p. How do we choose p? Well, actually we don't. We just let randomness make the choice. But there exist multiple types of randomness, and this is where our prior distribution comes into play: we let it make the decision, according to its particular flavor of randomness. Computing the value of this likelihood function for a large number of p samples taken from our prior distribution gives us the posterior distribution of p given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are called Markov Chain Monte Carlo (MCMC) algorithms. The computational complexity of these algorithms can get very high as the complexity of the model increases, so there is a lot of research being done on intelligent ways of sampling to compute posterior distributions more efficiently.


The model coinflip is shown below. It is implemented using the Turing.jl library, which will handle all the details about the relationship between the variables of our model, our data, and the sampling and computing. To define a model we use the macro @model before a function definition, as we have already done. The argument this function will receive is the data from our experiment. Inside the function, we must write the explicit relationships between all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution– are defined with the '~' symbol, while deterministic variables –variables that are defined deterministically by other variables– are defined with the '=' symbol.
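
A sketch of the model consistent with this description, assuming a uniform prior over p and one Bernoulli term per outcome (the argument name is an assumption):

using Turing

@model function coinflip(outcomes)
    # Prior: total uncertainty about the success probability
    p ~ Uniform(0, 1)

    # Likelihood: each flip is a Bernoulli trial with success probability p
    for i in 1:length(outcomes)
        outcomes[i] ~ Bernoulli(p)
    end
end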


coinflip receives the N outcomes of our flips, an array of length N with 0 or 1 values, 0 indicating tails and 1 indicating heads. The idea is that with each new outcome value, the model will update its beliefs about the parameter p, and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with parameter p, a success probability shared by all the outcomes.

Suppose we have run the experiment 10 times and had the outcomes:
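
One concrete assignment consistent with the text, 6 ones (heads) and 4 zeros (tails), with the first flip a tail as described below; the exact ordering is an assumption:

outcome = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]   # 6 heads, 4 tails; ordering assumed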

So, we got 6 heads and 4 tails.


Now we are going to see how the model's beliefs about our unknown parameter p are updated. We will start by giving the model just one input value, then add one input at a time. Finally, we will give the model all the outcome values as input.
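
A sketch of that first update, conditioning the model on a single outcome and drawing posterior samples (the sampler and its settings are assumptions, not necessarily the notebook's exact choices):

# Fit with just the first outcome and histogram the posterior samples of p
chain_1 = sample(coinflip(outcome[1:1]), HMC(0.05, 10), 1000)
histogram(vec(Array(chain_1[:p])), normalize=true, legend=false,
          xlabel="p", ylabel="Probability")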


Below we plot the posterior distribution of p after our model has been updated with just the first outcome: a 0 value, that is, a tail.


How has this single outcome affected our beliefs about p?

We can see in the plot below, showing the posterior (updated) distribution of p, that values of p near 0 now have more probability than before (recall that all values started with the same probability). This makes sense: all our model has seen is a failure, so it lowers the probability of values of p that suggest high rates of success.

[Plot: posterior distribution for p after one outcome]

Let's continue by including the remaining outcomes and see how the model is updated. Below we plot the posterior probability of p as we add outcomes, with the model updating its beliefs.
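
A sketch of that updating loop, refitting the model on the first i outcomes each time (same assumed sampler settings as before):

chains = [sample(coinflip(outcome[1:i]), HMC(0.05, 10), 1000) for i in 1:length(outcome)]
plots = [histogram(vec(Array(ch[:p])), legend=false, title="$i outcomes")
         for (i, ch) in enumerate(chains)]
plot(plots..., size=(900, 600))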

[Plots: posterior distributions for p as each remaining outcome is added]

We see that with each new value the model believes more and more that the value of p is far from 0 and 1, because if that were the case we would get only heads or only tails. The model prefers values of p in between, with values near 0.5 becoming more plausible with each update.

What if we wanted to include more prior knowledge about the success rate p?

Let's say we know that the value of p is near 0.5, but we are not so sure about the exact value, and we want the model to find the plausibility of the values of p. Including this knowledge, our prior distribution for p will have higher probability for values near 0.5, and low probability for values near 0 or 1. Searching again in our repertoire of distributions, one that fulfills these requirements is a Beta distribution with parameters α=2 and β=2. It is plotted below.
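
This matches the plotting cell in the notebook source:

begin
    using StatsPlots

    plot(Beta(2,2), fill=(0, .5, :dodgerblue), ylim=(0,2), legend=false, size=(450, 300))
    xlabel!("p")
    ylabel!("Probability")
    title!("Prior distribution for p")
end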

[Plot: Beta(2,2) prior distribution for p]

Now we define our model again, changing only the prior distribution for p, as shown:
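
A sketch of the redefined model, identical to coinflip except for the prior; the function name is the notebook's, while the body is an assumption consistent with the text:

@model function coinflip_beta_prior(outcomes)
    # Prior: p is probably near 0.5, but 0 and 1 are not ruled out
    p ~ Beta(2, 2)

    for i in 1:length(outcomes)
        outcomes[i] ~ Bernoulli(p)
    end
end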


Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we get a better approximation of the value of p.
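
A sketch of the refit with the new prior (same assumed sampler settings):

chains_beta = [sample(coinflip_beta_prior(outcome[1:i]), HMC(0.05, 10), 1000)
               for i in 1:length(outcome)]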

[Plots: posterior distributions for p with the Beta prior, adding one outcome at a time]

To illustrate the claim made above, we can compare the posterior distributions obtained with only the first 4 outcomes for both models, the one with the uniform prior and the one with the Beta prior. The plots are shown below. We see that some values near 0 and 1 still have high probability in the model with a uniform prior for p, while in the model with a Beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that values of p near 0 and 1 are less probable, it catches on faster that probabilities near 0.5 are higher.
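
A sketch of that comparison, assuming the fits above were kept in chains (uniform prior) and chains_beta:

plot(
    histogram(vec(Array(chains[4][:p])), legend=false, xlabel="p",
              title="Uniform prior, 4 outcomes"),
    histogram(vec(Array(chains_beta[4][:p])), legend=false, xlabel="p",
              title="Beta prior, 4 outcomes"),
)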

[Plots: posteriors after 4 outcomes – uniform prior vs. Beta prior]

So in this case, by incorporating our beliefs into the prior distribution, we saw that the model reached the more plausible values for p faster, needing fewer outcomes to arrive at a very similar posterior distribution. When we used a uniform prior, we were being conservative: we said we didn't know anything about p, so we assigned equal probability to all values. This kind of distribution, called a non-informative prior, can sometimes be too conservative, and in some cases not helpful at all. It can even slow the convergence of the model toward the more plausible values of the posterior, as shown.

Summary

In this chapter, we gave an introduction to probabilistic programming languages and explored the classic coin flipping example in a Bayesian way.

First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes, 0 or 1, it is a good idea to set our likelihood to a binomial distribution. We also learned what sampling is and saw why we use it to update our beliefs. Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one. We sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.

Finally, we created a new model with the prior probability set to a Beta distribution centered on p = 0.5, which gave us more accurate results.

+11.6 μs From 8caf57a104af57dec38cbde40125d33e463012d8 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Thu, 18 Mar 2021 16:02:07 -0300 Subject: [PATCH 07/13] corrections --- 05_prob_prog_intro/05_prob_prog_intro.jl | 6 +- docs/05_prob_prog_intro.jl.html | 3614 +++++++++++----------- 2 files changed, 1783 insertions(+), 1837 deletions(-) diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl index 7df34eff..7f7f26df 100644 --- a/05_prob_prog_intro/05_prob_prog_intro.jl +++ b/05_prob_prog_intro/05_prob_prog_intro.jl @@ -237,12 +237,12 @@ md"So in this case, incorporating our beliefs in the prior distribution we saw t # ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0 md" ### Summary -In this chapter, we gave an introduction to probabilistic programming languages exploring the classic coin flipping example in a Bayesian way. +In this chapter, we gave an introduction to probabilistic programming languages and explored the classic coin flipping example in a Bayesian way. First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to a binomial distribution. We also learned what sampling is and saw why we use it to update our beliefs. -Then we used the Julia library Turing.jl to create a probabilistic model setting our prior probability to a uniform distribution and the likelihood to a binomial one. -So we sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result. +Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one. +We sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result. Finally, we created a new model with the prior probability set to a normal distribution centered on *p* = 0.5 which gave us more accurate results. diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html index f9e33520..d55fa0f3 100644 --- a/docs/05_prob_prog_intro.jl.html +++ b/docs/05_prob_prog_intro.jl.html @@ -149,12 +149,12 @@ -

+11.2 μs From 3b11ae0c9aa2f7eac69a53566ee6e6595e63168f Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Mon, 22 Mar 2021 22:42:39 -0300 Subject: [PATCH 08/13] Add feedback msg and next chapter link --- 05_prob_prog_intro/05_prob_prog_intro.jl | 20 + docs/05_prob_prog_intro.jl.html | 3636 +++++++++++----------- 2 files changed, 1852 insertions(+), 1804 deletions(-) diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl index 7f7f26df..7a204872 100644 --- a/05_prob_prog_intro/05_prob_prog_intro.jl +++ b/05_prob_prog_intro/05_prob_prog_intro.jl @@ -255,6 +255,24 @@ md" * [Not a monad tutorial article about Soss.jl](https://notamonadtutorial.com/soss-probabilistic-programming-with-julia-6acc5add5549) " +# ╔═╡ a77d01ca-8b78-11eb-34dc-b18bf984bf22 +md" ### Give us feedback + + +This book is currently in a beta version. We are looking forward to getting feedback and criticism: + * Submit a GitHub issue **[here](https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues)**. + * Mail us to **martina.cantaro@lambdaclass.com** + +Thank you! +" + + +# ╔═╡ a8fda810-8b78-11eb-21af-d18ab85709b7 +md" +[Next chapter](https://datasciencejuliahackers.com/06_gravity.jl.html) +" + + # ╔═╡ Cell order: # ╟─9608da48-1acd-11eb-102a-27fbac1ec88c # ╟─14b8e170-1ad9-11eb-2111-3784e7029ba0 @@ -291,3 +309,5 @@ md" # ╟─f719af54-1af7-11eb-05d3-ff9aef8fb6ed # ╟─92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0 # ╟─95efe302-35a4-11eb-17ac-3f0ad66fb164 +# ╟─a77d01ca-8b78-11eb-34dc-b18bf984bf22 +# ╟─a8fda810-8b78-11eb-21af-d18ab85709b7 diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html index d55fa0f3..155bcae6 100644 --- a/docs/05_prob_prog_intro.jl.html +++ b/docs/05_prob_prog_intro.jl.html @@ -149,12 +149,12 @@ -


Give us feedback

This book is currently in a beta version. We are looking forward to getting feedback and criticism:

  • Submit a GitHub issue at https://github.com/unbalancedparentheses/data_science_in_julia_for_hackers/issues

  • Mail us at martina.cantaro@lambdaclass.com

Thank you!

Next chapter: https://datasciencejuliahackers.com/06_gravity.jl.html
8.6 μs From d8891c40bf32a48d7ab7db4d0d21995f51603697 Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Tue, 23 Mar 2021 11:23:34 -0300 Subject: [PATCH 09/13] added to do list on .jl file --- 05_prob_prog_intro/05_prob_prog_intro.jl | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl index 7a204872..c4138930 100644 --- a/05_prob_prog_intro/05_prob_prog_intro.jl +++ b/05_prob_prog_intro/05_prob_prog_intro.jl @@ -37,6 +37,14 @@ begin title!("Prior distribution for p") end +# ╔═╡ 3a8615d2-8be3-11eb-3913-b97d04bb053a +md"### To do list + +We are currently working on: + +"; + + # ╔═╡ 9608da48-1acd-11eb-102a-27fbac1ec88c md" # Probabilistic Programming" @@ -274,6 +282,7 @@ md" # ╔═╡ Cell order: +# ╟─3a8615d2-8be3-11eb-3913-b97d04bb053a # ╟─9608da48-1acd-11eb-102a-27fbac1ec88c # ╟─14b8e170-1ad9-11eb-2111-3784e7029ba0 # ╟─e8557342-1adc-11eb-3978-33880f4a97a2 From 129fd3ce1d607aa788298511f0cf479902d31c3b Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Mon, 12 Apr 2021 13:47:09 -0300 Subject: [PATCH 10/13] made corrections, html updated --- 05_prob_prog_intro/05_prob_prog_intro.jl | 66 +- docs/05_prob_prog_intro.jl.html | 3623 +++++++++++----------- 2 files changed, 1825 insertions(+), 1864 deletions(-) diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl index 7f7f26df..c25746e0 100644 --- a/05_prob_prog_intro/05_prob_prog_intro.jl +++ b/05_prob_prog_intro/05_prob_prog_intro.jl @@ -27,53 +27,54 @@ end # ╔═╡ 124ab9da-1ae4-11eb-1c1a-a3b9ac960130 using Turing -# ╔═╡ 9e8252ac-1af6-11eb-3e19-ddd43d9cd1a9 -begin - using StatsPlots - - plot(Beta(2,2), fill=(0, .5,:dodgerblue), ylim=(0,2), legend=false, size=(450, 300)) - xlabel!("p") - ylabel!("Probability") - title!("Prior distribution for p") -end - # ╔═╡ 9608da48-1acd-11eb-102a-27fbac1ec88c md" # Probabilistic Programming" # ╔═╡ 14b8e170-1ad9-11eb-2111-3784e7029ba0 md" -In the previous chapter we introduced some of the basic mathematical tools we are going to make use through the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking. +In the previous chapters we introduced some of the basic mathematical tools we are going to make use of throughout the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking. + + + +We will start this chapter by discussing the fundamentals of another useful tool, that is, probabilistic programming, and more specifically, how to apply it using probabilistic programming languages or PPLs. +These are systems, usually embedded inside a programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solving them automatically. -We will start this chapter discussing the fundamentals of another useful tool, that is, *Probabilisti Programming*, and more specifically, *Probabilistic Programming Languages* or PPL's. These are systems, usually embedded inside some programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way to define probability models and solving them automatically. +In Julia, there are a few PPLs being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing on some examples to explain the general approach when using this tools. 
-In Julia, there are a few PPL's being developed, and we will be using two of them, *Turing.jl* and *Soss.jl*. We will be focusing in some examples to explain the general approach when using this tools. " # ╔═╡ e8557342-1adc-11eb-3978-33880f4a97a2 md" #### Coin flipping example" # ╔═╡ babb3d32-1adb-11eb-2f52-7799a2031c2b -md" We are going now to tackle a well known example, just to settle some ideas: flipping a coin. But this time, from a Bayesian perspective. +md" Let's revisit the old example of flipping a coin, but from a Bayesian perspective, as a way to lay down some ideas. So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like: - Is getting heads as likely as getting tails? - Is our coin biased, preferring one output over the other? -To answer these questions we are going to build a simple model, with the help of our probabilistic programming languages in Julia. +To answer these questions we are going to build a simple model, with the help of Julia libraries that add PPL capabilities. -Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any *prior* information about the problem? Since the plausibility of getting heads is formally a probability (let's call it *$p$*), we know it must lay between 0 and 1. +Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it $p$), we know it must lay between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about $p$. This total uncertainty is also some kind of information we can incorporate in our model. How? -Because we can assign equal probability for each value of $p$ between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a *Uniform* prior distribution for $p$, and the domain of this function will be between 0 and 1. +Because we can assign equal probability for each value of $p$ between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a Uniform prior distribution for $p$, and the domain of this function will be between 0 and 1. + +Do we know anything else? +Let's skip that question for the moment and suppose we don't know anything else about *p*. +This complete uncertainty also constitutes information we can incorporate into our model. +How so? +Because we can assign equal probability to each value of *p* while assigning 0 probability to the remaining values. +This just means we don't know anything and that every outcome is equally likely. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for *p*, and the function domain will be all numbers between 0 and 1. " # ╔═╡ 10705938-1af3-11eb-27f4-a96a49a43a67 md" So how do we model the outcomes of flipping a coin? -Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability *p* of success –probability of heads–, these are called *Bernoulli* trials. 
The experiment of performing a number $N$ of Bernoulli trials, gives us the so called *Binomial distribution*. For a fixed value of $N$ and $p$, the Bernoulli distribution gives us the probability of obtaining different number of heads (and tails too, if we know the total number of trials and the number of times we got heads, we know that the remaining number of times we got tails). Here, $N$ and $p$ are the *parameters* of our distribution. +Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability *p* of success –probability of heads–, these are called Bernoulli trials. The experiment of performing a number $N$ of Bernoulli trials, gives us the so called Binomial distribution. For a fixed value of $N$ and $p$, the Bernoulli distribution gives us the probability of obtaining different number of heads (and tails too, if we know the total number of trials and the number of times we got heads, we know that the remaining number of times we got tails). Here, $N$ and $p$ are the parameters of our distribution. " # ╔═╡ 1ec4a39e-1af3-11eb-29e3-5d78e4c4613a @@ -83,17 +84,20 @@ Well, if we search for some similar type of experiment, we find that all process @bind p html"" # ╔═╡ 69ad2e2e-1af3-11eb-09e2-bd8e04190c9d -bar(Binomial(N,p), xlim=(0, 90), label=false, xlabel="Succeses", ylabel="Probability", title="Binomial distribution", color="green", alpha=0.8, size=(500, 350)) +begin + using StatsPlots + bar(Binomial(N,p), xlim=(0, 90), label=false, xlabel="Succeses", ylabel="Probability", title="Binomial distribution", color="green", alpha=0.8, size=(500, 350)) +end # ╔═╡ 889c04e2-1ae2-11eb-34c8-7b5a98d73676 md" So, as we said, we are going to assume that our data (the count number of heads) is generated by a Binomial distribution. Here, $N$ will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with $p$? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply. -When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our *Likelihood* function. This function just tells us, given some chosen value of $p$, how likely it is that our data is generated by the Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution, gives us the posterior distribution of $p$ given our data. This is called **sampling**, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov Chain Monte Carlo (MCMC) algorithms. 
The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner. +When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function. This function just tells us, given some chosen value of $p$, how likely it is that our data is generated by the Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution, gives us the posterior distribution of $p$ given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov Chain Monte Carlo (MCMC) algorithms. The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner. " # ╔═╡ 102b4be2-1ae4-11eb-049d-470a33703b49 -md"The model *coinflip* is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro *@model* previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. +md"The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro `@model` previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol. " @@ -115,7 +119,7 @@ end # ╔═╡ cbe1d1f2-1af4-11eb-0be3-b1a02280acf9 md" -*coinflip* receives the *N* outcomes of our flips, an array of lenght *N* with 0 or 1 values, 0 values indicating tails and 1 indicating heads. +coinflip receives the N outcomes of our flips, an array of lenght N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. 
The idea is that with each new value of outcome, the model will be updating its believes about the parameter *p* and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter *p*, a success probability, shared for all the outcomes. Suppose we have run the experiment 10 times and had the outcomes: @@ -142,11 +146,11 @@ end; # ╔═╡ fa37b106-1af5-11eb-11df-97aada833767 md" -So now we plot below the posterior distribution of p after our model updated, seeing just the first outcome, a 0 value or a tail. +So now we plot below the posterior distribution of *p* after our model updated, seeing just the first outcome, a 0 value or a tail. -How this single outcome have affected our beliefs about p? +How this single outcome have affected our beliefs about *p*? -We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a faliure, so it lowers the probability for values of p that suggest high rates of success. +We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a faliure, so it lowers the probability for values of *p* that suggest high rates of success. " # ╔═╡ 0c570210-1af6-11eb-1d5d-5f78f2a000fd @@ -187,6 +191,16 @@ md"What if we wanted to include more previous knowledge about the success rate * Let's say we know that the value of *p* is near 0.5 but we are not so sure about the exact value, and we want the model to find the plausibility for the values of *p*. Then including this knowledge, our prior distribution for *p* will have higher probability for values near 0.5, and low probability for values near 0 or 1. Seaching again in our repertoire of distributions, one that fulfill our wishes is a Beta distribution with parameters α=2 and β=2. It is ploted below." +# ╔═╡ 9e8252ac-1af6-11eb-3e19-ddd43d9cd1a9 +begin + + + plot(Beta(2,2), fill=(0, .5,:dodgerblue), ylim=(0,2), legend=false, size=(450, 300)) + xlabel!("p") + ylabel!("Probability") + title!("Prior distribution for p") +end + # ╔═╡ d0f945f6-1af6-11eb-1f99-e79e7de8af80 md"Now we define again our model just changing the distribution for *p*, as shown:" @@ -207,7 +221,7 @@ begin end # ╔═╡ dc35ef50-1af6-11eb-2f77-d10aee2744dd -md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximations for the value of p." +md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximations for the value of *p*." # ╔═╡ b924b40a-1af7-11eb-13a9-0368b23ab0a4 begin diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html index d55fa0f3..e9684e7b 100644 --- a/docs/05_prob_prog_intro.jl.html +++ b/docs/05_prob_prog_intro.jl.html @@ -149,12 +149,12 @@ -


Probabilistic Programming


In the previous chapters we introduced some of the basic mathematical tools we are going to make use of throughout the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.


We will start this chapter by discussing the fundamentals of another useful tool, that is, probabilistic programming, and more specifically, how to apply it using probabilistic programming languages or PPLs. These are systems, usually embedded inside a programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way of defining probability models and solving them automatically.


In Julia, there are a few PPLs being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing on some examples to explain the general approach when using these tools.


Coin flipping example


Let's revisit the old example of flipping a coin, but from a Bayesian perspective, as a way to lay down some ideas.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:

  • Is getting heads as likely as getting tails?

    @@ -162,1996 +162,1973 @@
  • Is our coin biased, preferring one output over the other?


To answer these questions we are going to build a simple model, with the help of Julia libraries that add PPL capabilities.


Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lie between 0 and 1.


Do we know anything else? Let's skip that question for the moment and suppose we don't know anything else about p. This complete uncertainty also constitutes information we can incorporate into our model. How so? Because we can assign equal probability to each value of p while assigning 0 probability to the remaining values. This just means we don't know anything and that every outcome is equally likely. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for p, and the function domain will be all numbers between 0 and 1.
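To make the prior concrete, here is a minimal sketch using Distributions.jl (the package Turing.jl builds on); the variable name `prior` is ours, for illustration only:

using Distributions

prior = Uniform(0, 1)   # every value of p between 0 and 1 is equally likely

pdf(prior, 0.3)   # density is 1.0 anywhere inside [0, 1]
pdf(prior, 1.7)   # and 0.0 outside the domain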

- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


So how do we model the outcomes of flipping a coin?

+

Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability p of success –probability of heads–, these are called Bernoulli trials. The experiment of performing a number N of Bernoulli trials gives us the so called binomial distribution. For a fixed value of N and p, the binomial distribution gives us the probability of obtaining each possible number of heads (and of tails too: if we know the total number of trials and the number of times we got heads, we know the remaining number of times we got tails). Here, N and p are the parameters of our distribution.
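As a quick sanity check of these parameters, a small sketch with Distributions.jl; the values N = 10 and p = 0.5 are illustrative assumptions:

using Distributions

N, p = 10, 0.5
B = Binomial(N, p)

pdf(B, 6)   # probability of exactly 6 heads in 10 flips, ≈ 0.205
mean(B)     # expected number of heads: N * p = 5.0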

- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


+
3.8 ms

So, as we said, we are going to assume that our data (the count number of heads) is generated by a Binomial distribution. Here, N will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with p? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply.

+

When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function. This function just tells us, given some chosen value of p, how likely it is that our data is generated by the Bernoulli distribution with that value of p. How do we choose p? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of p samples taken from our prior distribution, gives us the posterior distribution of p given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov Chain Monte Carlo (MCMC) algorithms. The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner.

+
4.7 μs

The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro @model previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol.

+
8.9 μs
21.6 s
coinflip (generic function with 1 method)
63.4 μs

coinflip receives the N outcomes of our flips, an array of lenght N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. The idea is that with each new value of outcome, the model will be updating its believes about the parameter p and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter p, a success probability, shared for all the outcomes.

Suppose we have run the experiment 10 times and had the outcomes:

-
8.1 μs
outcome
1.3 μs

So, we got 6 heads and 4 tails.

+
9.5 μs
outcome
1.2 μs

So, we got 6 heads and 4 tails.

Now we are going to see how the model updates its beliefs about our unknown parameter p. We will start by giving just one input value to the model, adding one input at a time. Finally, we will give the model all the outcome values as input.
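A hypothetical invocation of one such update, assuming the coinflip model and the outcome vector from above; MH() is Turing's Metropolis-Hastings sampler, one member of the MCMC family mentioned earlier (the notebook's exact sampler settings may differ):

using Turing, StatsPlots

chain = sample(coinflip(outcome[1:1]), MH(), 1000)   # condition on the first flip only
plot(chain)                                          # trace and density of the sampled p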

-
5.1 μs
1.9 s

So now we plot below the posterior distribution of p after our model updated, seeing just the first outcome, a 0 value or a tail.

-

How this single outcome have affected our beliefs about p?

-

We can see in the plot below, showing the posterior or updated distribution of p, that the values of p near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a faliure, so it lowers the probability for values of p that suggest high rates of success.

-
6.0 μs
+
5.7 μs
2.0 s

So now we plot below the posterior distribution of p after our model has updated, having seen just the first outcome: a 0 value, that is, tails.

+

How has this single outcome affected our beliefs about p?

+

We can see in the plot below, showing the posterior or updated distribution of p, that the values of p near 0 now have more probability than before (recall that all values started out equally probable). This makes sense: all our model has seen so far is a failure, so it lowers the probability of values of p that suggest high rates of success.

+
26.9 μs
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2.2 ms

Let's continue by including the remaining outcomes and see how the model is updated. We have plotted below the posterior probability of p, adding outcomes to our model as it updates its beliefs.

-
4.1 μs
3.0 s
+
2.3 ms

Let's continue now including the remainig outcomes and see how the model is updated. We have plotted below the posterior probability of p adding outcomes to our model updating its beliefs.

+
6.1 μs
3.1 s
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - + - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + - + - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
15.2 ms

We see that with each new value the model believes more and more that the value of p is far from 0 or 1, because if it was the case we would have only heads or tails. The model prefers values the of p in between, being the values near 0.5 more plausible with each update.

-
4.9 μs

What if we wanted to include more previous knowledge about the success rate p?

+
27.8 ms

We see that with each new value the model believes more and more that the value of p is far from 0 and 1, because if that were the case we would get only heads or only tails. The model prefers values of p in between, with values near 0.5 becoming more plausible with each update.

+
5.2 μs

What if we wanted to include more previous knowledge about the success rate p?

Let's say we know that the value of p is near 0.5, but we are not so sure about the exact value, and we want the model to find the plausibility of the values of p. Including this knowledge, our prior distribution for p will have higher probability for values near 0.5, and low probability for values near 0 or 1. Searching again in our repertoire of distributions, one that fulfills our wishes is a beta distribution with parameters α=2 and β=2. It is plotted below.
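A minimal sketch of this prior, again with Distributions.jl; the plotting options are illustrative, not the notebook's exact ones:

using Distributions, StatsPlots

beta_prior = Beta(2, 2)
mean(beta_prior)   # 0.5, centered on a fair coin
plot(beta_prior, legend=false, xlabel="p", ylabel="Probability density")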

-
7.3 μs
+
7.8 μs
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - -
17.8 s

Now we define again our model just changing the distribution for p, as shown:

-
4.1 μs
coinflip_beta_prior (generic function with 1 method)
89.2 μs

Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximations for the value of p.

-
2.7 μs
2.7 s
+
2.9 ms

Now we define again our model just changing the distribution for p, as shown:
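A sketch of what that redefinition could look like, reusing the structure of the coinflip model above (assuming Turing.jl is loaded); only the prior line changes:

@model coinflip_beta_prior(y) = begin
    p ~ Beta(2, 2)              # informative prior concentrated around 0.5
    N = length(y)
    for i in 1:N
        y[i] ~ Bernoulli(p)     # the likelihood is unchanged
    end
end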

+
4.2 μs
coinflip_beta_prior (generic function with 1 method)
75.2 μs

Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we get a better approximation of the value of p.

+
13.2 μs
2.9 s
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
15.3 ms

To illustrate the affirmation made before, we can compare for example the posterior distributions obtained only with the first 4 outcomes for both models, the one with a uniform prior and the other with the beta prior. The plots are shown below. We see that some values near 0 and 1 have still high probability for the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that p near 0 and 1 have less probability, it catchs up faster that probabilities near 0.5 are higher.

+
16.3 ms

To illustrate the point made before, we can compare the posterior distributions obtained with only the first 4 outcomes for both models, the one with a uniform prior and the one with the beta prior. The plots are shown below. We see that some values near 0 and 1 still have high probability for the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that p near 0 and 1 have less probability, it catches on faster that probabilities near 0.5 are higher.

4.5 μs
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - - - - - - - - + + - - - - - - - - - - - - - - - - - - -
746 μs

So in this case, incorporating our beliefs in the prior distribution we saw the model reached faster the more plausible values for p, needing less outcomes to reach a very similar posterior distribution. When we used an uniform prior, we were conservative, meaning that we said we didn't know anything about p so we assign equal probability for all values. Sometimes these kind of distribution (uniform distributions), called a non-informative prior, can be maybe too conservative, being in some cases not helpful at all. They even can slow the convergence of the model to the more plausible values for our posterior, as shown.

-
6.3 μs

Summary

+
571 μs

So in this case, by incorporating our beliefs in the prior distribution, we saw that the model reached the more plausible values of p faster, needing fewer outcomes to arrive at a very similar posterior distribution. When we used a uniform prior, we were being conservative, meaning we said we didn't know anything about p, so we assigned equal probability to all values. This kind of distribution (a uniform distribution), called a non-informative prior, can sometimes be too conservative, in some cases not being helpful at all. It can even slow the convergence of the model to the more plausible values of our posterior, as shown.

+
12.4 μs

Summary

In this chapter, we gave an introduction to probabilistic programming languages and explored the classic coin flipping example in a Bayesian way.

First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes 0 or 1, it is a good idea to set our likelihood to a binomial distribution. We also learned what sampling is and saw why we use it to update our beliefs. Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one. We sampled our model with the Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we input a new coin flip result.

Finally, we created a new model with the prior probability set to a beta distribution centered on p = 0.5, which gave us more accurate results.

-
35.1 μs

References

+
13.1 μs
11.2 μs
+12.7 μs From 6027f90ebc385230c0f176c6f4991a5db665729d Mon Sep 17 00:00:00 2001 From: Pedro Fontana Date: Mon, 12 Apr 2021 17:41:21 -0300 Subject: [PATCH 11/13] fixed typos, added github ribbons --- 05_prob_prog_intro/05_prob_prog_intro.jl | 30 +- docs/05_prob_prog_intro.jl.html | 3704 +++++++++++----------- 2 files changed, 1877 insertions(+), 1857 deletions(-) diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl index b1837674..e932fbca 100644 --- a/05_prob_prog_intro/05_prob_prog_intro.jl +++ b/05_prob_prog_intro/05_prob_prog_intro.jl @@ -70,7 +70,7 @@ Let's start thinking in a Bayesian way. The first thing we should ask ourselves Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about $p$. This total uncertainty is also some kind of information we can incorporate in our model. How? -Because we can assign equal probability for each value of $p$ between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for $p$, and the domain of this function will be between 0 and 1. +Because we can assign equal probability for each value of $p$ between 0 and 1, while assigning 0 probability for the remaining values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for $p$, and the domain of this function will be between 0 and 1. Do we know anything else? Let's skip that question for the moment and suppose we don't know anything else about *p*. @@ -103,11 +103,11 @@ end md" So, as we said, we are going to assume that our data (the count number of heads) is generated by a binomial distribution. Here, $N$ will be something we know. We control how we are going to make our experiment, and because of this, we fix this parameter. The question now is: what happens with $p$? Well, this is actually what we want to estimate! The only thing we know until now is that it is some value from 0 to 1, and every value in that range is equally likely to apply. -When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function. This function just tells us, given some chosen value of $p$, how likely it is that our data is generated by the Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution, gives us the posterior distribution of $p$ given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov chain Monte Carlo (MCMC) algorithms. 
The computing complexity of this algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner. +When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function. This function just tells us, given some chosen value of $p$, how likely it is that our data is generated by the Bernoulli distribution with that value of $p$. How do we choose $p$? Well, actually we don't. You just let randomness make it's choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it's particular randomness flavor. Computing the value of this likelihood function for a big number of $p$ samples taken from our prior distribution, gives us the posterior distribution of $p$ given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are named Markov chain Monte Carlo (MCMC) algorithms. The computing complexity of these algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions in a more efficient manner. " # ╔═╡ 102b4be2-1ae4-11eb-049d-470a33703b49 -md"The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro `@model` previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of the all the variables involved in a logical way. +md"The model coinflip is shown below. It is implemented using the Turing.jl library, which will be handling all the details about the relationship between the variables of our model, our data and the sampling and computing. To define a model we use the macro `@model` previous to a function definition as we have already done. The argument that this function will recieve is the data from our experiment. Inside this function, we must write the explicit relationship of all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution–, are defined with a '~' symbol, while deterministic variables –variables that are defined deterministically by other variables–, are defined with a '=' symbol. " @@ -129,8 +129,8 @@ end # ╔═╡ cbe1d1f2-1af4-11eb-0be3-b1a02280acf9 md" -coinflip receives the N outcomes of our flips, an array of lenght N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. -The idea is that with each new value of outcome, the model will be updating its believes about the parameter *p* and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter *p*, a success probability, shared for all the outcomes. 
+coinflip receives the N outcomes of our flips, an array of length N with 0 or 1 values, 0 values indicating tails and 1 indicating heads. +The idea is that with each new value of outcome, the model will be updating its beliefs about the parameter *p* and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with a parameter *p*, a success probability, shared for all the outcomes. Suppose we have run the experiment 10 times and had the outcomes: " @@ -158,9 +158,9 @@ end; md" So now we plot below the posterior distribution of *p* after our model updated, seeing just the first outcome, a 0 value or a tail. -How this single outcome have affected our beliefs about *p*? +How this single outcome affected our beliefs about *p*? -We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a faliure, so it lowers the probability for values of *p* that suggest high rates of success. +We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a failure, so it lowers the probability for values of *p* that suggest high rates of success. " # ╔═╡ 0c570210-1af6-11eb-1d5d-5f78f2a000fd @@ -174,7 +174,7 @@ end # ╔═╡ 44037220-1af6-11eb-0fee-bbc3f71f6c08 md" -Let's continue now including the remainig outcomes and see how the model is updated. We have plotted below the posterior probability of *p* adding outcomes to our model updating its beliefs. +Let's continue now including the remaining outcomes and see how the model is updated. We have plotted below the posterior probability of *p* adding outcomes to our model updating its beliefs. " # ╔═╡ 5280080e-1af6-11eb-3137-75116ca79102 @@ -194,12 +194,13 @@ begin end # ╔═╡ 849c78fe-1af6-11eb-20f0-df587758e966 -md" We see that with each new value the model believes more and more that the value of *p* is far from 0 or 1, because if it was the case we would have only heads or tails. The model prefers values the of *p* in between, being the values near 0.5 more plausible with each update." +md" We see that with each new value the model believes more and more that the value of *p* is far from 0 or 1, because if it was the case we would have only heads or tails. The model prefers values of *p* in between, being the values near 0.5 more plausible with each update." # ╔═╡ 9150c71e-1af6-11eb-1036-8b1b45ed95c4 md"What if we wanted to include more previous knowledge about the success rate *p*? -Let's say we know that the value of *p* is near 0.5 but we are not so sure about the exact value, and we want the model to find the plausibility for the values of *p*. Then including this knowledge, our prior distribution for *p* will have higher probability for values near 0.5, and low probability for values near 0 or 1. Seaching again in our repertoire of distributions, one that fulfill our wishes is a beta distribution with parameters α=2 and β=2. It is ploted below." +Let's say we know that the value of *p* is near 0.5 but we are not so sure about the exact value, and we want the model to find the plausibility for the values of *p*. 
Then including this knowledge, our prior distribution for *p* will have higher probability for values near 0.5, and low probability for values near 0 or 1. +Searching again in our repertoire of distributions, one that fulfills our wishes is a beta distribution with parameters α=2 and β=2. It is plotted below." # ╔═╡ 9e8252ac-1af6-11eb-3e19-ddd43d9cd1a9 begin @@ -231,7 +232,7 @@ begin end # ╔═╡ dc35ef50-1af6-11eb-2f77-d10aee2744dd -md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximations for the value of *p*." +md"Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with less examples we have a better approximation of *p*." # ╔═╡ b924b40a-1af7-11eb-13a9-0368b23ab0a4 begin @@ -250,13 +251,16 @@ begin end # ╔═╡ e407fa56-1af7-11eb-18c2-79423a9e4135 -md" To illustrate the affirmation made before, we can compare for example the posterior distributions obtained only with the first 4 outcomes for both models, the one with a uniform prior and the other with the beta prior. The plots are shown below. We see that some values near 0 and 1 have still high probability for the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that *p* near 0 and 1 have less probability, it catchs up faster that probabilities near 0.5 are higher." +md" To illustrate the affirmation made before, we can compare for example the posterior distributions obtained only with the first 4 outcomes for both models, the one with a uniform prior and the other with the beta prior. The plots are shown below. We see that some values near 0 and 1 have still high probability for the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that *p* near 0 and 1 have less probability, it catches up faster that probabilities near 0.5 are higher." # ╔═╡ efa0b506-1af7-11eb-2a9a-cb08f7f2d715 plot(plots[3], plots_[3], title = ["Posterior for uniform prior and 4 outcomes" "Posterior for beta prior and 4 outcomes"], titleloc = :center, titlefont = font(8), layout=2, size=(450, 300)) # ╔═╡ f719af54-1af7-11eb-05d3-ff9aef8fb6ed -md"So in this case, incorporating our beliefs in the prior distribution we saw the model reached faster the more plausible values for *p*, needing less outcomes to reach a very similar posterior distribution. When we used an uniform prior, we were conservative, meaning that we said we didn't know anything about *p* so we assign equal probability for all values. Sometimes these kind of distribution (uniform distributions), called a non-informative prior, can be maybe too conservative, being in some cases not helpful at all. They even can slow the convergence of the model to the more plausible values for our posterior, as shown." +md"So in this case, incorporating our beliefs in the prior distribution we saw the model reached faster the more plausible values for *p*, needing less outcomes to reach a very similar posterior distribution. +When we used a uniform prior, we were conservative, meaning that we said we didn't know anything about *p* so we assign equal probability for all values. 
+Sometimes this kind of distribution (uniform distributions), called a non-informative prior, can be too conservative, being in some cases not helpful at all. +They even can slow the convergence of the model to the more plausible values for our posterior, as shown." # ╔═╡ 92a7cfaa-1a2e-11eb-06f2-f50e91cfbba0 md" ### Summary diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html index 151960b4..caddc755 100644 --- a/docs/05_prob_prog_intro.jl.html +++ b/docs/05_prob_prog_intro.jl.html @@ -9,6 +9,7 @@ + -


Probabilistic Programming

+

In the previous chapters we introduced some of the basic mathematical tools we are going to make use of throughout the book. We talked about histograms, probability, probability distributions and the Bayesian way of thinking.

We will start this chapter by discussing the fundamentals of another useful tool, that is, probabilistic programming, and more specifically, how to apply it using probabilistic programming languages or PPLs. These are systems, usually embedded inside a programming language, that are designed for building and reasoning about Bayesian models. They offer scientists an easy way of defining probability models and solving them automatically.

In Julia, there are a few PPLs being developed, and we will be using two of them, Turing.jl and Soss.jl. We will be focusing on some examples to explain the general approach when using these tools.

-
8.1 μs

Coin flipping example

-
7.5 μs

Let's revisit the old example of flipping a coin, but from a Bayesian perspective, as a way to lay down some ideas.

+
8.2 μs

Coin flipping example

+
6.3 μs

Let's revisit the old example of flipping a coin, but from a Bayesian perspective, as a way to lay down some ideas.

So the problem goes like this: Suppose we flip a coin N times, and we ask ourselves some questions like:

  • Is getting heads as likely as getting tails?

    @@ -163,1990 +167,1984 @@

To answer these questions we are going to build a simple model, with the help of Julia libraries that add PPL capabilities.

-

Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lay between 0 and 1. Do we know anything more? Let's skip that question for the moment and suppose we don't know anything more about p. This total uncertainty is also some kind of information we can incorporate in our model. How? Because we can assign equal probability for each value of p between 0 and 1, while assigning 0 probability for the remaing values. This just means we don't know anything and that every outcome is equally possible. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for p, and the domain of this function will be between 0 and 1.

+

Let's start thinking in a Bayesian way. The first thing we should ask ourselves is: Do we have any prior information about the problem? Since the plausibility of getting heads is formally a probability (let's call it p), we know it must lie between 0 and 1.

Do we know anything else? Let's skip that question for the moment and suppose we don't know anything else about p. This complete uncertainty also constitutes information we can incorporate into our model. How so? Because we can assign equal probability to each value of p while assigning 0 probability to the remaining values. This just means we don't know anything and that every outcome is equally likely. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for p, and the function domain will be all numbers between 0 and 1.

-
4.8 ms
+
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

So how do we model the outcomes of flipping a coin?

Well, if we search for some similar type of experiment, we find that all processes in which we have two possible outcomes –heads or tails in our case–, and some probability p of success –probability of heads–, these are called Bernoulli trials. The experiment of performing a number N of Bernoulli trials gives us the so called binomial distribution. For a fixed value of N and p, the binomial distribution gives us the probability of obtaining each possible number of heads (and of tails too: if we know the total number of trials and the number of times we got heads, we know the remaining number of times we got tails). Here, N and p are the parameters of our distribution.

-
- + - - + - - + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

+
613 ms

So, as we said, we are going to assume that our data (the number of heads we count) is generated by a binomial distribution. Here, N will be something we know: we control how we run our experiment, and because of this, we fix this parameter. The question now is: what happens with p? Well, this is actually what we want to estimate! The only thing we know so far is that it is some value from 0 to 1, and every value in that range is equally likely.


When we perform our experiment, the outcomes will be registered and, in conjunction with the Bernoulli distribution we proposed as the generator of our data, will give us our likelihood function. This function just tells us, for some chosen value of p, how likely it is that our data was generated by a Bernoulli distribution with that value of p. How do we choose p? Well, actually, we don't: we let randomness make the choice. But there exist multiple types of randomness, and this is where our prior distribution comes into play: we let the prior make the decision, according to its particular flavor of randomness. Computing the value of this likelihood function for a large number of p samples taken from our prior distribution gives us the posterior distribution of p given our data. This is called sampling, and it is a very important concept in Bayesian statistics and probabilistic programming, as it is one of the fundamental tools that makes all the magic work under the hood. It is the method that actually lets us update our beliefs. The general family of algorithms that follow the steps we have mentioned are called Markov chain Monte Carlo (MCMC) algorithms. The computational complexity of these algorithms can get very high as the complexity of the model increases, so there is a lot of research being done to find intelligent ways of sampling to compute posterior distributions more efficiently.


The model coinflip is shown below. It is implemented using the Turing.jl library, which handles all the details about the relationships between the variables of our model, our data, and the sampling and computing. To define a model we place the macro @model before a function definition. The argument this function receives is the data from our experiment. Inside the function, we must write the explicit relationships between all the variables involved in a logical way. Stochastic variables –variables that are obtained randomly, following a probability distribution– are defined with the '~' symbol, while deterministic variables –variables that are defined deterministically by other variables– are defined with the '=' symbol.
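
The code cell itself was lost in this rendered diff; based on the description above (a uniform prior and a Bernoulli likelihood), a minimal sketch of the model would look like this:

using Turing

@model function coinflip(y)
    # Prior: total uncertainty about p, every value in [0, 1] equally likely.
    p ~ Uniform(0, 1)

    # Likelihood: each registered outcome is a Bernoulli trial
    # with success probability p.
    N = length(y)
    for n in 1:N
        y[n] ~ Bernoulli(p)
    end
end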

coinflip (generic function with 1 method)

coinflip receives the N outcomes of our flips: an array of length N with values 0 or 1, where 0 indicates tails and 1 indicates heads. The idea is that with each new outcome value the model updates its beliefs about the parameter p, and this is what the for loop is doing: we are saying that each outcome comes from a Bernoulli distribution with parameter p, a success probability shared by all the outcomes.

Suppose we have run the experiment 10 times and had the outcomes:
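
The exact sequence is not preserved in this diff; a hypothetical ordering, consistent with what follows (6 heads, 4 tails, starting with a tail), might be:

# Hypothetical ordering: only the totals and the first value are recoverable here;
# the notebook's actual sequence may differ.
outcome = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1]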


So, we got 6 heads and 4 tails.

Now we are going to see how the model's beliefs about our unknown parameter p are updated. We will start by giving the model just one input value, then add one input at a time. Finally, we will give the model all the outcome values as input.
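
The sampling cells do not survive in this diff either; here is a sketch of the first step, where the sampler (MH) and the number of samples are assumptions rather than the notebook's exact settings:

# Sample the posterior of p given only the first outcome.
chain_first = sample(coinflip(outcome[1:1]), MH(), 1000)

# The sampled values of p approximate the posterior distribution.
histogram(Array(chain_first[:p]), normalize=true, legend=false, xlabel="p")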


So now we plot below the posterior distribution of p after our model has been updated with just the first outcome: a 0 value, that is, a tail.


How has this single outcome affected our beliefs about p?


We can see in the plot below, showing the posterior or updated distribution of p, that values of p near 0 now have more probability than before (recall that all values started with the same probability). This makes sense: all our model has seen is a failure, so it lowers the probability of values of p that suggest high rates of success.

[Plot: posterior distribution of p after observing the first outcome, a tail]

Let's now continue by including the remaining outcomes and see how the model is updated. Below we have plotted the posterior probability of p as we add outcomes to our model, updating its beliefs.

+
[Plot: posterior distributions of p as the outcomes are added one at a time]

We see that with each new value the model grows more confident that the value of p is far from 0 and 1: if it were near those extremes, we would observe almost only heads or only tails. The model prefers values of p in between, with values near 0.5 becoming more plausible with each update.


What if we wanted to include more prior knowledge about the success rate p?


Let's say we know the value of p is near 0.5, but we are not so sure about the exact value, and we want the model to estimate the plausibility of each value of p. Incorporating this knowledge, our prior distribution for p should place higher probability on values near 0.5 and low probability on values near 0 or 1. Searching our repertoire of distributions again, one that fulfills these requirements is the beta distribution with parameters α=2 and β=2. It is plotted below.
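
A sketch of one way to draw it, using the StatsPlots.jl plotting recipe for distributions:

# Beta(2, 2): symmetric around 0.5, with vanishing density at 0 and 1.
plot(Beta(2, 2), legend=false, xlabel="p", ylabel="Probability density")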


Now we define our model again, changing only the prior distribution for p, as shown:
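
This cell's code is also lost in the rendered diff; assuming it mirrors coinflip with only the prior changed, a sketch would be:

@model function coinflip_beta_prior(y)
    # Informative prior: values of p near 0.5 are more probable a priori.
    p ~ Beta(2, 2)

    # Same likelihood as before.
    N = length(y)
    for n in 1:N
        y[n] ~ Bernoulli(p)
    end
end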

coinflip_beta_prior (generic function with 1 method)

Running the new model and plotting the posterior distributions, again adding one observation at a time, we see that with fewer examples we obtain a better approximation of p.

[Plot: posterior distributions of p under the Beta(2, 2) prior, adding one outcome at a time]

To illustrate the claim made before, we can compare the posterior distributions obtained with only the first 4 outcomes for both models: the one with the uniform prior and the other with the beta prior. The plots are shown below. Some values near 0 and 1 still have high probability in the model with a uniform prior for p, while in the model with a beta prior the values near 0.5 have higher probability. That's because if we tell the model from the beginning that values of p near 0 and 1 are less probable, it learns faster that probabilities near 0.5 are higher.
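
A hypothetical way to produce the comparison; the sampler, sample counts, and variable names here are assumptions, not the notebook's code:

# Posteriors after the first 4 outcomes, one chain per prior.
chain_uniform_4 = sample(coinflip(outcome[1:4]), MH(), 1000)
chain_beta_4 = sample(coinflip_beta_prior(outcome[1:4]), MH(), 1000)

plot(
    histogram(Array(chain_uniform_4[:p]), normalize=true, legend=false,
              title="Uniform prior, 4 outcomes"),
    histogram(Array(chain_beta_4[:p]), normalize=true, legend=false,
              title="Beta prior, 4 outcomes"),
)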


So in this case, by incorporating our beliefs into the prior distribution, the model reached the more plausible values of p faster, needing fewer outcomes to arrive at a very similar posterior distribution. When we used a uniform prior, we were being conservative: we said we knew nothing about p, so we assigned equal probability to all values. Such uniform distributions, called non-informative priors, can sometimes be too conservative, in some cases not helpful at all. They can even slow the model's convergence to the more plausible values of our posterior, as shown.

+

Summary

In this chapter, we gave an introduction to probabilistic programming languages and explored the classic coin flipping example in a Bayesian way.

First, we saw that in this kind of Bernoulli trial scenario, where the experiment has two possible outcomes, 0 or 1, it is a good idea to set our likelihood to a binomial distribution. We also learned what sampling is and saw why we use it to update our beliefs. Then we used the Julia library Turing.jl to create a probabilistic model, setting our prior probability to a uniform distribution and the likelihood to a binomial one. We sampled our model with a Markov chain Monte Carlo algorithm and saw how the posterior probability was updated every time we fed in a new coin flip result.

Finally, we created a new model with the prior probability set to a beta distribution centered on p = 0.5, which gave us more accurate results.

Give us feedback

This book is currently in a beta version. We are looking forward to getting feedback and criticism:

  • Submit a GitHub issue here.

  • Mail us to martina.cantaro@lambdaclass.com

Thank you!

From 5ec51fccd12cddd3f35e71d7b6876717347495a1 Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Mon, 30 Aug 2021 18:52:54 -0300
Subject: [PATCH 12/13] delete .html modifications

---
 docs/05_prob_prog_intro.jl.html | 3647 -------------------------------
 1 file changed, 3647 deletions(-)
 delete mode 100644 docs/05_prob_prog_intro.jl.html

diff --git a/docs/05_prob_prog_intro.jl.html b/docs/05_prob_prog_intro.jl.html
deleted file mode 100644
index caddc755..00000000
--- a/docs/05_prob_prog_intro.jl.html
+++ /dev/null
@@ -1,3647 +0,0 @@

From f241b1b94b4559bd003e002cd9cb73057776f2fc Mon Sep 17 00:00:00 2001
From: Pedro Fontana
Date: Mon, 30 Aug 2021 18:55:28 -0300
Subject: [PATCH 13/13] modify .jl

---
 05_prob_prog_intro/05_prob_prog_intro.jl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/05_prob_prog_intro/05_prob_prog_intro.jl b/05_prob_prog_intro/05_prob_prog_intro.jl
index e932fbca..0eb8ecd8 100644
--- a/05_prob_prog_intro/05_prob_prog_intro.jl
+++ b/05_prob_prog_intro/05_prob_prog_intro.jl
@@ -158,7 +158,7 @@ end;
 md"
 So now we plot below the posterior distribution of *p* after our model updated, seeing just the first outcome, a 0 value or a tail.
 
-How this single outcome affected our beliefs about *p*? 
+How this single outcome affected our beliefs about *p*?
 
 We can see in the plot below, showing the posterior or updated distribution of *p*, that the values of *p* near to 0 have more probability than before, recalling that all values had the same probability, which makes sense if all our model has seen is a failure, so it lowers the probability for values of *p* that suggest high rates of success.
 "