tutorial_BT.Rmd

---
title: "Bayesian workflow for fitting SEIR models using the BayesianTools package"
author: "Fabienne Krauer"
date: "2022-09-03"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Intro

### The Bayesian workflow

The [Bayesian workflow](https://arxiv.org/abs/2011.01808) was proposed by Andrew Gelman, Aki Vehtari and others, and describes the process of model building, inference and model checking/improvement. The Bayesian workflow is not a straight line but an iterative process of checking and refining or discarding components of the model. 

Why should I follow the workflow? Bayesian inference for IDD models is hard, and most models are too complex to be fitted without any problems. The workflow helps to narrow down where the problem is and may help to improve inference. 

The process can be summarised as follows (for a graphical representation, see Gelman's publication):

1. Define a model
2. Define the prior and perform a prior predictive check, modify the prior if necessary
3. Fit the model
4. Validate the computation (diagnostics)
5. Address potential issues with computation (MCMC)
6. Evaluate the model fit: posterior predictive intervals
7. Modify the model and/or the prior if necessary, or add/change the data
8. Compare models

We will go through some of these steps with an easy example below.


### The BayesianTools package

[BayesianTools](https://cran.r-project.org/web/packages/BayesianTools/BayesianTools.pdf) is a general purpose R package, which offers various MCMC and SMC samplers, as well as plot and diagnostic functions for Bayesian statistics, with a particular focus on calibrating complex system models. BT is a modular package, where the user defines the likelihood function and the priors separately. It does not require the use of domain-specific language (such as Stan). 


Samplers:
1) standard Metropolis-Hastings (MH)
2) adaptive Metropolis-Hastings
3) standard or adaptive Metropolis-Hastings with delayed rejection
4) standard Metropolis-Hastings with Gibbs updating
5) standard Metropolis-Hastings with Gibbs updating and tempering
6) Differential Evolution (DE), optional with snooker update (DEzs)
7) Differential Evolution Adaptive Metropolis (DREAM) 
8) t-walk MCMC
9) Sequential Monte Carlo (SMC)

Delayed rejection sampling means that a second (or third, etc.) proposal is made before rejection. This second proposal is drawn from a different distribution, allowing for a greater flexibility of the sampler. 

Gibbs sampling means that in each iteration, only a subset of the parameters is updated. 

Tempering is similar to simulated annealing, and means that the acceptance rate is increased during the burnin phase. This should help to speed up convergence. 

Differential Evolution algorithms are genetic algorithms with multiple chains, which inform each other during the burnin phase. If in doubt which one to choose, prefer "DEzs". The DE sampler uses the current position of two other chains to generate the proposal for each chain. DEzs also uses past states of the two other chains in the creation of the proposal. This allows for fewer chains (i.e. 2x3 chains are usually enough for up to 200 parameters).  
                                                                         
                                                                         
## WORKFLOW

First we load the required libraries and set a seed:
```{r}
library(deSolve)
library(tidyverse)
library(BayesianTools)
library(coda)
set.seed(42)
```

### Model

We start by defining the model. This is a simple SEIR model without demographic changes and permanent immunity:

```{r model_SEIRS}
model_SEIR <- function(times, inits, theta) {
  
  SEIR <- function(times, inits, theta) {
    
    S = inits[["S"]]
    E = inits[["E"]]
    I = inits[["I"]]
    R = inits[["R"]]
    N = S + E + I + R
    
    beta <- theta[["beta"]]
    sigma <- theta[["sigma"]]
    gamma <- theta[["gamma"]]

    dS <- -beta*S*I/N 
    dE <- beta*S*I/N - sigma*E
    dI <- sigma*E - gamma*I
    dR <- gamma*I
    dC <- beta*S*I/N
    list(c(dS, dE, dI, dR, dC))
    
  }  
  
  traj <- data.frame(lsoda(inits, times, SEIR, theta))
  # Calculate the incidence per time step from the cumulative state:
  traj$inc <- c(inits["I"], diff(traj$C))
  return(traj)
  
}
```

Next, we define the parameters and the initial conditions for the model. The initial conditions assume that everyone is susceptible, except one person, which is the first infectious individual:
```{r params}
beta <- 0.55 # transmission rate
sigma <- 1/5 # =1/duration of latency
gamma <- 1/5 # = 1/duration of infectiousness
theta <- c(beta=beta, sigma=sigma, gamma=gamma) # combine in one vector "theta"

inits <- c("S"=10000-1, "E"=0, "I"=1, "R"=0, "C"=1) 

```

We the define the time points at which the ODE solution is saved. Note, the unit of the time steps MUST BE the same as the unit of your parameters, i.e. if $\gamma$ = 1/5 days, then one time step (e.g. 1 to 2) is 1 day.
Also remember that integration is computationally costly, so save only the time steps you need. 

```{r}
times <- seq(1:100)
```

Run model once and plot the trajectory:

```{r}
traj <- model_SEIR(times, inits, theta)

```

This is a closed population model. It is always a good idea to do a "sanity check": the sum of all states must be constant and equivalent to the total number of individuals. If it increases or decreases, there is probably a mistake in your model code. 

```{r}
summary(rowSums(traj[,c(2:5)]))
```

```{r, echo=FALSE}
ggplot(traj) +
  geom_line(aes(x=time, y=inc))

```


### Data

We now generate some data from the model ("synthetic data") for the fitting process. The advantage of simulated data is that you know the "true" parameter values that generated the data, which makes learning about inference easier. 

Since case data are commonly count data, we here assume that the observation process is a Poisson process:

```{r data}
data <- data.frame(obs=sapply(traj$inc, function(x) rpois(1, x)),
                   time = times)

```

and plot the generated data and the modelled trajectory:

```{r}
ggplot() + 
  geom_line(data=traj, aes(x=time, y=inc)) +
  geom_point(data=data, aes(x=time, y=obs))

```

### Priors

We now define the prior distributions. For convencience, we define the lower and upper limits for ALL parameter, even if we don't fit them all initially. You should carefully think about the bounds of each parameter, and whether they make sense from a mechanistic/biological point of view. 

```{r}
# Estimated params
estpars <- c("beta", "sigma") # parameters to estimate, can be modified
index <- which(names(theta) %in% estpars) # index of estimated params

# Priors
lower = c("beta"=0, "sigma"=0)
upper = c("beta"=0.5, "sigma"=0.1)

```

We then create a prior object. BT has a function to `createUniformPrior()` to generate uniform priors for all parameters. It is a good starting point, but you often need to define other distributions for some priors. 

```{r}
# Create uniform prior
prior_unif <- createUniformPrior(lower=lower, 
                                 upper=upper)

```


### Prior predictive check

Before we do any fitting, it is often useful to check whether the prior distributions you have chosen actually generate a model trajectory that overlaps with the data. If your prior is poorly specified, getting a posterior is not possible. We can do this by sampling from the prior and forward simulating the model trajectory with the sampled values. This is called the prior predictive check. The following function does this (cave: it's slow and not optimised) and summarises the min and max trajectory values for each time step in the model. (You can also choose to sample `prob = c(0.025, 0.975)`, which will result in a 95% interval). The observation process (here a Poisson) is also included to account for the observation uncertainty: 


```{r}
sample_prior_Pois <- function(prior, model, times, inits, 
                       theta, index, estpars, nsims) {
  
  traj <- vector("list", nsims)
  for (i in 1:nsims) {
    
    theta_est <- prior$sampler() # sample from the prior
    names(theta_est) <- estpars
    theta_updated <- c(theta[-index], theta_est)
    inc <- match.fun(model)(times, inits, theta_updated)[,c("inc")]
    inc_obs <- sapply(inc, function(x) rpois(1, x))
    traj[[i]] <- inc_obs
  }
  
  trajdf <- t(plyr::ldply(traj, rbind))
  quantiles <- data.frame(t(apply(trajdf, 1, quantile, prob = c(0.0, 1.0))))
  quantiles$time <- times
  colnames(quantiles)[1:2] <- c("min", "max")

  return(quantiles)
}


```

Let's sample and plot:

```{r}
priorpred <- sample_prior_Pois(prior_unif, model_SEIR, times, inits, 
                       theta, index, estpars, nsims=500)

ggplot() + 
  geom_point(data=data, aes(x=time, y=obs)) +
  geom_ribbon(data=priorpred, aes(x=time, ymin=min, ymax=max), alpha=0.5, fill="red")
 
```

The prior predictive does not include all the data points, which is no surprise given that the "true" parameter values are outside the prior range.

Let's update the prior ranges and redo the prior predictive:

```{r}
# Priors
lower = c("beta"=0, "sigma"=0)
upper = c("beta"=5.0, "sigma"=1.0)

prior_unif <- createUniformPrior(lower=lower, 
                                 upper=upper)


priorpred <- sample_prior_Pois(prior_unif, model_SEIR, times, inits, 
                       theta, index, estpars, nsims=500)

ggplot() + 
  geom_point(data=data, aes(x=time, y=obs)) +
  geom_ribbon(data=priorpred, aes(x=time, ymin=min, ymax=max), alpha=0.5, fill="red")
 

```

This time the data range is covered in the prior predictive, meaning that the prior boundaries are well chosen. The prior predictive also reveals that a lot of parameter samples will lead to trajectories far away from the data. This can potentially make the inference process more challenging if you have a large number of parameters to estimate. You may have to adjust the prior in the process of the fitting, to facilitate the sampling. 


### Likelihood function

The log-likelihood function is the core of Bayesian inference. Again, we are assuming the observations arise from a Poisson process (count data). In BT, you need to define a function that takes the parameters and returns a single log-likelihood value:


```{r}
ll_Pois <- function(model, theta, inits, times, data) {
  
  traj <- match.fun(model)(times, inits, theta)
  datapoint <- data$obs
  modelpoint <- traj$inc


  if (any(is.na(modelpoint))) {
    ll <- -Inf
  } 
  
  
  else { # Minimize the log likelihood
    ll <- sum(dpois(x=datapoint, 
                    lambda=modelpoint,
                    log=TRUE), na.rm=TRUE)
  }
  
  return(ll)
}

```

You should always test the LL function:

```{r}
# test
ll_Pois(model_SEIR, theta, inits, times, data)

```

We also have to define a wrapper function for the log-likelihood function, because BT requires that the LL-function takes only one argument, which are the estimated parameters. Remember to adjust this function when you change the names of the arguments of the likelihood function:

```{r}
ll_Pois_wrapper <- function(par) {
  
  parX = theta
  parX[index] = par

  return(ll_Pois(model=model_SEIR,
                 theta=parX, 
                  inits=inits, 
                  times=times, 
                  data=data))
}    

# Test (should return the same as ll_pois())
ll_Pois_wrapper(theta[index])

```

### Set up and run the MCMC

Finally, we define the setup for the sampler and run the MCMC. We here choose the simplest sampler (standard Metropolis Hastings), and run 3 chains with 20_000 iterations each. It is not always easy to determine the number of iterations to run, the best thing to do is to start with something like 20_000 iterations (for random walk samplers only. For gradient-based samplers like HMC/NUTS you can start with 1000). We also want to run more than one chain to determine convergence. 

Note that I have suppressed the messages to the console for this tutorial with `message = FALSE`, but you should have message enabled (the default) to monitor the progress of the fitting. 

```{r}
mcmc_settings1 <- list(iterations = 20000, 
                      nrChains = 3,
                      message = FALSE,
                      optimize=FALSE, 
                      adapt=FALSE)

bayesianSetup_unif <- createBayesianSetup(prior = prior_unif,
                                     likelihood = ll_Pois_wrapper,
                                     names = names(theta[index]),
                                     parallel = FALSE)

system.time({chain_unif1 <- runMCMC(bayesianSetup = bayesianSetup_unif, 
                              sampler = "Metropolis", 
                              settings = mcmc_settings1)})


```

### Diagnostics

The first thing you do is to visually inspect the trace plots:

```{r}
plot(chain_unif1)
```

The chains seem to have converged, but the mixing is very poor (no caterpillar), the chains stick too long in one place suggesting that the proposal is poorly chosen. 

The summary also shows that the rejection rate is very high:

```{r}
summary(chain_unif1)
```

We will thus try a adaptive MCMC, which alters the proposal distribution during the fit, and also extend the iterations:

```{r}
mcmc_settings2 <- list(iterations = 50000, 
                      nrChains = 3,
                      message = FALSE,
                      optimize=FALSE, 
                      adapt=TRUE)


system.time({chain_unif2 <- runMCMC(bayesianSetup = bayesianSetup_unif, 
                              sampler = "Metropolis", 
                              settings = mcmc_settings2)})


```

```{r}
plot(chain_unif2)
```

Including the adaptation of the proposal has improved the mixing substantially. We remove the first 10_000 iterations (burn-in phase):

```{r}
nburn <- 15000
plot(chain_unif2, parametersOnly = TRUE, start =nburn)

```

```{r}
summary(chain_unif2)
```
The rejection rate is much better and the effective sample size (ESS) has also increased. The ESS is an estimate for the number of independent samples (taking into account autocorrelation of the individual samples) generated by the MCMC run. There is no universal threshold for how large the ESS should be, but you should aim for at least 100 independent samples.  


We then check convergence mathematically with the PSRF. The PSRF gives the scale reduction factor for each parameter. A PSRF of 1 means that variance between and within the chains are equal, which is the goal. As a rule of thumb, a if PSRF < 1.05, convergence is very likely. But you should always also check the traceplots

```{r}
gelmanDiagnostics(chain_unif2, plot=FALSE, start=nburn)

```

It is also advised to check parameter correlations. Sometimes these correlations can make inference more challenging, i.e. the posterior is too complex with several local modes. In that case, you may need to re-formulate (reparametrize) the model or run the chains for longer. However, ODE models are difficult to reparametrize. 

Depending on what the purpose of the model inference is, parameter correlation is a larger or smaller problem. If we are not too concerned about the mechanistic meaning of the parameters, we may accept some degree of parameter correlation. If we are intersted in learning about the mechanisms underlying the disease data, we may need to add additional data for the fitting and/or change the priors to narrow down the parameter space and reduce correlation. 

```{r}
correlationPlot(chain_unif2, start=nburn)

```

In this example, the parameters are strongly correlated, but 
1) We still have convergence of the chains
2) We know the posteriors are correct
3) We cannot reparametrize this model.

Thus, we will leave the model as it is and accept the parameter correlation. However, in a real life study, we would want to reduce this correlation. 

We can also check the summary statistics of the marginal posteriors. For this, we need to convert the chain to a coda object and summarize. For a description of the marginal posteriors, we commonly use the median and the equal tailed 95% CrI (credible interval), which is the 2.5% and 97.5% percentile: 

```{r}
coda2 <- getSample(chain_unif2, parametersOnly = TRUE, coda=TRUE, start=nburn)
summary(coda2)

```

Alternatively, you can choose the highest posterior density interval. We have to combine the chains to get an overall estimate:

```{r}
HPDinterval(mcmc(do.call(rbind, coda2)))

```

## Evaluating the model fit

If the chains are healthy and converged, and the marginal posterior distributions are reasonable, we can check the fit to the data by evaluating the posterior predictive. The posterior predictive distribution is generated by taking random draws from the posterior and forward simulating the model trajectory. The simulated trajectories are then summarised over each time step in the model by calculating the percentiles, e.g. 2.5th and 97.5th percentile. The resulting uncertainty is called the (95%) posterior predictive interval. 

The following function calculates the quantiles from a given model and posterior:

```{r}
sample_posterior_Pois <- function(chain, theta, inits, times, model, ndraw, nburn, progress) {
  
  #Draw n fitted parameter vectors theta from the MCMC object
  sample <- getSample(chain, parametersOnly = TRUE, thin=1, numSamples=ndraw, start=nburn)
  
  fit <- plyr::adply(.data=sample, .margins=1, .progress=progress, .parallel=F, .fun=function(x) {
    
    #Define the theta as the estimated parameters from the current draw and the fixed parameters
    theta_sample <- c(x, theta[!is.na(theta)])
    
    #Simulate trajectory for selected theta
    foo <- match.fun(model)(times, inits, theta_sample)
    foo$simobs <- sapply(foo$inc, function(y) rpois(n=1, lambda=y))

    return(foo)
  })
  
  colnames(fit)[1] <- "replicate"
  
  quantiles <- plyr::ddply(.data=fit, 
                           .variables="time", 
                           function(x) quantile(x[, which(colnames(fit)=="simobs")], prob = c(0.025, 0.5, 0.975), na.rm=T)) 
  colnames(quantiles) <- c("time", "low95PPI", "median", "up95PPI")
  
  return(quantiles)
  
}
```

Check the fit. Note that I use `progress="none"` to suppress the messages in the markdown file, but you should always monitor the progress with `progress="text"`

```{r}
fit_quantiles <- sample_posterior_Pois(chain_unif2, theta, inits, times, model_SEIR, ndraw=500, nburn=nburn, progress="none")

# Plot fit
ggplot() +
  geom_point(data=data, aes(x=time, y=obs)) +
  geom_line(data=fit_quantiles, aes(x=time, y=median), color="red") +
  geom_ribbon(data=fit_quantiles, aes(x=time, ymin=low95PPI, ymax=up95PPI), alpha=0.2, fill="red")


```

### Final assessment

1. Does the trajectory fit look good?
2. Is the chain mixing healthy?
3. Have the chains converged?
4. Are the marginal posteriors are biologically meaningful, unimodal and informative (not too wide)?

In this example, we would conclude that the requirements are met and that we would keep the final model and chains. If the fit were poor, we would need to reformulate the model and/or the prior distributions, and/or add more data for fitting, or choose a different MCMC algorithm. More complex models are often difficult to fit with a Metropolis algorithm. Genetic algorithms, parallel tempering or gradient-based methods such as HMC/NUTS are more appropriate in these cases. 

If the chains do not converge, it is often a problem of a complex posterior geometry due to a complex log likelihood surface. Tightening the priors can remove some of these issues, but it is useful to first investigate the log likelihood surface of the parameters of interest to understand how the priors should be adjusted. Strictly speaking, the prior should be independent of the data (i.e. the log likelihood), but in reality you sometimes need to adjust the prior to get convergence at all for ODE models. 


## Advanced usage

### Custom priors

Let's assume we would have to modify the priors to make them more informative. In BT, if you want a custom prior, you have to write out a designated density and sampler function. We will choose a truncated normal NT(0.5, 0.1) distribution for $\beta$, and a Beta(2,5) distribution for $\sigma$. Check [here](https://homepage.divms.uiowa.edu/~mbognar/applets/normal.html) and [here](https://homepage.divms.uiowa.edu/~mbognar/applets/beta.html and) to understand what these priors look like.   

First we define the density function, which returns the joint (summed) values of the density functions of the sampled parameters:

```{r}
density <- function(par) {
  
  return(
    msm::dtnorm(par[1], # Beta 
                mean = 0.5, 
                sd = 0.1, 
                lower = lower[["beta"]], 
                upper = upper[["beta"]], 
                log = TRUE) + 
      dbeta(par[2],  # Sigma
            shape1 = 2, 
            shape2 = 5, 
            log = TRUE)
    
  )
}

```

The sampler function draws one sample from the defined prior distributions. Cave: ensure that the order of the parameters in the density function corresponds to the order in `theta[index]`:

```{r}
sampler <-  function(n=1){
  
  return(cbind(
    
    msm::rtnorm(n, # Beta 
                mean = 0.5, 
                sd = 0.1, 
                lower = lower[["beta"]], 
                upper = upper[["beta"]]),
    rbeta(n, # Sigma
          shape1 = 2, 
          shape2 = 5)
  ))
  
}

```

We then define a new prior object:

```{r}
prior_custom <- createPrior(density=density, 
                            sampler=sampler, 
                            lower=lower, 
                            upper=upper)


```

Now update the Bayesian setup and fit again:

```{r}
bayesianSetup_custom <- createBayesianSetup(prior = prior_custom,
                                     likelihood = ll_Pois_wrapper,
                                     names = names(theta[index]),
                                     parallel = FALSE)

system.time({chain_custom <- runMCMC(bayesianSetup = bayesianSetup_custom, 
                              sampler = "Metropolis", 
                              settings = mcmc_settings2)})

# Diagnostics
plot(chain_custom)
```

### Model comparison

We can calculate the Bayes Factor of two models and compare them to understand, which model is a better explanation of the data. Let's compare here the model with the uniform prior to the model with the custom prior.

The simplest approach is to compare the DIC of both models. The best model is the one which minimises the DIC (the smaller the better). Note that in real life we probably wouldn't compare two identical models which differ in their prior, but rather two different models. 

```{r}
DIC(chain_unif2, start=nburn)$DIC 
DIC(chain_custom, start=nburn)$DIC

```

We can also calculate the Bayes Factor, which is the ratio of the posteriors:
A BF of < 1 supports the alternative model M1.
A BF of > 1 supports original model M0

```{r}
M0 <- marginalLikelihood(chain_unif2, start=nburn)
M1 <- marginalLikelihood(chain_custom, start=nburn)

exp(M0$ln.ML - M1$ln.ML)

```


### Parallel sampling


BT has an option to run the chains in parallel (if multiple chains are fitted). However, it is not always much faster than the non-parallel version. This option can be setup as follows:

```{r, eval=FALSE}
# You first have to check the maximum number of cpus you can use on your computer.
library(parallel)
detectCores()

# Define how many cores you want. Use one core per chain. For DEzs, use one core per chain-bundle (=3 chains)
cpus <- 2  


# Options for parallel computing
opts <- list(packages=list("BayesianTools", "deSolve"), 
             variables=list("theta", "data", "index", "times", "model", "inits","loglik"), 
             dlls=NULL)

bayesianSetup <- createBayesianSetup(prior = prior,
                                     likelihood = ll_wrapper,
                                     names = names(theta[index]),
                                     parallel = cpus,
                                     parallelOptions = opts)

system.time({chain <- runMCMC(bayesianSetup = bayesianSetup, 
                              sampler = sampler, 
                              settings = mcmc_settings)})

stopParallel(bayesianSetup)


```


### Chain extension

You also have the option to extend a chain that has not converged yet, without starting from scratch. To do this, you can re-feed the chain object into the `runMCMC()` function. Note that this does not work if the original chain was run in parallel. 

```{r, eval=FALSE}
chain_ext <- runMCMC(bayesianSetup = chain, sampler = sampler_algo, settings = mcmc.settings)

```