JAGS_SIR2.Rmd

---
title: "A Bayesian SIR model in JAGS"
output: html_notebook
---

```{r, echo = FALSE, include= FALSE}
rm(list = ls())
set.seed(102)
```


### Introduction

In this notebook I am going to fit a Bayesian SIR model in JAGS. Some of this material is a simplification of the more advanced spatial SEIRS model found in this [thesis](https://ir.uiowa.edu/etd/1554/) by Grant Brown from 2015 which is well worth reading. Another very useful paper is [Lekone and Finkelstadt](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1541-0420.2006.00609.x?casa_token=_3CKtdvUxAgAAAAA:74e1tIqSoCFQlE6k8WmaCjc1shBMMIe435o7uF36QLkPw-kPeJqzLwC4ZB0LCIpnh8cNDLM-O2ULZqQ) which has no spatial aspect, and is the majority of what is re-created here. I have changed the notation to what I believe is a clearer version.

The main idea is to simplify the differential equations of the SIR model into difference equations and then use probability distributions and transmission matrices to quantify uncertainties between compartments. All time is treated as discrete, which is fine for current COVID-19 data which is almost all available at a daily time step. In another document I will extend this model to SEIR, and further a spatial or stratified type model. It may be possible to fit a similar model in Stan but this would require integrating out many of the discrete parameters. 

### Notation

First data:

- Let $t$ be time for $t = 1, \ldots, T$ with all time points known and equally spaced. 
- Let $N$ be the population size
- Let $N_{S \rightarrow I}(t)$ be the number of susceptible people who get infected at time $t$. This, if known, is the number of cases at time $t$.
- Let $N_{I \rightarrow R}(t)$ be the number of infected people who are removed (died or recovered) at time $t$. We have a (likely reasonable) estimate of the number of people who have died at time $t$ (or at least before time $t$) but a much poorer estimate of the number who have recovered. The latter is reported by John Hopkins but likely to be a massive under-estimate, or based on quite strict assumptions of disease length.

Now latent variables:

- Let $S(t)$ be the number of susceptible individuals at time $t$ with starting value $S(1)$
- Let $I(t)$ be the number of infected individuals at time $t$ with starting value $I(1)$
- Let $R(t)$ be the number of removed individuals at time $t$ with starting value $R(1)$

Some parameters:

- $\beta$ is the base transmission rate, i.e. the rate at which susceptible individuals become infected
- $\gamma$ is a parameter that controls the rate at which infected individuals are removed. I think the inverse of this is the mean number of days before an infected individual becomes removed

Some extra parameters are derived:

- Let $R_0$ (or $R_0(t)$) be the basic reproduction number, possibly changing over time, that determines the expected number of people who turn susceptible from an infected person
- Let $\tau^*$ be the time point when the epidemic goes extinct, i.e. when there are no more infected individuals left

### Model

The model equations are given by:
$$S(t+1) = S(t) - N_{S \rightarrow I}(t)$$
$$I(t+1) = I(t) + N_{S \rightarrow I}(t) - N_{I \rightarrow R}(t)$$
$$S(t) + I(t) + R(t) = N$$

In English these equations mean:

  - The number of susceptible individuals at time $t+1$ is equal to the previous number of susceptible individuals less the number of people who were susceptible but now infected at time $t$
  - The number of infected people at time $t+1$ is the number of infected people at time $t$ plus the number of people who move from susceptible to infected at time $t$ less the number of removed individuals who have moved from exposed to removed. 
  - The total number of people in each compartment is always equal to the population size

Each of the latent variables is given a binomial distribution

$$N_{S \rightarrow I}(t) \sim Bin(S(t), p_{S \rightarrow I}(t))$$
$$N_{I \rightarrow R}(t) \sim Bin(I(t), p_{I \rightarrow R})$$

In words these equations mean:

- The number of susceptible people who get infected at time $t$ is a binomial distribution with maximum value $S(t)$ and time-dependent probability $p_{S \rightarrow I}(t)$. This means that the number of people moving from susceptible to infected at time $t$ has to be a fraction of the total number of susceptible individuals
- The number of infected individuals who get removed at time $t$ is a binomial distribution with maximum value $I(t)$ and probability $p_{I \rightarrow R}$. This means that the number of people who get removed at time $t$ is a fraction of the number of people who are infected at time $t$

These probabilities are set at

$$p_{S \rightarrow I}(t) = 1 - \exp \left( - \beta \right)$$
This equation makes the proportion of people moving from susceptible to infected dependent on the proportion of people who are infected ($I(t)/N$), i.e. the greater the proportion of people infected, the fewer can become infected. 

$$p_{I \rightarrow R} = 1 - \exp( - \gamma)$$
where 

- $\beta$ is the transmission rate
- $1/\gamma$ is the mean infectious period

One of the key parameters to estimate is the basic reproduction number which is defined as $R_0$ or $\beta / \gamma$ in this notation. This model has the cool feature that we can estimate a time dependent reproduction number as: 

$$ R_0(t) = \frac{\beta}{\gamma}$$

Ideally, the data for this model would be $N, N_{S \rightarrow I}(t), N_{I \rightarrow R}(t)$, the latter three for all time values. In fact we often do not have access to $N_{S \rightarrow I}(t)$ (the number of susceptible individuals who become infected) so we may need to integrate out these values during the Bayesian model.

### JAGS model

Below is some code for a JAGS model:

```{r}
jags_code = '
model {
  # Likelihood
  for (t in 1:T) {
    N_I_R[t] ~ dbinom(p_I_R, I[t])
    N_S_I[t] ~ dbinom(p_S_I, S[t])
  }

  # These are the known time evolution steps:
  for(t in 2:T) {
    S[t] <- S[t-1] - N_S_I[t-1]
    I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
    R[t] <- N - S[t] - I[t]
  }

  # Need a value for S[1] and I[1]
  I[1] <- I_start # Assume number of infected people at start
  R[1] <- R_start # As above but for removed
  S[1] <- N - I[1] - R[1] # Left over

  # Probabilities and R_0  
  p_S_I <- 1 - exp( - beta )
  p_I_R <- 1 - exp( -gamma )
  R_0 <- (beta/gamma)

  # Now the prior on the hyper-parameters
  beta ~ dunif(0.3, 1) #= 0.6
  gamma <- 1/gamma_inv
  gamma_inv ~ dunif(6, 12) # 10

  # Can now forecast into the future
  for (t in (T+1):T_max) {
  
    # Transitions
    N_I_R[t] ~ dbinom(p_I_R, I[t])
    N_S_I[t] ~ dbinom(p_S_I, S[t])
    
    # Compartment values
    S[t] <- S[t-1] - N_S_I[t-1]
    I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
    R[t] = N - S[t] - I[t]
  }
}
'
```

### Fit to simulated data

I will first try simulating some data from this model and checking that JAGS actually works. 

```{r}
# Summary values
N = 1000 # Population size
T = 30 # Maximum time steps
t = 1:T

# First the hyper-parameters
gamma_inv = 10; gamma = 1/gamma_inv
beta = 0.6

# Now the probabilities
p_I_R = 1 - exp(-gamma)
p_S_I = 1 - exp(-beta)

# Give an initial values for everthing required
N_S_I = N_I_R = S = I = R = rep(NA, T)

S[1] = N - 10; I[1] = 10; R[1] = 0
N_S_I[1] = rbinom(1, S[1], p_S_I)
N_I_R[1] = rbinom(1, I[1], p_I_R)

# Now can loop through filling in the other values
for (t in 2:T) {
  S[t] <- S[t-1] - N_S_I[t-1]
  I[t] <- I[t-1] + N_S_I[t-1] - N_I_R[t-1]
  R[t] = N - S[t] - I[t] 

  # Now get the actual values
  N_I_R[t] = rbinom(1, I[t], p_I_R)
  # These are usually imputed:
  N_S_I[t] = rbinom(1, S[t], p_S_I)
}

# Create R0
R_0 = (beta/gamma)

# Create a plot of the epidemic
library(ggplot2)
library(tidyr)
tibble(t = 1:T,
       S, I, R) %>%
  pivot_longer(names_to = 'Compartment', values_to = 'People', -t) %>%
  ggplot(aes(x = t, y = People, colour = Compartment)) +
  #scale_y_log10() +
  geom_line()
```

Now we can fit this model by providing the data to JAGS

```{r, message= FALSE,  results='hide', warning=FALSE}
N_future = 50 # Forecast X days into the future
T_max = T + N_future 
jags_data = list(N = N,
                 T = T,
                 T_max = T_max, 
                 I_start = I[1],
                 R_start = R[1],
                 N_S_I = c(N_S_I, rep(NA, N_future)), # You can comment out this line if you want it to not have the number of cases. 
                 N_I_R = c(N_I_R, rep(NA, N_future))) 

library(R2jags)
jags_run = jags(data = jags_data,
                parameters.to.save = c("gamma_inv", "beta",
                                       "S","I", "R", "R_0"),
                model.file = textConnection(jags_code))
plot(jags_run)
```

Now plot the number of infected over time:

```{r}
S_post = jags_run$BUGSoutput$median$S
I_post = jags_run$BUGSoutput$median$I
R_post = jags_run$BUGSoutput$median$R

tibble(t = 1:T_max,
  S_post, I_post, R_post) %>%
  pivot_longer(names_to = 'Compartment', values_to = 'People', -t) %>%
  ggplot(aes(x = t, y = People, colour = Compartment)) +
  geom_line()

```

Now plot the parameter estimates against their true values

```{r}
post = jags_run$BUGSoutput$sims.list
tibble(iter = 1:length(post$beta),
       beta = post$beta,
       gamma_inv = post$gamma_inv,
       R0 = post$R_0) %>%
  pivot_longer(names_to = 'Type', values_to = 'Sample',-iter) %>%
  ggplot(aes(x = Sample, fill = Type)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ Type, scales = 'free') +
  geom_vline(data = tibble(iter = rep(1, 3),
                           Sample = c(beta, gamma_inv, R_0),
                           Type = c('beta', 'gamma_inv', 'R0')),
             aes(xintercept = Sample)) +
  theme(legend.position = 'None')
```
All values seem to match, depending on whether you give it the number of cases ($N_{S \rightarrow I}$) or not.