6 Bayesian Updating & Spatial Temporal Modelling

6.1 Introduction

This week’s practical will show you how to perform Bayesian Updating. This approach is excellent for analysing new information that arrives in separate streams, either on a ‘regular’ or an ‘irregular’ basis, and it is highly suitable within a repeated cross-sectional analytical framework. Bayesian Updating is a statistical method for revising the estimated probabilities of events you have observed previously in light of new evidence (i.e., new data).
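
To make this idea concrete before we move to the spatial model, here is a minimal toy sketch in R (using made-up numbers, not the practical’s data) of conjugate Beta-Binomial updating: a prior belief about an infestation probability is revised as each new batch of inspections arrives, and the posterior from one batch becomes the prior for the next.

# toy illustration of Bayesian updating (Beta-Binomial conjugacy) - hypothetical numbers
# prior belief about the probability that a household is infested: Beta(2, 18)
a <- 2; b <- 18

# batch 1 of new evidence: 30 households inspected, 6 infested
a <- a + 6;  b <- b + (30 - 6)   # the posterior after batch 1 becomes the new prior

# batch 2 of new evidence: 50 households inspected, 9 infested
a <- a + 9;  b <- b + (50 - 9)   # the posterior after batch 2

# posterior mean of the infestation probability after both updates
a / (a + b)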

The learning outcomes for this section are to apply this technique to mosquito surveillance data in order to:

  • update the risk profile of neighbourhoods;
  • see how the risk parameters evolve in the face of new data;
  • and, above all, learn the process of updating and the data management it involves.

6.1.1 Loading and installing packages

Now, let’s load all the packages needed for this computer practical:

# Load the packages with library()
library("sf")
library("tmap")
library("spdep")
library("rstan")
library("geostan")
library("SpatialEpi")
library("tidybayes")
library("tidyverse")

As usual, upon loading the rstan package, it is highly recommended to run this code to configure the settings in RStudio:

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

6.1.2 Datasets & setting up the work directory

Set your work directory to Week 5’s folder. For example, for Windows users, the code for setting the work directory will be:

setwd("C:/Users/AccountName/Desktop/GEOG0125/Week 5")

For MAC, the code for setting the work directory will be:

setwd("/Users/AccountName/Desktop/GEOG0125/Week 5")

The datasets for this section of the practical are:

  • Lira_January_2017.csv
  • Lira_April_2017.csv
  • Lira_July_2017.csv
  • Campina_Grande_Lira2017.shp

The surveillance and environmental information are contained in the monthly LIRA files. The spatial configuration is the shapefile for Campina Grande (Brazil).

  • The Campina Grande shapefile contains 47 neighbourhoods. The unique identifier, neighbourhood number and neighbourhood name are defined under the columns osmID, NeighbNum and NeighbNam respectively;
  • The January, April and July LIRA 2017 datasets each contain the columns osmID, NeighbNum and NeighbNam to link with the shapefile. InfestedNum is the dependent variable, representing the total number of households identified as infested with mosquito breeding habitats in a neighbourhood. The column Total is the total number of households in a neighbourhood; this denominator serves as the reference population and is used to compute the expected number of infested households in each neighbourhood (Expected), which will be used as an offset in the Bayesian model.
  • Lastly, Temperature, Precipitation, NDVI and Urbanisation are the continuous independent variables that will be modelled against the number of infested households.

Let us load these datasets into memory:

# load the data sets
Lira2017_1 <- read.csv("Lira_January_2017.csv")
Lira2017_2 <- read.csv("Lira_April_2017.csv")
Lira2017_3 <- read.csv("Lira_July_2017.csv")

# load the shapefile
campina_grande_neighbourhoods <- read_sf("Campina_Grande_Lira2017.shp")
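
Before moving on, it is worth a quick (optional) sanity check that the rows of the LIRA files line up with the neighbourhoods in the shapefile, since Stan will match the i-th outcome to the i-th node of the adjacency structure. The sketch below assumes the shared osmID column appears in the same order in both objects; it also illustrates how an indirectly standardised Expected column is typically derived (the LIRA files already supply this column, so that part is purely illustrative).

# optional sanity checks (illustrative only)
# inspect the first few rows of the January data
head(Lira2017_1)

# confirm the CSV rows follow the same order as the shapefile's neighbourhoods
all(Lira2017_1$osmID == campina_grande_neighbourhoods$osmID)

# purely illustrative: an indirectly standardised expected count is the area's total
# households multiplied by the overall infestation rate (the files already contain 'Expected')
# overall.rate <- sum(Lira2017_1$InfestedNum) / sum(Lira2017_1$Total)
# Lira2017_1$Expected <- Lira2017_1$Total * overall.rate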

6.1.3 Data preparation in RStudio

6.1.3.1 Converting the spatial adjacency matrix to nodes & edges

We will need to derive a list of nodes and edges because Stan can only represent spatial adjacency as a set of paired nodes connected by edges. For instance, node1 represents the index area and node2 is the list of neighbouring areas connected to the index region defined in node1 (see section 5.3.2, Week 4).

We can perform this spatial operation by first coercing the sf simple features object (i.e., campina_grande_neighbourhoods) to a spatial sp object.

Here is the code:

# coerced to a "sp" object
campina_grande_neighb_spobject <- as(campina_grande_neighbourhoods, "Spatial")

Now, we are going to derive the nodes and edges from campina_grande_neighb_spobject using the shape2mat() function - this changes it into a matrix object. From the matrix object, we will be able to prepare the data for the spatial ICAR model using the prep_icar_data() function. Here is the code:

# coerced into a "matrix" object
adjacencyMatrix <- shape2mat(campina_grande_neighb_spobject)
# we will need to extract the components for the ICAR model when setting up the data set for Stan
extractComponents <- prep_icar_data(adjacencyMatrix)

From the extractComponents object, we will need to extract the following contents:

  • $group_size is the number of areal units under observation listed in the shapefile (it should be the same in the LIRA dataset);
  • $node1 is the list of index neighbourhoods of interest;
  • $node2 is the list of neighbouring areas that are connected to the index neighbourhood listed in node1;
  • $n_edges is the number of edges, i.e., the connections showing which area is adjacent to which neighbourhood. It is still based on the queen contiguity adjacency matrix, but expressed as a network because Stan can only use this structure.

Here is the code for performing the extraction:

n <- as.numeric(extractComponents$group_size)
nod1 <- extractComponents$node1
nod2 <- extractComponents$node2
n_edges <- as.numeric(extractComponents$n_edges)

Note that the information needed to be compiled in Stan is stored in n, nod1, nod2 and n_edges.
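
If you want to see what was extracted, a quick (optional) inspection might look like this:

# optional: inspect the extracted components
n                          # number of neighbourhoods (should be 47)
head(nod1); head(nod2)     # first few node pairs i.e., index areas and their connected neighbours
n_edges                    # total number of edges in the contiguity network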

6.1.3.2 Creating the baseline set of data for Stan

In the next step, we prepare the data as a list object and define the dependent and independent variables needed to be compiled in Stan. The outcome InfestedNum is going to be passed as y; the set of independent variables, i.e., Temperature, Precipitation, NDVI and Urbanisation, is going to be made into a matrix object called x; K is going to contain the number of independent variables, i.e., 4; and the offset variable Expected will be extracted as e.

Here is the code:

y=Lira2017_1$InfestedNum
x=Lira2017_1[,c(9,10,11,12)]
e=Lira2017_1$Expected

Now, we create our dataset for Stan:

# put all components into a list object
stan.spatial.baseline <- list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), offset=e)

The above information is going to be passed to the data block in Stan for compilation. Note that the names in this list must match the names declared in the Stan data block (e.g., offset).
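
Before compiling, you can (optionally) confirm that the names and dimensions in this list match what the Stan data block below expects:

# optional: check the structure of the list against the Stan data block
str(stan.spatial.baseline)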

6.1.4 Full Stan script for the Spatial ICAR model

We will use the full script below for executing the spatial intrinsic conditional auto-regressive (ICAR) model. This will be performed on the baseline dataset stan.spatial.baseline.

Please open a new Stan file and use the following code:

// Stan script: Spatial ICAR Poisson model

data {
  int<lower=0> N;                                                        // number of spatial units or neighbourhoods in Campina Grande
  int<lower=0> N_edges;                                                  // number of edges connecting adjacent areas using Queens contiguity
  array[N_edges] int<lower=1, upper=N> node1;                            // list of index areas showing which spatial units are neighbours
  array[N_edges] int<lower=1, upper=N> node2;                            // list of neighbouring areas showing the connection to index spatial unit
  array[N] int<lower=0> Y;                                               // dependent variable i.e., number of households infested
  vector<lower=1>[N] offset;                                             // total number of houses in neighbourhoods is used as an offset variable
  int<lower=1> K;                                                        // number of independent variables i.e., K = 4
  matrix[N, K] X;                                                        // independent variables in matrix form i.e., temp, prec, ndvi, and urban
}

transformed data {
    vector[N] log_offset = log(offset);
}

parameters {
  real alpha;                                                            // intercept
  vector[K] beta;                                                        // covariates
  real<lower=0> sigma;                                                   // overall standard deviation
  real<lower=0, upper=1> rho;                                            // proportion unstructured vs. spatially structured variance
  vector[N] theta;                                                       // unstructured random effects (heterogeneous)
  vector[N] phi;                                                         // spatial random effects
}

transformed parameters {
  vector[N] combined;                                                    // combined random effect i.e., unstructured and structured

  combined = sqrt(1 - rho) * theta + sqrt(rho) * phi;                    // formulation for the combined random effect i.e., unstructured and structured
}

model {
  Y ~ poisson_log(log_offset + alpha + X * beta + combined * sigma);     // likelihood function: multivariable Poisson ICAR regression model
  alpha ~ normal(0.0, 1.0);                                              // prior for alpha: weakly informative
  beta ~ normal(0.0, 1.0);                                               // prior for betas: weakly informative
  theta ~ normal(0.0, 1.0);                                              // prior for theta: weakly informative
  sigma ~ normal(0.0, 1.0);                                              // prior for sigma: weakly informative
  rho ~ beta(0.5, 0.5);                                                  // prior for rho: pulled from the literature

  target += -0.5 * dot_self(phi[node1] - phi[node2]);                    // prior for phi: using a conditional autoregressive distribution
  sum(phi) ~ normal(0, 0.001 * N);                                       // soft sum-to-zero constraint on phi; equivalent to mean(phi) ~ normal(0, 0.001)
}

generated quantities {
  vector[N] eta = alpha + X * beta + combined * sigma;                  // compute eta and exponentiate into mu                   
  vector[N] rr_mu = exp(eta);                                           // output the neighbourhood-specific relative risks in mu
  vector[K] rr_beta = exp(beta);                                        // output the risk ratios for each coefficient for the independent variables
  real rr_alpha = exp(alpha);                                           // output the risk ratios for the intercept
}

Make sure to save the Stan script as ICAR_model_baseline.stan in the working directory.

6.1.5 Compiling Stan code for the Baseline ICAR model

Now, let us turn our attention back to RStudio. Use the stan() function to compile the saved script and obtain the posterior estimates of the parameters from our baseline model:

# starts timer
ptm <- proc.time()

# Bayesian model
Bayesian.model.baseline = stan("ICAR_model_baseline.stan", 
    data=stan.spatial.baseline, 
    iter=10000, chains=6, verbose = FALSE, 
    control = list(max_treedepth = 12))

# stops the clock and calculates how long the model took to process with 10000 iterations across 6 chains
proc.time() - ptm

6.1.5.1 Rapid diagnostics

6.1.5.1.1 First check: rHATs are less than 1.05

Before interpreting the overall risks, as well as mapping them, we must check whether there are any parameter estimates (i.e., alpha, beta, rr_alpha, rr_beta, sigma, all phi and lp__) whose rHAT exceeds the threshold value of 1.05. An rHAT above 1.05 indicates that the iterations did not perform well, meaning the samples did not converge to a (plausible) value. We can do a rapid check of which parameters are valid, or not, by creating a binary variable of 1’s (i.e., valid if rHAT < 1.05) and 0’s otherwise (not valid).

For instance, running this code will show the summary table in full:

# print full table to avoid some rows from being omitted.
options(max.print = 100000)

# print the all results for quick diagnostics
print(Bayesian.model.baseline, 
    pars=c("alpha", "beta", "rr_alpha", "rr_beta", "rr_mu", "sigma", "phi", "lp__"), 
    probs=c(0.025, 0.975)
    )

You can eyeball the column with the rHAT values in the printed output. Or, you can use the following code to tabulate them - if there is only a frequency for the value of 1, then you are fine:

# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(
    summary(Bayesian.model.baseline, pars=c("alpha", "beta", "rr_alpha", "rr_beta", "rr_mu", "sigma", "phi", "lp__"), 
        probs=c(0.025, 0.5, 0.975))$summary
    )

# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)

# tabulate and check the rHATs is below 1.05 for all parameter estimates
table(diagnostic.checks$valid)
# if there is a frequency for `0` then the check has failed
# you must perform the MCMC sampling again with a bigger number of iterations

Everything is okay - all outputted parameters have an rHAT < 1.05.

6.1.5.1.2 Second check: Make sure that all/most samples have converged

On my computer, we computed a total of 60,000 iterations across the 6 chains (i.e., 10,000 iterations each), of which half are discarded as warm-up. Hence, we have 30,000 samples generated for each parameter. Now, we need to check for samples that diverged and ensure that they do not exceed 10% of the total.

# check across each chain for samples that diverged (should not exceed 10% of the total 30000)
d1 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6

# tells you how many samples per chain diverged (i.e., did not converge)
d1
d2
d3
d4
d5
d6

# Check whether the proportion of divergent samples is greater than 10%
(d1+d2+d3+d4+d5+d6)/30000 * 100 > 10
# if `TRUE` then the check has failed
# you must perform the MCMC sampling again with a bigger number of iterations

Everything is okay - it shows FALSE so most samples have converged. We can now proceed to the extraction of our results!

NOTES: To avoid such complications, it is always best to run about 10,000, 15,000 or more iterations with the argument control = list(max_treedepth = 12) included in the stan() function. Short runs usually yield low effective sample sizes after the warm-up samples are discarded, which in turn may push the rHAT above 1.05 or produce samples that do not converge. Notice how today we have used 10,000 iterations!
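
As an aside, recent versions of rstan also provide convenience helpers that summarise these sampler checks in a single call; a quick (optional) sketch is shown below, alongside a look at the effective sample sizes.

# optional: one-call summary of the HMC diagnostics (divergences, treedepth, energy),
# available in recent versions of rstan
check_hmc_diagnostics(Bayesian.model.baseline)

# optional: eyeball the distribution of effective sample sizes after warm-up
summary(summary(Bayesian.model.baseline)$summary[, "n_eff"])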

6.1.5.2 Extraction of results to user-friendly format

We are going to extract the results from the baseline model object. Pay close attention to this section - it shows how to extract all the necessary information and build the results into a table that can be inserted into a document.

6.1.5.2.1 Extract the global results for relative risks and exceedance probabilities

This section computes and extracts the exceedance probabilities for the baseline relative risk (the intercept) and for each independent variable, in the order Temperature, Precipitation, NDVI and Urbanisation:

# print summary table
print(Bayesian.model.baseline, 
    pars=c("rr_alpha", "rr_beta", "sigma"), 
    probs=c(0.025, 0.975)
)

# compute the exceedance probability that the intercept/baseline risk is bigger than the null value of 1
threshold.intercept <- function(x){mean(x > 1.00)}
rr_alpha.exc.probs <- Bayesian.model.baseline %>% spread_draws(rr_alpha) %>% 
    summarise(rr_alpha=threshold.intercept(rr_alpha)) %>%
    pull(rr_alpha)

# compute the exceedance probability that the risk for each independent variable is bigger than 1
threshold.coefficients <- function(x){mean(x > 1.00)}
rr_beta.exc.probs <- Bayesian.model.baseline %>% spread_draws(rr_beta[i]) %>% 
    group_by(i) %>% summarise(rr_beta=threshold.coefficients(rr_beta)) %>%
    pull(rr_beta)

# reports exceedance probabilities
rr_alpha.exc.probs
rr_beta.exc.probs

6.1.5.2.2 Building a table of results for the baseline regression model

This is good ol’ data management code to prepare the results into something that’s user-friendly:

# creating a table
# step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Baseline Risk", "Temperature", "Precipitation", "NDVI", "Urbanisation")

# step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.baseline, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)

# step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)

# step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX) using the paste() function
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")

# step 5: pull the exceedance probabilities into the result table
a <- c(rr_alpha.exc.probs, rr_beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")

# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]

# view final output and... presto! not too shabby!
final.output

The table with global results should look something like:

      variables                         beta_95CrI   Uncertainty   ess   rhat
1 Baseline Risk  1.467 (95% CrI: 0.1215 to 6.3608) Prob = 0.4597 29443 1.0001
2   Temperature 0.9307 (95% CrI: 0.8295 to 1.0414) Prob = 0.1030 10581 1.0004
3 Precipitation 1.1547 (95% CrI: 1.0353 to 1.2855) Prob = 0.9944  8654 1.0006
4          NDVI  0.929 (95% CrI: 0.6228 to 1.3344) Prob = 0.3102 14313 1.0001
5  Urbanisation 1.1793 (95% CrI: 0.5638 to 2.1582) Prob = 0.6279 13990 0.9999

Now, preserve this Global regression result in a spreadsheet:

# Now export Global table results into a .csv file
write.csv(final.output, file = "Baseline_Result_January_2017.csv", row.names = FALSE)

6.1.5.2.3 Mapping the relative risk and significance for the baseline result

We use the same approach as covered in Week 4’s computer practical (sections 5.5.3, 5.5.4 and 5.5.5).

Extract the area-specific relative risks for the neighbourhoods:

# extract key posterior results for the generated quantities
relativeRisk.results <- as.data.frame(summary(Bayesian.model.baseline, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now clean this table up
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"

Merge the results to the shapefile:

# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]

Create the significance categories:

# create categories to define if an area has a significant increase or decrease in risk, or nothing at all
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0    # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0  # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1    # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1   # SIGNIFICANT DECREASE

# use these functions to understand the distribution of these risk estimates
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)

Compute the exceedance probabilities for the areas and categorise them in bands of 10s with their map labels:

# compute the exceedance probabilities for the areas
threshold <- function(x){mean(x > 1.00)}
excProbrr <- Bayesian.model.baseline %>% spread_draws(rr_mu[i]) %>% 
    group_by(i) %>% summarise(rr_mu=threshold(rr_mu)) %>%
    pull(rr_mu)

# insert the exceedance values into the spatial data frame
spatial.data$excProb <- excProbrr

# categorising the probabilities in bands of 10s
spatial.data$ProbCat <- NA
spatial.data$ProbCat[spatial.data$excProb>=0 & spatial.data$excProb< 0.01] <- 1
spatial.data$ProbCat[spatial.data$excProb>=0.01 & spatial.data$excProb< 0.10] <- 2
spatial.data$ProbCat[spatial.data$excProb>=0.10 & spatial.data$excProb< 0.20] <- 3
spatial.data$ProbCat[spatial.data$excProb>=0.20 & spatial.data$excProb< 0.30] <- 4
spatial.data$ProbCat[spatial.data$excProb>=0.30 & spatial.data$excProb< 0.40] <- 5
spatial.data$ProbCat[spatial.data$excProb>=0.40 & spatial.data$excProb< 0.50] <- 6
spatial.data$ProbCat[spatial.data$excProb>=0.50 & spatial.data$excProb< 0.60] <- 7
spatial.data$ProbCat[spatial.data$excProb>=0.60 & spatial.data$excProb< 0.70] <- 8
spatial.data$ProbCat[spatial.data$excProb>=0.70 & spatial.data$excProb< 0.80] <- 9
spatial.data$ProbCat[spatial.data$excProb>=0.80 & spatial.data$excProb< 0.90] <- 10
spatial.data$ProbCat[spatial.data$excProb>=0.90 & spatial.data$excProb< 1.00] <- 11
spatial.data$ProbCat[spatial.data$excProb == 1.00] <- 12

# create the map labels for the probabilities
ProbCategorylist <- c("<0.01", "0.01-0.09", "0.10-0.19", "0.20-0.29", "0.30-0.39", "0.40-0.49","0.50-0.59", "0.60-0.69", "0.70-0.79", "0.80-0.89", "0.90-0.99", "1.00")

Check the labels and get a feel for their distribution - this informs how the maps will be built and whether the legend categories are balanced:

# use these functions to understand the distribution of these risk estimates
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)

# check to see if legend scheme is balanced
table(spatial.data$ProbCat)

The output should look something like:

> summary(spatial.data$rr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.7028  1.2050  1.8056  1.7339  2.3294  2.9316 
> hist(spatial.data$rr)
> table(spatial.data$Significance)

-1  0  1 
 4 10 33 
> 
> # check to see if legend scheme is balanced
> table(spatial.data$ProbCat)

 1  2  3 10 11 12 
 3  5  1  4  8 26

NOTES: The relative risk estimates range from 0.70 to 2.93. There are 4 neighbourhoods with a significantly reduced risk, 10 neighbourhoods that are not significant, and 33 neighbourhoods with a significantly increased risk. In terms of exceedance probabilities, 26 neighbourhoods are predicted to have a 100% chance of an increased risk of infestation, 8 neighbourhoods a 90-99% chance, 4 an 80-89% chance, 1 neighbourhood a 10-19% chance, and 8 neighbourhoods in total a chance below 10%.

Create the maps and save as .png format:

# rename the risk and significance columns so the baseline (January) results are preserved
# the same renaming pattern is used for the updated models later on
names(spatial.data)[names(spatial.data) == "rr"] <- "Risk_LiraA_2017_1"
names(spatial.data)[names(spatial.data) == "Significance"] <- "Sign_LiraA_2017_1"

# map of relative risk
rr_map <- tm_shape(spatial.data) + 
    tm_fill("Risk_LiraA_2017_1", style = "cont", title = "Relative Risk [RR]", palette = "OrRd") +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))

# map of significance regions
sg_map <- tm_shape(spatial.data) + 
    tm_fill("Sign_LiraA_2017_1", style = "cat", title = "Significance Categories", 
        palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))

tmap_save(rr_map, "rr_2017-1.png", height = 12)
tmap_save(sg_map, "sign_2017-1.png", height = 12)

We have completed the first part of the analysis with the baseline model. We are now going to implement the Bayesian updating with the April and July datasets. Before we do anything else, make sure to save these results to avoid losing this work. We will use this dataset later to chart the trajectory of risks and see whether certain areas have sustained risk across all survey periods.

# retain dataset for future
baseline_result <- st_drop_geometry(spatial.data)
# save initial results
write.csv(baseline_result, file = "Baseline 2017-1.csv", row.names = FALSE)

6.1.6 Apply the Bayesian updating on subsequent data sets

We are going to implement the Bayesian updating on the following data frames: Lira2017_2 and Lira2017_3. Therefore, we will need to create the subsequent datasets for the updating in the same manner as shown in section 6.1.3.2. Let’s start with Lira2017_2 and create a new list object called stan.dataset.2017_2.

y=Lira2017_2$InfestedNum
x=Lira2017_2[,c(9,10,11,12)]
e=Lira2017_2$Expected

stan.dataset.2017_2=list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), offset=e)

NOTES: The y, x and e will be overwritten.

Now, in the stan() function, instead of calling the Stan script for compilation, we will pass the previous model object Bayesian.model.baseline to the fit= argument. The new dataset will be passed to the data= argument. The updated posterior estimates for the relative risk will be stored in a new object called Bayesian.model.updated1.2017_2.

# Start the clock and time duration
ptm <- proc.time()
# Updated Bayesian model
Bayesian.model.updated1.2017_2 = stan(fit=Bayesian.model.baseline, data=stan.dataset.2017_2, iter=60000, chains=6, verbose = FALSE, control = list(max_treedepth = 12))
# Stop the clock and report duration
proc.time() - ptm

Repeat the updating process again using the dataset Lira2017_3 and the previous model Bayesian.model.updated1.2017_2. The updated posterior estimates for the relative risk will be stored in a new object called Bayesian.model.updated2.2017_3.

y=Lira2017_3$InfestedNum
x=Lira2017_3[,c(9,10,11,12)]
e=Lira2017_3$Expected

stan.dataset.2017_3=list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), offset=e)

ptm <- proc.time()
Bayesian.model.updated2.2017_3 = stan(fit=Bayesian.model.updated1.2017_2, data=stan.dataset.2017_3, iter=60000, chains=6, verbose = FALSE, control = list(max_treedepth = 12))
proc.time() - ptm

6.1.6.1 Extraction of updated results to a user-friendly format

The code for extraction is essentially the same as what’s shown in section 6.1.5.2. Here is the code for extracting the updated results from the object Bayesian.model.updated1.2017_2:

# repeat the data management - Bayesian.model.updated1.2017_2
# replace all key objects with Bayesian.model.updated1.2017_2

# diagnostics
# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars=c("alpha", "beta", "sigma", "phi"), probs=c(0.025, 0.5, 0.975))$summary)
# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)
# tabulate it
table(diagnostic.checks$valid)

# check rHAT < 1.05
# check chains for samples that diverged (should not exceed 10% of the total 180000)
d1 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6
(d1+d2+d3+d4+d5+d6)/180000 * 100

# extract the exceedance probabilities for the intercept & each beta coefficient
# note: the previous alpha.exc.probs and beta.exc.probs will be overwritten!
threshold.intercept <- function(x){mean(x > 0.00)}
alpha.exc.probs <- Bayesian.model.updated1.2017_2 %>% spread_draws(alpha) %>% 
    summarise(alpha=threshold.intercept(alpha)) %>%
    pull(alpha)

threshold.coefficients <- function(x){mean(x > 0.00)}
beta.exc.probs <- Bayesian.model.updated1.2017_2 %>% spread_draws(beta[i]) %>% 
    group_by(i) %>% summarise(beta=threshold.coefficients(beta)) %>%
    pull(beta)

# reports updated exceedance probabilities from Bayesian.model.updated1.2017_2
alpha.exc.probs
beta.exc.probs

# Step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Intercept", "Temperature", "Precipitation", "NDVI", "Urbanisation")

# Step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)

# Step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)

# Step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX)
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")

# Step 5: pull the exceedance probabilities into the result table
a <- c(alpha.exc.probs, beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")

# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]

# view final output and... presto! not too shabby!
final.output

# Now export regression results into a .csv file
# NOTE: Update this line of code and save the file with a different name
write.csv(final.output, file = "Updated_Result_April_2017.csv", row.names = FALSE)

# spatial maps of relative risks

# extraction of key updated posterior results for the generated quantities 
relativeRisk.results <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now clean this table up
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"

# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]

# create categories to define if an area has a significant increase or decrease in risk, or nothing at all
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0    # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0  # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1    # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1   # SIGNIFICANT DECREASE

summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)

# we create these columns to preserve the results of the April update
# NOTE: this code needs the '_2' suffix to show it's for April
colnames(spatial.data)[5] <- "Risk_LiraA_2017_2"
colnames(spatial.data)[8] <- "Sign_LiraA_2017_2"
spatial.data$LiraA_2017_2 <- paste(round(spatial.data$Risk_LiraA_2017_2,2), " (95% CrI:", round(spatial.data$rrlower,2), ",", round(spatial.data$rrupper,2), ")", sep = "")

campina_grande_neighbourhoods$NeighNum <- as.numeric(campina_grande_neighbourhoods$NeighNum)
# map of relative risk
# NOTE: this code needs the '_2' suffix to show it's for April
rr_map <- tm_shape(spatial.data) + 
    tm_fill("Risk_LiraA_2017_2", style = "cont", title = "Relative Risk [RR]", palette = "OrRd") +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))

# map of significance regions
# NOTE: this code needs the '_2' suffix to show it's for April
sg_map <- tm_shape(spatial.data) + 
    tm_fill("Sign_LiraA_2017_2", style = "cat", title = "Significance Categories", 
        palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# NOTE: this code needs the '-2' suffix to show it's for April
tmap_save(rr_map, "rr_2017-2.png", height = 12)
tmap_save(sg_map, "sign_2017-2.png", height = 12)

# chart the trajectory of risks to see if they remain sustained across all LIRAa surveys
# create a sink for the posterior estimates
# NOTE: this code needs the '_2' suffix to show it's for April
risk_data_2 <- st_drop_geometry(spatial.data[,c("osmID", "NeighNum", "Risk_LiraA_2017_2", "Sign_LiraA_2017_2", "LiraA_2017_2")])
# all results will be saved into this subsequent file
# NOTE: this code needs the '-2' suffix to show it's for April
write.csv(risk_data_2, file = "Updated 2017-2.csv", row.names = FALSE)

Here is the code for extracting the updated results from the object Bayesian.model.updated2.2017_3:

# again, repeat the data management for Bayesian.model.updated2.2017_3
# replace all key objects with Bayesian.model.updated2.2017_3

# diagnostics
# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars=c("alpha", "beta", "sigma", "phi"), probs=c(0.025, 0.5, 0.975))$summary)
# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)
# tabulate it
table(diagnostic.checks$valid)

# check rHAT < 1.05
# check chains for samples that diverged (should not exceed 10% of the total 180000)
d1 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6
(d1+d2+d3+d4+d5+d6)/180000 * 100

# extract the exceedance probabilities for the intercept & each beta coefficient
# note: the previous alpha.exc.probs and beta.exc.probs will be overwritten!
threshold.intercept <- function(x){mean(x > 0.00)}
alpha.exc.probs <- Bayesian.model.updated2.2017_3 %>% spread_draws(alpha) %>% 
    summarise(alpha=threshold.intercept(alpha)) %>%
    pull(alpha)

threshold.coefficients <- function(x){mean(x > 0.00)}
beta.exc.probs <- Bayesian.model.updated2.2017_3 %>% spread_draws(beta[i]) %>% 
    group_by(i) %>% summarise(beta=threshold.coefficients(beta)) %>%
    pull(beta)

# reports updated exceedance probabilities from Bayesian.model.updated2.2017_3
alpha.exc.probs
beta.exc.probs

# Step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Intercept", "Temperature", "Precipitation", "NDVI", "Urbanisation")

# Step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)

# Step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)

# Step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX)
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")

# Step 5: pull the exceedance probabilities into the result table
a <- c(alpha.exc.probs, beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")

# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]

# view final output and... presto! not too shabby!
final.output

# Now export regression results into a .csv file
# NOTE: Update this line of code and save the file with a different name
write.csv(final.output, file = "Updated_Result_July_2017.csv", row.names = FALSE)

# spatial maps of relative risks

# extraction of key updated posterior results for the generated quantities 
relativeRisk.results <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now clean this table up
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"

# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]

# create categories to define if an area has a significant increase or decrease in risk, or nothing at all
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0    # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0  # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1    # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1   # SIGNIFICANT DECREASE

summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)

# we create these columns to preserve the results of the July update
# NOTE: this code needs the '_3' suffix to show it's for July
colnames(spatial.data)[5] <- "Risk_LiraA_2017_3"
colnames(spatial.data)[8] <- "Sign_LiraA_2017_3"
spatial.data$LiraA_2017_3 <- paste(round(spatial.data$Risk_LiraA_2017_3,2), " (95% CrI:", round(spatial.data$rrlower,2), ",", round(spatial.data$rrupper,2), ")", sep = "")

campina_grande_neighbourhoods$NeighNum <- as.numeric(campina_grande_neighbourhoods$NeighNum)
# map of relative risk
# NOTE: this code needs the '_3' suffix to show it's for July
rr_map <- tm_shape(spatial.data) + 
    tm_fill("Risk_LiraA_2017_3", style = "cont", title = "Relative Risk [RR]", palette = "OrRd") +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))

# map of significance regions
# NOTE: this code needs the '_3' suffix to show it's for July
sg_map <- tm_shape(spatial.data) + 
    tm_fill("Sign_LiraA_2017_3", style = "cat", title = "Significance Categories", 
        palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
    tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
    tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
    tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# NOTE: this code needs the '-3' suffix to show it's for July
tmap_save(rr_map, "rr_2017-3.png", height = 12)
tmap_save(sg_map, "sign_2017-3.png", height = 12)

# chart the trajectory of risks to see if they remain sustained across all LIRAa surveys
# create a sink for the posterior estimates
# NOTE: this code needs the '_3' suffix to show it's for July
risk_data_3 <- st_drop_geometry(spatial.data[,c("osmID", "NeighNum", "Risk_LiraA_2017_3", "Sign_LiraA_2017_3", "LiraA_2017_3")])
# all results will be saved into this subsequent file
# NOTE: this code needs the '-3' suffix to show it's for July
write.csv(risk_data_3, file = "Updated 2017-3.csv", row.names = FALSE)

6.1.7 List of key outputs

  1. The regression results with baseline and updated relative risks stored in 3 separate tables in the following files:
  • January (Baseline): Baseline_Result_January_2017.csv
  • April (Updated): Updated_Result_April_2017.csv
  • July (Updated): Updated_Result_July_2017.csv
  2. 3 maps that show relative risks and 3 maps that show significant neighbourhoods. See the following files:
  • January (Baseline): rr_2017-1.png and sign_2017-1.png
  • April (Updated): rr_2017-2.png and sign_2017-2.png
  • July (Updated): rr_2017-3.png and sign_2017-3.png
  3. The relative risk and significance categories from the baseline and updated models stored in 3 separate tables. These are for mapping and charting the risk trajectories. See the following files:
  • January (Baseline): Baseline 2017-1.csv
  • April (Updated): Updated 2017-2.csv
  • July (Updated): Updated 2017-3.csv

IMPORTANT: Note that these tables are used to create the risk trajectories shown in Week 9’s lecture on slide 39. The columns NeighNum, Sign_LiraA_2017_1, Sign_LiraA_2017_2 and Sign_LiraA_2017_3 are arranged in a separate Excel spreadsheet and coloured manually: blue for -1 (significantly lower risk), grey for 0 (not significant) and red for 1 (significantly higher risk).
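
If you prefer to assemble that trajectory table in R rather than by hand in Excel, a minimal sketch is shown below. It assumes the three CSV files were saved with the names listed above and that NeighNum is the shared identifier; the output file name is just a suggestion.

# optional sketch: build the risk-trajectory table in R instead of Excel
baseline <- read.csv("Baseline 2017-1.csv")
april    <- read.csv("Updated 2017-2.csv")
july     <- read.csv("Updated 2017-3.csv")

trajectory <- baseline[, c("NeighNum", "Sign_LiraA_2017_1")] %>%
    left_join(april[, c("NeighNum", "Sign_LiraA_2017_2")], by = "NeighNum") %>%
    left_join(july[, c("NeighNum", "Sign_LiraA_2017_3")], by = "NeighNum")

head(trajectory)
write.csv(trajectory, file = "Risk_trajectories_2017.csv", row.names = FALSE)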

6.2 Finished!

And that’s all there is to it. That’s how we conduct Bayesian inference - I hope you’ve enjoyed learning this subject. Well done for getting through these practicals! We know they were long and gruelling, but you have done well to learn this branch of statistics in Advanced Topics in Social & Geographic Data Sciences. As a token of appreciation, we end by expressing our compliments through this iconic scene from one of our all-time great anime, Neon Genesis Evangelion.