6 Bayesian Updating & Spatial Temporal Modelling
6.1 Introduction
This week’s practical will show you how to perform Bayesian updating. This approach is excellent for analysing new information that arrives in separate streams, either on a ‘regular’ or ‘irregular’ basis, and it is highly suitable within a repeated cross-sectional analytical framework. Bayesian updating is a statistical method for revising the estimated probabilities of events observed previously, based on new evidence (i.e., new data).
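In notation, the idea is simply that the posterior from the previous survey becomes the prior for the next one (a standard statement of Bayes’ rule, included here for orientation):

$$
p(\theta \mid y_{\text{new}}, y_{\text{old}}) \;\propto\; p(y_{\text{new}} \mid \theta)\, p(\theta \mid y_{\text{old}})
$$

where $p(\theta \mid y_{\text{old}})$ is the posterior estimated from the earlier data stream and $p(y_{\text{new}} \mid \theta)$ is the likelihood of the newly received data.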
In this section, we will apply this technique to mosquito surveillance data in order to:
- update the risk profile of neighbourhoods;
- see how the risk parameters evolve in the face of new data;
- demonstrate the process of updating and the data management that goes with it.
6.1.1 Loading and installing packages
Now, let’s load all the packages needed specifically for this computer practical:
# Load the packages with library()
library("sf")
library("tmap")
library("spdep")
library("rstan")
library("geostan")
library("SpatialEpi")
library("tidybayes")
library("tidyverse")
As usual, upon loading the rstan
package, it is highly recommended to run this code to configure the settings in RStudio:
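# run the chains in parallel across your computer's cores and avoid unnecessary recompilation
# (a standard rstan configuration; the exact snippet is assumed here as it is not shown above)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)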
6.1.2 Datasets & setting up the work directory
Set your work directory to Week 5’s folder. For example, for Windows users, the code for setting the work directory will be:
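# set the working directory on Windows (hypothetical path - replace it with your own)
setwd("C:/Users/YourName/Documents/Week 5")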
For MAC, the code for setting the work directory will be:
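# set the working directory on a Mac (hypothetical path - replace it with your own)
setwd("/Users/YourName/Documents/Week 5")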
The datasets for this section of the practical are:
Lira_January_2017.csv
Lira_April_2017.csv
Lira_July_2017.csv
Campina_Grande_Lira2017.shp
The surveillance and environmental information are contained in the monthly LIRA files. The spatial configuration is the shapefile for Campina Grande (Brazil).
- The Campina Grande shapefile contains 47 neighbourhoods. The unique identifier, neighbourhood number and neighbourhood name are defined under the columns osmID, NeighbNum and NeighbNam, respectively.
- The January, April and July LIRA 2017 datasets contain the columns osmID, NeighbNum and NeighbNam to link with the shapefile. InfestedNum is the dependent variable and represents the total number of households identified as infested with mosquito breeding habitats in a neighbourhood. The column Total is the total number of households in a neighbourhood; this denominator serves as the reference population and is used to compute the expected number of infested households, which in turn is used as an offset in the Bayesian model.
- Lastly, Temperature, Precipitation, NDVI and Urbanisation are the independent variables that will be modelled as continuous variables against the number of infested households.
Let us load these datasets into memory:
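# load the LIRA surveillance datasets (object names are assumed to match those used later in this practical)
Lira2017_1 <- read_csv("Lira_January_2017.csv")
Lira2017_2 <- read_csv("Lira_April_2017.csv")
Lira2017_3 <- read_csv("Lira_July_2017.csv")
# load the neighbourhood shapefile for Campina Grande
campina_grande_neighbourhoods <- st_read("Campina_Grande_Lira2017.shp")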
6.1.3 Data preparation in RStudio
6.1.3.1 Converting the spatial adjacency matrix to nodes & edges
We will need to derive a list of nodes and edges, because Stan can only identify a spatial adjacency structure as a set of paired nodes with the edges that connect them. For instance, node1 represents the index area and node2 is the list of neighbouring areas connected to the index region defined in node1 (see section 5.3.2, Week 4).
We can perform this spatial operation by first coercing the sf simple features object (i.e., campina_grande_neighbourhoods) to a spatial sp object.
Here is the code:
# coerced to a "sp" object
campina_grande_neighb_spobject <- as(campina_grande_neighbourhoods, "Spatial")
Now, we are going to derive the nodes and edges from campina_grande_neighb_spobject using the shape2mat() function, which converts it into a matrix object. From the matrix object, we can prepare the data for the spatial ICAR model using the prep_icar_data() function. Here is the code:
# coerced into a "matrix" object
adjacencyMatrix <- shape2mat(campina_grande_neighb_spobject)
# we will need to extract the components for the ICAR model when setting up the data set for Stan
extractComponents <- prep_icar_data(adjacencyMatrix)
From the extractComponents
object, we will need to extract the following contents:
- $group_size: the number of areal units under observation listed in the shapefile (this should match the LIRA dataset);
- $node1: the index neighbourhoods of interest;
- $node2: the other neighbourhoods that are connected to each index neighbourhood listed in node1;
- $n_edges: the number of edges, which expresses the adjacency structure as a network showing which area is connected to which neighbourhood. It is still a queen contiguity adjacency matrix, but represented as a network, since Stan can only use this structure.
Here is the code for performing the extraction:
n <- as.numeric(extractComponents$group_size)
nod1 <- extractComponents$node1
nod2 <- extractComponents$node2
n_edges <- as.numeric(extractComponents$n_edges)
Note that the information needed for the Stan compilation is stored in n
, nod1
, nod2
and n_edges
.
6.1.3.2 Creating the baseline set of data for Stan
In the next step, we will need to prepare the data as a list object and define the dependent and independent variables to be compiled in Stan. The outcome InfestedNum is going to be passed as y. The set of independent variables, i.e., Temperature, Precipitation, NDVI and Urbanisation, is going to be passed as an object called x; K is going to contain the number of independent variables, i.e., 4; and the offset variable Expected will be extracted as e.
Here is the code:
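# define the outcome, independent variables and offset for the baseline (January) data
# (a sketch assuming the same object name and column positions used for the April and July datasets later on)
y <- Lira2017_1$InfestedNum
x <- Lira2017_1[,c(9,10,11,12)]
e <- Lira2017_1$Expected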
Now, we create our dataset for Stan:
# put all components into a list object
stan.spatial.baseline <- list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), off_set=e)
The above information is going to be passed to the data block
in Stan for compilation.
6.1.4 Full Stan script for the Spatial ICAR model
We will use the full script below for executing the spatial intrinsic conditional auto-regressive (ICAR) model. This will be performed on the baseline dataset stan.spatial.baseline.
Please open a new Stan file and use the following code:
// Stan script: Spatial ICAR Poisson model
data {
int<lower=0> N; // number of spatial units or neighbourhoods in Campina Grande
int<lower=0> N_edges; // number of edges connecting adjacent areas using Queens contiguity
array[N_edges] int<lower=1, upper=N> node1; // list of index areas showing which spatial units are neighbours
array[N_edges] int<lower=1, upper=N> node2; // list of neighbouring areas showing the connection to index spatial unit
array[N] int<lower=0> Y; // dependent variable i.e., number of households infested
vector<lower=1>[N] off_set; // expected number of infested households per neighbourhood, used as the offset variable
int<lower=1> K; // number of independent variables i.e., K = 4
matrix[N, K] X; // independent variables in matrix form i.e., temp, prec, ndvi, and urban
}
transformed data {
vector[N] log_offset = log(off_set);
}
parameters {
real alpha; // intercept
vector[K] beta; // covariates
real<lower=0> sigma; // overall standard deviation
real<lower=0, upper=1> rho; // proportion unstructured vs. spatially structured variance
vector[N] theta; // unstructured random effects (heterogeneous)
vector[N] phi; // spatial random effects
}
transformed parameters {
vector[N] combined; // combined random effect, i.e., unstructured and structured
combined = sqrt(1 - rho) * theta + sqrt(rho) * phi; // formulation for the combined random effect, i.e., unstructured and structured
}
model {
Y ~ poisson_log(log_offset + alpha + X * beta + combined * sigma); // likelihood function: multivariable Poisson ICAR regression model
alpha ~ normal(0.0, 1.0); // prior for alpha: weakly informative
beta ~ normal(0.0, 1.0); // prior for betas: weakly informative
theta ~ normal(0.0, 1.0); // prior for theta: weakly informative
sigma ~ normal(0.0, 1.0); // prior for sigma: weakly informative
rho ~ beta(0.5, 0.5); // prior for rho: pulled from the literature
target += -0.5 * dot_self(phi[node1] - phi[node2]); // prior for phi: using a conditional autoregressive distribution
sum(phi) ~ normal(0, 0.001 * N); // soft sum-to-zero constraint on phi, and its the equivalent to mean(phi) ~ normal(0,0.001)
}
generated quantities {
vector[N] eta = alpha + X * beta + combined * sigma; // compute eta and exponentiate into mu
vector[N] rr_mu = exp(eta); // output the neighbourhood-specific relative risks in mu
vector[K] rr_beta = exp(beta); // output the risk ratios for each coefficient for the independent variables
real rr_alpha = exp(alpha); // output the risk ratios for the intercept
}
Make sure to save the Stan script as ICAR_model_baseline.stan
in the working directory.
6.1.5 Compiling Stan code for the Baseline ICAR model
Now, let us turn our attention back to RStudio. Use the stan()
function to compile the saved script and obtain the posterior estimates of the parameters from our baseline model:
# starts timer
ptm <- proc.time()
# Bayesian model
Bayesian.model.baseline = stan("ICAR_model_baseline.stan",
data=stan.spatial.baseline,
iter=10000, chains=6, verbose = FALSE,
control = list(max_treedepth = 12))
# stops and calculates how long the model took to process with 10000 iterations on 6 cores
proc.time() - ptm
6.1.5.1 Rapid diagnostics
6.1.5.1.1 First check: rHATs are less than 1.05
Before interpreting the overall risks, as well as mapping them, we must check whether any parameter estimates, i.e., alpha, beta, rr_alpha, rr_beta, sigma, all phi and lp__, have an rHAT that exceeds the threshold value of 1.05. An rHAT above 1.05 indicates that the iterations did not perform well, meaning that the samples did not converge to a (plausible) value. We can do a rapid check to see which parameters are valid, or not, by creating a binary variable of 1’s (i.e., valid if rHAT < 1.05), otherwise 0’s (not valid).
For instance, running this code will show the summary table in full:
# print full table to avoid some rows from being omitted.
options(max.print = 100000)
# print the all results for quick diagnostics
print(Bayesian.model.baseline,
pars=c("alpha", "beta", "rr_alpha", "rr_beta", "rr_mu", "sigma", "phi", "lp__"),
probs=c(0.025, 0.975)
)
You can eyeball the column with the rHAT values in the printed output. Alternatively, you can use the following code to tabulate them; if the table only shows a frequency for the value 1, then you are fine:
# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(
summary(Bayesian.model.baseline, pars=c("alpha", "beta", "rr_alpha", "rr_beta", "rr_mu", "sigma", "phi", "lp__"),
probs=c(0.025, 0.5, 0.975))$summary
)
# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)
# tabulate and check the rHATs is below 1.05 for all parameter estimates
table(diagnostic.checks$valid)
# if there is a frequency for `0` then the test has failed
# you must perform the MCMC sampling again with a bigger number of iterations
Everything is okay - all outputted parameters have an rHAT < 1.05
.
6.1.5.1.2 Second check: Make sure that all/most samples have converged
On my computer, we generated a total of 60,000 iterations across the 6 chains (i.e., 10,000 iterations each), of which half are discarded as warm-up. Hence, we have 30,000 samples retained for each parameter. Now, we need to check for samples that diverged and ensure that they do not exceed 10% of the total.
# check across each chains for samples that have diverged (should not exceed 10% of the total 30000)
d1 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.baseline, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6
# tells you how many samples per chain diverged (i.e., did not converge)
d1
d2
d3
d4
d5
d6
# check whether the proportion of divergent samples is greater than 10%
(d1+d2+d3+d4+d5+d6)/30000 * 100 > 10
# if `TRUE` then the test has failed
# you must perform the MCMC sampling again with a bigger number of iterations
Everything is okay - it shows FALSE so most samples have converged. We can now proceed to the extraction of our results!
NOTES: To avoid such complications, it is always best to run about 10,000, 15,000 or more iterations with the argument control = list(max_treedepth = 12) included in the stan() function. Shorter runs usually yield low effective sample sizes once the warm-up samples are discarded, which in turn may push the rHAT above 1.05 or leave samples that do not converge. Notice how today we used 10,000 iterations!
6.1.5.2 Extraction of results to user-friendly format
We are going to extract the results from the baseline model object. Pay close attention to this section - it shows how to extract all the necessary information and build the results into a table that can be inserted into a document.
6.1.5.2.1 Extract the global results for relative risks and exceedence probabilities
This section computes and extracts the exceedance probabilities for the baseline relative risk (the intercept) and for each independent variable, in the order of Temperature, Precipitation, NDVI and Urbanisation:
# print summary table
print(Bayesian.model.baseline,
pars=c("rr_alpha", "rr_beta", "sigma"),
probs=c(0.025, 0.975)
)
# compute the exceedance probability that the intercept/baseline risk is bigger than the null value of 1
threshold.intercept <- function(x){mean(x > 1.00)}
rr_alpha.exc.probs <- Bayesian.model.baseline %>% spread_draws(rr_alpha) %>%
summarise(rr_alpha=threshold.intercept(rr_alpha)) %>%
pull(rr_alpha)
# compute the exceedance probability that the risk for each of the independent variables is bigger than 1
threshold.coefficients <- function(x){mean(x > 1.00)}
rr_beta.exc.probs <- Bayesian.model.baseline %>% spread_draws(rr_beta[i]) %>%
group_by(i) %>% summarise(rr_beta=threshold.coefficients(rr_beta)) %>%
pull(rr_beta)
# reports exceedance probabilities
rr_alpha.exc.probs
rr_beta.exc.probs
6.1.5.2.2 Building a table of results for the baseline regression model
This is good ol’ data management code to prepare the results into something that’s user-friendly:
# creating a table
# step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Baseline Risk", "Temperature", "Precipitation", "NDVI", "Urbanisation")
# step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.baseline, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)
# step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)
# step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX) using the paste() function
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")
# step 5: pull the exceedance probabilities into the result table
a <- c(rr_alpha.exc.probs, rr_beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")
# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]
# view final output and... presto! not too shabby!
final.output
The table with global results should look something like:
variables beta_95CrI Uncertainty ess rhat
1 Baseline Risk 1.467 (95% CrI: 0.1215 to 6.3608) Prob = 0.4597 29443 1.0001
2 Temperature 0.9307 (95% CrI: 0.8295 to 1.0414) Prob = 0.1030 10581 1.0004
3 Precipitation 1.1547 (95% CrI: 1.0353 to 1.2855) Prob = 0.9944 8654 1.0006
4 NDVI 0.929 (95% CrI: 0.6228 to 1.3344) Prob = 0.3102 14313 1.0001
5 Urbanisation 1.1793 (95% CrI: 0.5638 to 2.1582) Prob = 0.6279 13990 0.9999
Now, preserve this Global regression result in a spreadsheet:
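# export the baseline regression results into a .csv file
# (file name assumed to match the key outputs listed in section 6.1.7)
write.csv(final.output, file = "Baseline_Result_January_2017.csv", row.names = FALSE)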
6.1.5.2.3 Mapping the relative risk and significance for the baseline result
Same approach covered in Week 4’s computer practical (sections 5.5.3, 5.5.4 and 5.5.5).
Extract the area-specific relative risks for the neighbourhoods:
# extraction of key posterior results for the generated quantities
relativeRisk.results <- as.data.frame(summary(Bayesian.model.baseline, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now cleaning up this table up
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"
Merge the results to the shapefile:
# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]
Create the significance categories:
# create categories to define if an area has a significant increase or decrease in risk, or neither
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1 # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1 # SIGNIFICANT DECREASE
# use these functions to understand the distribution of these risk estimates
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)
Compute the exceedance probabilities for the areas and categorise them in bands of 10s with their map labels:
# compute the exceedance probabilities for the areas
threshold <- function(x){mean(x > 1.00)}
excProbrr <- Bayesian.model.baseline %>% spread_draws(rr_mu[i]) %>%
group_by(i) %>% summarise(rr_mu=threshold(rr_mu)) %>%
pull(rr_mu)
# insert the exceedance values into the spatial data frame
spatial.data$excProb <- excProbrr
# categorising the probabilities in bands of 10s
spatial.data$ProbCat <- NA
spatial.data$ProbCat[spatial.data$excProb>=0 & spatial.data$excProb< 0.01] <- 1
spatial.data$ProbCat[spatial.data$excProb>=0.01 & spatial.data$excProb< 0.10] <- 2
spatial.data$ProbCat[spatial.data$excProb>=0.10 & spatial.data$excProb< 0.20] <- 3
spatial.data$ProbCat[spatial.data$excProb>=0.20 & spatial.data$excProb< 0.30] <- 4
spatial.data$ProbCat[spatial.data$excProb>=0.30 & spatial.data$excProb< 0.40] <- 5
spatial.data$ProbCat[spatial.data$excProb>=0.40 & spatial.data$excProb< 0.50] <- 6
spatial.data$ProbCat[spatial.data$excProb>=0.50 & spatial.data$excProb< 0.60] <- 7
spatial.data$ProbCat[spatial.data$excProb>=0.60 & spatial.data$excProb< 0.70] <- 8
spatial.data$ProbCat[spatial.data$excProb>=0.70 & spatial.data$excProb< 0.80] <- 9
spatial.data$ProbCat[spatial.data$excProb>=0.80 & spatial.data$excProb< 0.90] <- 10
spatial.data$ProbCat[spatial.data$excProb>=0.90 & spatial.data$excProb< 1.00] <- 11
spatial.data$ProbCat[spatial.data$excProb == 1.00] <- 12
# create the map labels for the probabilities
ProbCategorylist <- c("<0.01", "0.01-0.09", "0.10-0.19", "0.20-0.29", "0.30-0.39", "0.40-0.49","0.50-0.59", "0.60-0.69", "0.70-0.79", "0.80-0.89", "0.90-0.99", "1.00")
Checking the labels and getting a feel for their distribution - this informs how I will build my maps and whether there is balance in the labels, etc.:
# use these functions to understand the distribution of these risk estimates
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)
# check to see if legend scheme is balanced
table(spatial.data$ProbCat)
> summary(spatial.data$rr)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.7028 1.2050 1.8056 1.7339 2.3294 2.9316
> hist(spatial.data$rr)
> table(spatial.data$Significance)
-1 0 1
4 10 33
>
> # check to see if legend scheme is balanced
> table(spatial.data$ProbCat)
1 2 3 10 11 12
3 5 1 4 8 26
NOTES: The relative risk estimates range from 0.70 to 2.93. There are 4 neighbourhoods with a significantly reduced risk, about 10 neighbourhoods that are not significant, and 33 neighbourhoods with a significantly increased risk. In terms of exceedance probabilities, 26 neighbourhoods are predicted to have a 100% chance of an increased risk of infestation, 8 neighbourhoods a 90-99% chance, 4 an 80-89% chance, 1 neighbourhood a 10-19% chance, and 8 neighbourhoods in total a <1-9% chance of infestation.
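The map code below refers to the columns Risk_LiraA_2017_1 and Sign_LiraA_2017_1, so first preserve the baseline results under those names; this is a sketch that mirrors the April code in section 6.1.6.1 and assumes the same column positions:
# preserve the baseline (January) risk and significance results under dated column names (assumed positions, as in the April code)
colnames(spatial.data)[5] <- "Risk_LiraA_2017_1"
colnames(spatial.data)[8] <- "Sign_LiraA_2017_1"
spatial.data$LiraA_2017_1 <- paste(round(spatial.data$Risk_LiraA_2017_1,2), " (95% CrI:", round(spatial.data$rrlower,2), ",", round(spatial.data$rrupper,2), ")", sep = "")
# make sure the neighbourhood numbers are numeric so they can be used as map labels
campina_grande_neighbourhoods$NeighNum <- as.numeric(campina_grande_neighbourhoods$NeighNum)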
Create the maps and save as .png format:
# map of relative risk
rr_map <- tm_shape(spatial.data) +
tm_fill("Risk_LiraA_2017_1", style = "cont", title = "Relavtive Risk [RR]", palette = "OrRd") +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# map of significance regions
sg_map <- tm_shape(spatial.data) +
tm_fill("Sign_LiraA_2017_1", style = "cat", title = "Significance Categories",
palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
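# exceedance probability map (an assumed sketch - the original snippet is not shown - so that ep_map exists for tmap_save() below)
# convert the probability categories to a labelled factor for mapping
spatial.data$ProbCat <- factor(spatial.data$ProbCat, levels = 1:12, labels = ProbCategorylist)
ep_map <- tm_shape(spatial.data) +
tm_fill("ProbCat", style = "cat", title = "Pr(RR > 1)", palette = "GnBu") +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))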
tmap_save(rr_map, "rr_2017-1.png", height = 12)
tmap_save(sg_map, "sign_2017-1.png", height = 12)
tmap_save(ep_map, "ep_2017-1.png", height = 12)
We have completed the first part of the analysis with the baseline model. We are now going to implement the Bayesian updating with the April and July datasets. Before we do anything else, make sure to save these results to avoid losing this work. We will use these datasets later to chart the trajectory of risks and see whether certain areas have sustained risk across all survey periods.
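To preserve the baseline risk categories for that chart, here is a sketch mirroring the April and July sinks in section 6.1.6.1 (the NeighNum column name and the renamed risk columns are assumed to be present):
# create a sink for the baseline (January) posterior estimates
risk_data_1 <- st_drop_geometry(spatial.data[,c("osmID", "NeighNum", "Risk_LiraA_2017_1", "Sign_LiraA_2017_1", "LiraA_2017_1")])
# write the baseline results to file (see the key outputs in section 6.1.7)
write.csv(risk_data_1, file = "Baseline 2017-1.csv", row.names = FALSE)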
6.1.6 Apply the Bayesian updating on subsequent data sets
We are going to implement the Bayesian updating on the following data frames: Lira2017_2 and Lira2017_3. Therefore, we will need to create the subsequent datasets for the updating in the same manner as shown in section 6.1.3.2. Let’s start with Lira2017_2 and create a new list object called stan.dataset.2017_2.
y=Lira2017_2$InfestedNum
x=Lira2017_2[,c(9,10,11,12)]
e=Lira2017_2$Expected
stan.dataset.2017_2=list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), off_set=e)
NOTES: The y
, x
and e
will be overwritten.
Now, in the stan() function, instead of calling the Stan script for compilation, we will supply the previous model object Bayesian.model.baseline to the fit= argument. The new dataset will be passed to the data= argument. The updated posterior estimates for the relative risks will be stored in a new object called Bayesian.model.updated1.2017_2.
# Start the clock and time duration
ptm <- proc.time()
# Updated Bayesian model
Bayesian.model.updated1.2017_2 = stan(fit=Bayesian.model.baseline, data=stan.dataset.2017_2, iter=60000, chains=6, verbose = FALSE, control = list(max_treedepth = 12))
# Stop the clock and report duration
proc.time() - ptm
Repeat the updating process again using the dataset Lira2017_3 and the previous model Bayesian.model.updated1.2017_2. The updated posterior estimates for the relative risks will be stored in a new object called Bayesian.model.updated2.2017_3.
y=Lira2017_3$InfestedNum
x=Lira2017_3[,c(9,10,11,12)]
e=Lira2017_3$Expected
stan.dataset.2017_3=list(N=n, N_edges=n_edges, node1=nod1, node2=nod2, Y=y, X=x, K=length(x), off_set=e)
ptm <- proc.time()
Bayesian.model.updated2.2017_3 = stan(fit=Bayesian.model.updated1.2017_2, data=stan.dataset.2017_3, iter=60000, chains=6, verbose = FALSE, control = list(max_treedepth = 12))
proc.time() - ptm
6.1.6.1 Extraction of updated results to a user-friendly format
The code for extraction is essentially the same as what’s shown in section 6.1.5.2. Here is the code for extracting the updated results from the object Bayesian.model.updated1.2017_2:
# repeat the data management - Bayesian.model.updated1.2017_2
# replace all key objects with Bayesian.model.updated1.2017_2
# diagnostics
# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars=c("alpha", "beta", "sigma", "phi"), probs=c(0.025, 0.5, 0.975))$summary)
# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)
# tabulate it
table(diagnostic.checks$valid)
# check rHAT < 1.05
# check chains for samples that diverged (should not exceed 10% of the total 180000)
d1 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.updated1.2017_2, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6
(d1+d2+d3+d4+d5+d6)/180000 * 100
# extract the exceedance probabilities for the intercept & each beta coefficient
# note: the previous alpha.exc.probs and beta.exc.probs will be overwritten!
threshold.intercept <- function(x){mean(x > 0.00)}
alpha.exc.probs <- Bayesian.model.updated1.2017_2 %>% spread_draws(alpha) %>%
summarise(alpha=threshold.intercept(alpha)) %>%
pull(alpha)
threshold.coefficients <- function(x){mean(x > 0.00)}
beta.exc.probs <- Bayesian.model.updated1.2017_2 %>% spread_draws(beta[i]) %>%
group_by(i) %>% summarise(beta=threshold.coefficients(beta)) %>%
pull(beta)
# reports updated exceedance probabilities from Bayesian.model.updated1.2017_2
alpha.exc.probs
beta.exc.probs
# Step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Intercept", "Temperature", "Precipitation", "NDVI", "Urbanisation")
# Step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)
# Step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)
# Step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX)
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")
# Step 5: pull the exceedance probabilities into the result table
a <- c(alpha.exc.probs, beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")
# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]
# view final output and... presto! not too shabby!
final.output
# Now export regression results into a .csv file
# NOTE: Update this line of code and save the file with a different name
write.csv(final.output, file = "Updated_Result_April_2017.csv", row.names = FALSE)
# spatial maps of relative risks
# extraction of key updated posterior results for the generated quantities
relativeRisk.results <- as.data.frame(summary(Bayesian.model.updated1.2017_2, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now cleaning up this table
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"
# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]
# create categories to define if an area has a significant increase or decrease in risk, or neither
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1 # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1 # SIGNIFICANT DECREASE
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)
# we create these columns to preserve the results for the April update
# NOTE: This code needs to change '_2' to show it's for April
colnames(spatial.data)[5] <- "Risk_LiraA_2017_2"
colnames(spatial.data)[8] <- "Sign_LiraA_2017_2"
spatial.data$LiraA_2017_2 <- paste(round(spatial.data$Risk_LiraA_2017_2,2), " (95% CrI:", round(spatial.data$rrlower,2), ",", round(spatial.data$rrupper,2), ")", sep = "")
campina_grande_neighbourhoods$NeighNum <- as.numeric(campina_grande_neighbourhoods$NeighNum)
# map of relative risk
# NOTE: This code needs to change '_2' to show it's for April
rr_map <- tm_shape(spatial.data) +
tm_fill("Risk_LiraA_2017_2", style = "cont", title = "Relavtive Risk [RR]", palette = "OrRd") +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# map of significance regions
# NOTE: This code needs to change '_2' to show it's for April
sg_map <- tm_shape(spatial.data) +
tm_fill("Sign_LiraA_2017_2", style = "cat", title = "Significance Categories",
palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# NOTE: This code needs to change '-2' to show it's for April
tmap_save(rr_map, "rr_2017-2.png", height = 12)
tmap_save(sg_map, "sign_2017-2.png", height = 12)
# chart the trajectory of risks to see if they remain sustained across all LIRAa surveys
# create a sink for data posterior estimates and build
# NOTE: This code needs to change '_2' to show it's for April
risk_data_2 <- st_drop_geometry(spatial.data[,c("osmID", "NeighNum", "Risk_LiraA_2017_2", "Sign_LiraA_2017_2", "LiraA_2017_2")])
# all results will be written to this subsequent file
# NOTE: This code needs to change '-2' to show it's for April
write.csv(risk_data_2, file = "Updated 2017-2.csv", row.names = FALSE)
Here is the code for extracting the updated results from the object Bayesian.model.updated2.2017_3
:
# again, repeat the data management for Bayesian.model.updated2.2017_3
# replace all key objects with Bayesian.model.updated2.2017_3
# diagnostics
# diagnostic check on the rHats - put everything into a data frame
diagnostic.checks <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars=c("alpha", "beta", "sigma", "phi"), probs=c(0.025, 0.5, 0.975))$summary)
# create binary variable
diagnostic.checks$valid <- ifelse(diagnostic.checks$Rhat < 1.05, 1, 0)
# tabulate it
table(diagnostic.checks$valid)
# check rHAT < 1.05
# check chains for samples that diverged (should not exceed 10% of the total 180000)
d1 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[1]][, 'divergent__'] %>% sum() # chain 1
d2 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[2]][, 'divergent__'] %>% sum() # chain 2
d3 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[3]][, 'divergent__'] %>% sum() # chain 3
d4 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[4]][, 'divergent__'] %>% sum() # chain 4
d5 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[5]][, 'divergent__'] %>% sum() # chain 5
d6 <- get_sampler_params(Bayesian.model.updated2.2017_3, inc_warmup = F)[[6]][, 'divergent__'] %>% sum() # chain 6
(d1+d2+d3+d4+d5+d6)/180000 * 100
# extract the exceedance probabilities for the intercept & each beta coefficient
# note: the previous alpha.exc.probs and beta.exc.probs will be overwritten!
threshold.intercept <- function(x){mean(x > 0.00)}
alpha.exc.probs <- Bayesian.model.updated2.2017_3 %>% spread_draws(alpha) %>%
summarise(alpha=threshold.intercept(alpha)) %>%
pull(alpha)
threshold.coefficients <- function(x){mean(x > 0.00)}
beta.exc.probs <- Bayesian.model.updated2.2017_3 %>% spread_draws(beta[i]) %>%
group_by(i) %>% summarise(beta=threshold.coefficients(beta)) %>%
pull(beta)
# reports updated exceedance probabilities from Bayesian.model.updated2.2017_3
alpha.exc.probs
beta.exc.probs
# Step 1: create the names in the appropriate order as they were modelled in Stan
names <- c("Intercept", "Temperature", "Precipitation", "NDVI", "Urbanisation")
# Step 2: now, pull all global relative risk results into the object called 'results'
results <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars = c("rr_alpha", "rr_beta"), probs = c(0.025, 0.975))$summary)
results$variables <- names
row.names(results) <- 1:nrow(results)
# Step 3: fudge with the results to make it look clean
results <- results[,c(8, 1, 4, 5, 6, 7)]
results$mean <- round(results$mean, 4)
colnames(results)[2] <- "coefficient"
colnames(results)[3] <- "lower95"
colnames(results)[4] <- "upper95"
colnames(results)[5] <- "ess"
colnames(results)[6] <- "rhat"
results$lower95<- round(results$lower95, 4)
results$upper95 <- round(results$upper95, 4)
results$ess <- round(results$ess, 0)
results$rhat <- round(results$rhat, 4)
# Step 4: stitch the credibility results as text so it reads as (95% CrI: XXX to XXX)
results$beta_95CrI <- paste(results$coefficient, " (95% CrI: ", results$lower95, " to ", results$upper95, ")", sep = "")
# Step 5: pull the exceedance probabilities into the result table
a <- c(alpha.exc.probs, beta.exc.probs)
results$ExceedanceProb <- round(a, 4)
results$Uncertainty <- paste("Prob = ", results$ExceedanceProb, sep = "")
# Step 6: finally, more fudging around with the results to make it look cleaner and done!
final.output <- results[,c(1, 7, 9, 5, 6)]
# view final output and... presto! not too shabby!
final.output
# Now export regression results into a .csv file
# NOTE: Update this line of code and save the file with a different name
write.csv(final.output, file = "Updated_Result_July_2017.csv", row.names = FALSE)
# spatial maps of relative risks
# extraction of key updated posterior results for the generated quantities
relativeRisk.results <- as.data.frame(summary(Bayesian.model.updated2.2017_3, pars=c("rr_mu"), probs=c(0.025, 0.975))$summary)
# now cleaning up this table
# first, insert clean row numbers to new data frame
row.names(relativeRisk.results) <- 1:nrow(relativeRisk.results)
# second, rearrange the columns into order
relativeRisk.results <- relativeRisk.results[, c(1,4,5,7)]
# third, rename the columns appropriately
colnames(relativeRisk.results)[1] <- "rr"
colnames(relativeRisk.results)[2] <- "rrlower"
colnames(relativeRisk.results)[3] <- "rrupper"
colnames(relativeRisk.results)[4] <- "rHAT"
# template for merging spatial risk data
spatial.data <- campina_grande_neighbourhoods
# now, we proceed to generate our risk maps
# align the results to the areas in shapefile
spatial.data$rr <- relativeRisk.results[, "rr"]
spatial.data$rrlower <- relativeRisk.results[, "rrlower"]
spatial.data$rrupper <- relativeRisk.results[, "rrupper"]
# create categories to define if an area has a significant increase or decrease in risk, or neither
spatial.data$Significance <- NA
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper>1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower==1 | spatial.data$rrupper==1] <- 0 # NOT SIGNIFICANT
spatial.data$Significance[spatial.data$rrlower>1 & spatial.data$rrupper>1] <- 1 # SIGNIFICANT INCREASE
spatial.data$Significance[spatial.data$rrlower<1 & spatial.data$rrupper<1] <- -1 # SIGNIFICANT DECREASE
summary(spatial.data$rr)
hist(spatial.data$rr)
table(spatial.data$Significance)
# we create these columns to preserve the results for the July update
# NOTE: This code needs to change '_3' to show it's for July
colnames(spatial.data)[5] <- "Risk_LiraA_2017_3"
colnames(spatial.data)[8] <- "Sign_LiraA_2017_3"
spatial.data$LiraA_2017_3 <- paste(round(spatial.data$Risk_LiraA_2017_3,2), " (95% CrI:", round(spatial.data$rrlower,2), ",", round(spatial.data$rrupper,2), ")", sep = "")
campina_grande_neighbourhoods$NeighNum <- as.numeric(campina_grande_neighbourhoods$NeighNum)
# map of relative risk
# NOTE: This code needs to change '_3' to show it's for July
rr_map <- tm_shape(spatial.data) +
tm_fill("Risk_LiraA_2017_3", style = "cont", title = "Relavtive Risk [RR]", palette = "OrRd") +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# map of significance regions
# NOTE: This code needs to change '_3' to show it's for July
sg_map <- tm_shape(spatial.data) +
tm_fill("Sign_LiraA_2017_3", style = "cat", title = "Significance Categories",
palette = c("#33a6fe", "white", "#fe0000"), labels = c("Significantly low", "Not Significant", "Significantly high")) +
tm_shape(campina_grande_neighbourhoods) + tm_polygons(alpha = 0.10) + tm_text("NeighNum") +
tm_layout(frame = FALSE, legend.outside = TRUE, legend.title.size = 1.2, legend.text.size = 1) +
tm_compass(position = c("right", "top")) + tm_scale_bar(position = c("right", "bottom"))
# NOTE: This code needs to change '-3' to show it's for July
tmap_save(rr_map, "rr_2017-3.png", height = 12)
tmap_save(sg_map, "sign_2017-3.png", height = 12)
# chart the trajectory of risks to see if they remain sustained across all LIRAa surveys
# create a sink for data posterior estimates and build
# NOTE: This code needs to change '_3' to show it's for July
risk_data_3 <- st_drop_geometry(spatial.data[,c("osmID", "NeighNum", "Risk_LiraA_2017_3", "Sign_LiraA_2017_3", "LiraA_2017_3")])
# all results will be written to this subsequent file
# NOTE: This code needs to change '-3' to show it's for July
write.csv(risk_data_3, file = "Updated 2017-3.csv", row.names = FALSE)
6.1.7 List of key outputs
- The regression results with baseline and updated relative risks stored in 3 separate tables in the following files:
- January (Baseline):
Baseline_Result_January_2017.csv
- April (Updated):
Updated_Result_April_2017.csv
- July (Updated):
Updated_Result_July_2017.csv
- 3 maps that show relative risks and 3 maps that show significant neighbourhoods. See the following files:
- January (Baseline):
rr_2017-1.png
andsign_2017-1.png
- April (Updated):
rr_2017-2.png
andsign_2017-2.png
- July (Updated):
rr_2017-3.png
andsign_2017-3.png
- The relative risk and significance categories from baseline and update models stored in 3 separate tables. These are for mapping and charting those risk trajectories. See the following files:
- January (Baseline):
Baseline 2017-1.csv
- April (Updated):
Updated 2017-2.csv
- July (Updated):
Updated 2017-3.csv
IMPORTANT: Note that these tables are used to create the risk trajectories shown in Week 9’s lecture on slide 39. The columns NeighNum
, Sign_LiraA_2017_1
, Sign_LiraA_2017_2
and Sign_LiraA_2017_3
are arranged in a separate Excel spreadsheet and coloured manually: blue for -1 (significantly lower risks), grey for 0 (not significant) and red for 1 (significantly higher risks).
6.2 Finished!
And that’s all there is to it - that’s how we conduct Bayesian inference. I hope you’ve enjoyed learning this subject. Well done for getting through these practicals! We know they were long and a gruelling process, but you have done well to learn this branch of statistics in Advanced Topics in Social & Geographic Data Sciences. As a token of appreciation, we end with this iconic scene from one of our all-time great anime, Neon Genesis Evangelion.