3 Bayesian Generalised Linear Models (GLMs)
3.1 Lecture recording (Length: 45:46 minutes)
Important Notes: If this or any subsequent video does not play from this website, then try playing it directly from Microsoft Stream - you should be able to view it by clicking this [LINK]. Access to these videos on Microsoft Stream will require your UCL institutional login details.
3.2 Introduction
This week we will learn how to perform Generalised Linear Modelling (GLM) within a Bayesian framework. We will use quantitative data on crime victimisation in Nigeria, which have resulted in a number of interesting publications, and see their application to a real study problem. Recall that all forms of linear models only take a continuous outcome as the dependent variable; this feature can be limiting when we have data following a binary, count or categorical structure. A GLM typically consists of three components, which allow a user to fit different dependent variable types:
- A probability distribution to specify the dependent variable’s conditional distribution;
- A linear predictor, which is the function of the predictors (or independent variables);
- A link function that connects point 1 to point 2.
These models are often applied within a frequentist way of thinking; however, they are also easily estimated in a Bayesian setting. We are going to take it a step further and show you how one can use Stan to encode various regression model types from a Bayesian perspective. We will specifically focus on models with likelihood functions that follow either a Binomial or Poisson distribution, as these are used for risk assessments.
3.2.1 Learning outcomes
Today’s session aims to formally introduce you to Stan programming for Bayesian regression models. By the end of this session, you should be able to perform the following:
- How to select the appropriate likelihood function specification for a Bayesian regression model, i.e., Binomial or Poisson, to model binary or count outcomes respectively;
- How to fully develop Stan code for such regression models with the appropriate prior specification (i.e., uninformative, weakly informative or informative) for the various parameters;
- How to interpret the various types of coefficients, e.g., Odds Ratios (OR) and Risk Ratios (RR).
You can follow the live walkthrough demonstration delivered in the first hour of the practical, and then use the remaining time to try the practical tutorials and follow the instructions on your own.
3.2.2 Demonstration recording (Length: 52:52 minutes)
Important Notes: If this or any subsequent video does not play from this website, then try playing it directly from Microsoft Stream - you should be able to view it by clicking this [LINK]. Access to these videos on Microsoft Stream will require your UCL institutional login details.
3.2.3 Datasets & setting up the work directory
Go to your GEOG0125 folder and create a sub-folder called “Week 2” within it. Here, we will store all our R and Stan scripts as well as the datasets. Set your work directory to the Week 2 folder.
For Windows, the code for setting the work directory will be:
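# a hypothetical Windows path - replace 'YourName' with your own user folder
setwd("C:/Users/YourName/Documents/GEOG0125/Week 2")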
For MAC, the code for setting the work directory will be:
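# a hypothetical Mac path - replace 'YourName' with your own user folder
setwd("/Users/YourName/Documents/GEOG0125/Week 2")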
The datasets for this practical are:
Street Burglary Data in Nigeria.csv
Household victimisation data in Nigeria.csv
London LSOA 2015 data.csv
Context about the dataset: Conventional analyses of crime, based on European research models, are often poorly suited to assessing the specific dimensions of criminality in Africa. The data used in today’s practical are an anonymised excerpt from the Development Frontiers in Crime, Livelihoods and Urban Poverty in Nigeria (FCLP) project, which aimed to provide an alternative framework for understanding the specific drivers of criminality in a West African urban context. This research project used a mixed-methods approach combining statistical modelling, geovisualisation and ethnography, and attempted to situate insecurity and crime against a broader backdrop of rapid urban growth, seasonal migration, youth unemployment and informality. The study provided researchers both in Nigeria and internationally with a richer and more nuanced evidence base on the particular dynamics of crime from an African perspective, resulting in a number of publications: [1], [2] and [3].
3.2.4 Loading and installing packages
We will need to load the following packages from previous practicals:
- rstan: a package that enables R to interface with Stan. It provides R the functions to code, parse, compile, test, estimate, and analyze Bayesian models through Stan in R.
Note that when you load rstan from CRAN you will see some recommendations on using multiple cores to speed up the process. For the best experience, we highly recommend using this code:
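# load rstan and tell it to parallelise the chains across your computer's cores
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)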
This tells RStudio to use multiple cores for parallel processing whenever Stan is implemented. Every time you want to use Stan, make sure to run the parallel::detectCores() and rstan_options() code.
3.3 Poisson Regression Modelling
We are going to fit a Poisson-type model on an outcome that contains discrete counts of households reporting to have been victims of burglary. Let us load the data into RStudio and call the object burglaryDataset.
# Set your own directory using setwd() function
# Load data into RStudio using read.csv(). The spreadsheet is stored in the object called 'burglaryDataset'
burglaryDataset <- read.csv("Street Burglary Data in Nigeria.csv")
names(burglaryDataset)
3.3.1 Selecting the appropriate Poisson model
There are three different types of Poisson models:
- Standard Poisson regression
- Negative Binomial Poisson regression
- Zero-Inflated Poisson regression
Which of these models to implement is highly dependent on the frequency distribution of the count response variable. If it resembles a normal curve, then use the standard version. Use the Negative Binomial Poisson regression if there is any evidence of over-dispersion. When there is an inflation of zero counts in the dataset, you will have to use the Zero-Inflated model to account for the problem. A quick numerical check is sketched below.
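Recall that a Poisson-distributed outcome should have a variance roughly equal to its mean, so a variance much larger than the mean is evidence of over-dispersion:
# compare the mean and variance of the outcome; variance >> mean suggests over-dispersion
mean(burglaryDataset$burglary)
var(burglaryDataset$burglary)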
Let’s check the frequency distribution of the outcome variable burglary, which corresponds to the number of reported instances a property on a street was burgled. You can simply use a histogram to examine its distribution:
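A minimal sketch using base R's hist() function (the number of breaks is arbitrary):
# plot the distribution of street-level burglary counts
hist(burglaryDataset$burglary, breaks = 20,
     xlab = "Number of reported burglaries on a street",
     main = "Distribution of burglary counts")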
The plot shows evidence of over-dispersion: most streets in Kaduna have a low frequency of burglaries, with a long right tail of higher counts. Here, we consider using a Negative Binomial Poisson regression model over the standard and zero-inflated versions (i.e., models 1 and 3). Now that we know the model type, let’s prepare the dataset of our independent variables.
3.3.2 Data preparation and set-up for Bayesian analysis in Stan
Stan handles continuous variables nicely; but for some strange reason it’s absolutely terrible with categorical variables! It can only handle them as dummy variables - meaning that each categorical variable must be split into a series of binary variables. You will see what I mean in a second. The following street-level variables are examples of categorical variables that need to be converted into dummy variables: choice, integration, business activities and socioeconomic deprivation.
They need to be made into factor variables using the as.factor() function before implementing model.matrix() to apply the conversion. Here is the code:
# change the variable from numeric to a factor variable using the as.factor() function
burglaryDataset$choice_q<- as.factor(burglaryDataset$choice_q)
burglaryDataset$integ_q <- as.factor(burglaryDataset$integ_q)
burglaryDataset$business_q <- as.factor(burglaryDataset$business_q)
burglaryDataset$socioeconomic_q <- as.factor(burglaryDataset$socioeconomic_q)
Next, we use the model.matrix() function to apply the conversion:
# extract all the important independent variables to be used in model
SelectedVariables <- burglaryDataset[,c(4,5,6,7,8,9)]
# build the model matrix from the selected variables (factor variables become dummy variables)
X <- model.matrix(~ 0 + distance + connectivity + choice_q + integ_q + business_q + socioeconomic_q, data = SelectedVariables)
# see conversion
View(X)
Drop any column that is a first category - here, for instance, choice_q1 is present. This needs to be dropped as it is the referent category. A minimal way of doing this is sketched below.
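A minimal sketch, assuming the referent column is named choice_q1 as shown by View(X):
# drop the referent category so the remaining choice_q columns are interpreted relative to it
X <- X[, colnames(X) != "choice_q1"]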
Our independent variables are stored in the matrix X. We will use this to specify them in the models when programming in Stan. Now, we need to define our dependent variable y, i.e., the number of reported burglaries on a street, as well as the total number of houses on a street as an offset - these two quantities will also need to be specified in our Stan code.
# declare the dependent variable
y <- burglaryDataset$burglary
# declare the totalhouses column as the denominators to be included as an offset in the model
denominators <- burglaryDataset$totalhouses
Now, we need to condense this information into a list() object as our data for Stan to read:
stan.negbin.dataset <- list(N=length(y), X=X, k=ncol(X), phi_scale=0.9, y=y, denominators=denominators)
Important Notes:
- N = length(y): we extract the number of observations rather than typing the number in full. Here, N is 743.
- X = X: this is the set of independent variables we created earlier, with the dummy variables etc. It should contain 16 columns.
- k = ncol(X): we extract the number of columns from X.
- phi_scale = 0.9: here, we specify the scale of the prior for our over-dispersion parameter as 0.9.
- y = y: here, we define the outcome variable (i.e., counts of burglaries) as y.
- denominators = denominators: here, we define totalhouses as the denominators in the model.
At this point, the dataset is fully prepared and ready to go. Let’s create our Stan script for running a Negative Binomial Poisson regression within a Bayesian framework.
3.3.3 Creating a script to run a Negative Binomial Poisson regression in Stan
A typical Stan program for a regression consists of the following six blocks:
- Data
- Transformed data
- Parameters
- Transformed parameters
- Model
- Generated quantities
The Data, Parameters and Model blocks must be specified for the regression to work. But there are additional things we need to transform, such as the inputted data, the parameters, and our estimated coefficients for the relative risk (RR). Let’s start with the data block:
FIRST STEP: We specify the total number of observations int N as well as the number of independent variables int k as integers in the data block. We define our y as an array of integers of size N holding all the values of our dependent variable. We also define our independent variables as a matrix X with N rows and k columns (i.e., matrix[N, k] X). We define the scale of our over-dispersion prior as a real number real and call it phi_scale. Lastly, we define the denominators as a vector:
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
Next, we need to apply a log() transformation to the denominators to use them as an offset in the model. Here, we are making a direct change to the data; hence, we use the transformed data block:
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
transformed data {
vector[N] offset = log(denominators);
}
SECOND STEP: For the parameters block, we specify the name of the regression intercept alpha and the coefficients for all our independent variables, which we define as a vector beta of size k. The reciporcal_phi is a precision estimate for which we will specify a prior in our model.
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
transformed data {
vector[N] offset = log(denominators);
}
parameters {
real alpha; // intercept i.e., overall risk of burglary in population
vector[k] beta; // regression coefficients
real<lower = 0> reciporcal_phi; // overdispersion (constrained positive so its Cauchy prior is a half-Cauchy)
}
THIRD STEP: We define our regression model in the transformed parameters block. mu represents the predicted number of burglaries, which is a vector of size N (i.e., one element per street). The offset + alpha + X*beta is the linear predictor for the regression model. phi is the over-dispersion parameter accounted for in the model:
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
transformed data {
vector[N] offset = log(denominators);
}
parameters {
real alpha; // intercept i.e., overall risk of burglary in population
vector[k] beta; // regression coefficients
real<lower = 0> reciporcal_phi; // overdispersion (constrained positive so its Cauchy prior is a half-Cauchy)
}
transformed parameters {
vector[N] mu; // create mu our estimator for model
real phi; // define precision parameter
mu = offset + alpha + X*beta; // linear predictor for mu i.e., predicted burden of burglaries
phi = 1./reciporcal_phi; // precision (akin to variation)
}
FOURTH STEP: We build our likelihood function and specify the priors for each parameter under the model block:
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
transformed data {
vector[N] offset = log(denominators);
}
parameters {
real alpha; // intercept i.e., overall risk of burglary in population
vector[k] beta; // regression coefficients
real<lower = 0> reciporcal_phi; // overdispersion (constrained positive so its Cauchy prior is a half-Cauchy)
}
transformed parameters {
vector[N] mu; // create mu our estimator for model
real phi; // define precision parameter
mu = offset + alpha + X*beta; // linear predictor for mu i.e., predicted burden of burglaries
phi = 1./reciporcal_phi; // precision (akin to variation)
}
model {
reciporcal_phi ~ cauchy(0, phi_scale); // prior for reciporcal_phi using a half-Cauchy
alpha ~ normal(0, 10); // prior for intercept i.e., normal with mean 0 and standard deviation 10
beta ~ normal(0, 2.5); // prior for betas i.e., normal with mean 0 and standard deviation 2.5
y ~ neg_binomial_2_log(mu, phi); // likelihood function for burglaries neg_binomial_2_log()
}
LAST STEP: We instruct Stan on the parameters we want reported. We want them as relative risk ratios (RR), or should we say crime risk ratios (CRR). We use the generated quantities block to obtain the estimates we need, i.e., the exponentiation of the intercept and coefficients:
data {
int<lower = 1> N; // Number of observations
int<lower = 1> k; // Number of columns in the X matrix of independent variables
int y[N]; // Dependent variable - Number of reported burglaries on a street segment
matrix[N,k] X; // Model matrix for our independent variables
real phi_scale;
vector<lower = 1>[N] denominators;
}
transformed data {
vector[N] offset = log(denominators);
}
parameters {
real alpha; // intercept i.e., overall risk of burglary in population
vector[k] beta; // regression coefficients
real<lower = 0> reciporcal_phi; // overdispersion (constrained positive so its Cauchy prior is a half-Cauchy)
}
transformed parameters {
vector[N] mu; // create mu our estimator for model
real phi; // define precision parameter
mu = offset + alpha + X*beta; // linear predictor for mu i.e., predicted burden of burglaries
phi = 1./reciporcal_phi; // precision (akin to variation)
}
model {
reciporcal_phi ~ cauchy(0, phi_scale); // prior for reciporcal_phi using a half-Cauchy
alpha ~ normal(0, 10); // prior for intercept i.e., normal with mean 0 and standard deviation 10
beta ~ normal(0, 2.5); // prior for betas i.e., normal with mean 0 and standard deviation 2.5
y ~ neg_binomial_2_log(mu, phi); // likelihood function for burglaries neg_binomial_2_log()
}
generated quantities {
vector[N] eta; // vector containing our predicted number of burglaries for each street
vector[k] crr; // vector containing our estimated crime risk ratios for each variable
real alpha_crr;
alpha_crr = exp(alpha);
eta = exp(mu); // predicted number of burglaries for each street
crr = exp(beta); // risk ratio for each variable
}
COMPLIMENTS: Well done - you are doing well. You have coded your first Bayesian regression. Let’s save the script, and then compile and run it through RStudio to get our crime risk ratio (CRR) results.
3.3.4 Compiling our Stan code in RStudio
Now, let us turn our attention to RStudio. We use the stan() function to compile the script and obtain the posterior estimates of the overall risk and the crime risk ratios (CRR) for each independent variable:
# the directory needs to be set to where you saved the dataset and Stan script
crr.negbin.model = stan("Script for NegBin model.stan", data=stan.negbin.dataset, iter=5000, chains=3, verbose = FALSE)
# takes roughly 1-2 minutes to compile
We can print the results accordingly:
# reports the crime rate ratio (relative risk)
print(crr.negbin.model, probs=c(0.025, 0.975), pars = c("alpha_crr", "crr", "lp__"))
Output summary table
Inference for Stan model: anon_model.
3 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=7500.
mean se_mean sd 2.5% 97.5% n_eff Rhat
alpha_crr 0.12 0.00 0.04 0.07 0.21 3289 1
crr[1] 1.00 0.00 0.00 1.00 1.00 10232 1
crr[2] 1.07 0.00 0.03 1.01 1.13 5844 1
crr[3] 0.96 0.00 0.21 0.60 1.44 5245 1
crr[4] 1.03 0.00 0.25 0.63 1.60 4469 1
crr[5] 0.85 0.00 0.28 0.44 1.52 4323 1
crr[6] 0.99 0.00 0.22 0.64 1.48 5065 1
crr[7] 0.90 0.00 0.20 0.57 1.36 4608 1
crr[8] 0.85 0.00 0.21 0.51 1.34 4607 1
crr[9] 0.93 0.00 0.25 0.53 1.50 4888 1
crr[10] 0.70 0.00 0.17 0.42 1.10 4924 1
crr[11] 1.15 0.00 0.27 0.71 1.78 4700 1
crr[12] 1.21 0.00 0.29 0.73 1.88 4495 1
crr[13] 1.23 0.00 0.29 0.76 1.88 3971 1
crr[14] 0.95 0.00 0.22 0.59 1.46 4427 1
crr[15] 0.95 0.00 0.24 0.56 1.49 4366 1
crr[16] 1.25 0.01 0.32 0.74 1.98 4097 1
lp__ -1023.04 0.06 3.03 -1029.81 -1018.04 2670 1
Samples were drawn using NUTS(diag_e) at Thu Jan 19 22:33:08 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Here is the messy part - interpretation. Before anything, note that alpha_crr corresponds to the intercept. Each crr[1], crr[2], ... and so on corresponds to the ordering of the columns in the X matrix containing the independent variables. If you run colnames(X), it will print this list of names:
Output
[1] "distance" "connectivity" "choice_q2" "choice_q3" "choice_q4"
[6] "integ_q2" "integ_q3" "integ_q4" "business_q2" "business_q3"
[11] "business_q4" "business_q5" "socioeconomic_q2" "socioeconomic_q3" "socioeconomic_q4"
[16] "socioeconomic_q5"
- alpha_crr is the risk ratio for the intercept
- crr[1] corresponds to the continuous distance variable
- crr[2] corresponds to the continuous connectivity variable
- crr[3], crr[4] and crr[5] correspond to the 2nd, 3rd and 4th categories of the choice variable
- crr[6], crr[7] and crr[8] correspond to the 2nd, 3rd and 4th categories of the integration variable
- crr[9], crr[10], crr[11] and crr[12] correspond to the 2nd, 3rd, 4th and 5th categories of the business activities variable
- crr[13], crr[14], crr[15] and crr[16] correspond to the 2nd, 3rd, 4th and 5th categories of the socioeconomic deprivation variable
Interpretation of interesting coefficients:
Connectivity (cont.): A unit increase in the number of connections on a street segment significantly increases the risk of burglary by 7% (CRR 1.07, 95% CrI: 1.01-1.13). Alternatively, you can say the risk of burglary in relation to a unit increase in the number of connections on a street is 1.07 times higher.
Business activities (cat.): Relative to street segments with lower levels of business activities, streets in the highest quintiles (i.e., higher levels of commerce) tend to report more burglaries (4th category: CRR 1.15, 95% CrI: 0.71-1.78; 5th category: CRR 1.21, 95% CrI: 0.73-1.88), although both credible intervals contain 1.
3.4 Logistic Regression Modelling
3.4.1 Data preparation and set-up for Bayesian analysis in Stan
Running a logistic regression is very easy. The household-level dataset contains information regarding the security status of a property and victimisation (i.e., whether the household was a victim of burglary). Here is the list of variables:
- Gate: Is the property gated? (1 = yes, 0 = no)
- Garage: Does the property contain a garage? (1 = yes, 0 = no)
- Outdoor_sitting: Does the property have an outdoor sitting area? (1 = yes, 0 = no)
- Security_lights: Are security lights attached to the property? (1 = yes, 0 = no)
- BurglaryStatus: Whether the household has been a victim of burglary (1 = yes, 0 = no) - the outcome variable.
Let’s load this dataset, i.e., Household victimisation data in Nigeria.csv, and implement a logistic regression model using the Bernoulli likelihood function on the binary dependent variable BurglaryStatus to assess the impact of properties being protected by some form of security.
Let’s get straight to preparing the dataset for Stan:
# load data
houselevel.data <- read.csv("Household victimisation data in Nigeria.csv")
# declare binary variables as factor variables
houselevel.data$Gate<- as.factor(houselevel.data$Gate)
houselevel.data$Garage <- as.factor(houselevel.data$Garage)
houselevel.data$Outdoor_sitting <- as.factor(houselevel.data$Outdoor_sitting)
houselevel.data$Security_lights <- as.factor(houselevel.data$Security_lights)
# create a matrix for the independent variables
X <- model.matrix(~ 0 + Gate + Garage + Outdoor_sitting + Security_lights, data = houselevel.data)
# declare the dependent variable
y <- houselevel.data$BurglaryStatus
# compile all information into a list() object for Stan to read in the logistic regression script
stan.logistic.dataset <- list(N=length(y), X=X, k=ncol(X), y=y)
3.4.2 Creating a script to run a logistic regression in Stan
Let’s get straight to writing the logistic regression script in Stan.
// Stan script for running logistic regression model
data {
int<lower = 1> N;
int<lower = 1> k;
int<lower = 0, upper = 1> y [N]; // Burglary status 1 = yes, 0 = no
matrix[N, k] X;
}
parameters {
real alpha;
vector[k] beta;
}
model {
y ~ bernoulli_logit(alpha + X*beta); // fit a logistic regression using a bernoulli_logit()
alpha ~ normal(0, 10);
beta ~ normal(0, 2.5);
}
generated quantities {
vector[k] orr;
real alpha_orr;
alpha_orr = exp(alpha);
orr = exp(beta);
}
Now, save the script with an appropriate name, and we can compile the code in RStudio with the stan() function:
orr.logistic.model = stan("Script for logit model.stan", data=stan.logistic.dataset, iter=5000, chains=3, verbose = FALSE)
Print the results for risk of burglaries in relation to target hardening security features:
# reports the odds ratios (OR)
print(orr.logistic.model, probs=c(0.025, 0.975), pars = c("alpha_orr", "orr", "lp__"))
Output summary table
Inference for Stan model: anon_model.
3 chains, each with iter=5000; warmup=2500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=7500.
mean se_mean sd 2.5% 97.5% n_eff Rhat
alpha_orr 0.16 0.00 0.01 0.13 0.19 4591 1
orr[1] 1.06 0.00 0.12 0.85 1.31 5589 1
orr[2] 1.04 0.00 0.23 0.64 1.56 6729 1
orr[3] 1.08 0.00 0.13 0.85 1.35 6637 1
orr[4] 1.22 0.00 0.13 0.99 1.49 5937 1
lp__ -1268.99 0.03 1.60 -1272.95 -1266.93 3375 1
Samples were drawn using NUTS(diag_e) at Fri Jan 20 00:35:46 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Here is the messy part again - interpretation. Before anything, note that alpha_orr corresponds to the intercept. Each orr[1], orr[2], ... and so on corresponds to the ordering of the columns in the X matrix containing the independent variables. If you run colnames(X), it will print this list:
Printing list of variable names in X matrix
- alpha_orr is the odds ratio for the intercept
- orr[1] corresponds to the category Gate = 1, meaning the property has a gate
- orr[2] corresponds to the category Garage = 1, meaning the property has a garage
- orr[3] corresponds to the category Outdoor_sitting = 1, meaning the property has an outdoor sitting area
- orr[4] corresponds to the category Security_lights = 1, meaning the property has security lights outside
Interpretation of interesting coefficients:
- Gated properties (cat.): Compared to properties without gates, those with this security feature are targeted and hence 6% more likely to be a victim of burglary (OR: 1.06, 95% CrI: 0.85-1.31).
- Properties with garages (cat.): Compared to properties without garages, those with this feature are targeted and hence 4% more likely to be a victim of burglary (OR: 1.04, 95% CrI: 0.64-1.56).
- Properties with outdoor sitting areas (cat.): Compared to properties with no outdoor features, those with this feature are targeted and hence 8% more likely to be a victim of burglary (OR: 1.08, 95% CrI: 0.85-1.35).
- Properties with security lights (cat.): Compared to properties with no security lights, those with this feature are targeted and hence 22% more likely to be a victim of burglary (OR: 1.22, 95% CrI: 0.99-1.49).
Note that all four credible intervals contain 1, so these estimated increases should be interpreted with caution.
3.5 Tasks
3.5.1 Task 1 - Obesity and Fastfoods in London
Accessibility to junk food restaurants for young adolescents, especially after school hours, is a growing cause for concern. This is especially so now that many young adults have a sedentary lifestyle; hence obesity rates among this population are increasing in the UK.
Try this problem in Stan: Use the dataset Obesity and Fastfoods in MSOAs data.csv
to determine the links between the prevalence of obesity in high school students and the density of fast food (cheap) restaurants and deprivation in MSOAs in London. Implement a Bayesian GLM using Stan code.
Variable names:
- SEQID: ID number for the row
- MSOA11CD: Unique identifier for the MSOA area
- MSOA11NM: Name of the MSOA area
- OBESE: Number of children identified as obese in the MSOA in London
- TOTAL: Total number of children surveyed for BMI measurements
- IMDMSOA: Area-level socioeconomic deprivation score (higher scores mean higher deprivation and vice versa)
- RESTCAT: Categorical variable classifying an MSOA in terms of the density of junk/cheap fast food outlets: 1 = 1 to 10, 2 = 11 to 25, 3 = 26 to 50 and 4 = 51 to 300.
HINT: You might want to consider using the binomial_logit() or binomial() function in the model block, and reporting the odds ratios using the generated quantities block. Also, use as.factor() for independent variables that are categorical. You can put all independent variables into a matrix using the model.matrix() function. You might also want to compute the exceedance probabilities for the odds ratios using a threshold of 1 (see the sketch below).
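If you go down the exceedance-probability route, you can work with the posterior draws directly. A minimal sketch, where task1.model is a hypothetical stanfit object and orr a generated quantity holding the odds ratios (mirroring the logistic script above):
# extract the posterior draws of the odds ratios and compute Pr(OR > 1) for each column
orr.draws <- rstan::extract(task1.model, pars = "orr")$orr
exceedance.prob <- apply(orr.draws, 2, function(draws) mean(draws > 1))
exceedance.prob
The same approach works for the coefficients in Task 2 with a threshold of 0, i.e., mean(draws > 0).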
3.5.2 Task 2 - Factors affecting house prices in London (2015)
Try this problem in Stan: Use the London LSOA 2015 data.csv data pertaining to house prices in 2015, and assess their relationship with public transport accessibility (PTA), average income and socioeconomic deprivation (IMD) as the independent variables. Implement a Bayesian GLM using Stan code.
Variable names:
- LSOACODE: Unique identification code for the geographic area
- AVEPRICE: (Dependent variable) Average house price estimated for the LSOA in 2015
- AVEINCOME: Estimated average annual income for households within an LSOA in 2015
- IMDSCORE: Deprivation score for an LSOA in 2015
- PTAINDEX: Measures levels of access/connectivity to public transport
HINT: You might want to consider using the normal() likelihood function in the model block. You might also want to compute the exceedance probabilities for the coefficients using a threshold of 0 (a sketch of a possible model block follows).
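Since average house price is continuous, a Gaussian likelihood is the natural choice. A minimal sketch of the model block, assuming y is declared as vector[N] in the data block and alpha, beta and sigma (with <lower = 0>) in the parameters block, mirroring the scripts above:
model {
  alpha ~ normal(0, 10);              // weakly informative prior for the intercept
  beta ~ normal(0, 2.5);              // weakly informative priors for the coefficients
  sigma ~ cauchy(0, 5);               // half-Cauchy prior for the residual standard deviation
  y ~ normal(alpha + X*beta, sigma);  // Gaussian likelihood for average house prices
}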
Note: Solutions will be made available later today.