Week 2 Examining data I

2.1 Introduction

Welcome to your second week of Introduction to Quantitative Research Methods. This week we will focus on examining data using measures of central tendency and measures of dispersion. These measures are collectively known as descriptive statistics. We will also talk about some basic data visualisation. Also, as by now, everyone should be up and running with RStudio Server, we will apply some of these descriptive measures onto some data. Alright, let’s get to it.

2.2 Measures of central tendency

Any research project involving quantitative data should start with an exploration and examination of the available data sets. This applies both to data that you have collected yourself and data that you have acquired in a different way, e.g., through downloading official UK Census and labour market statistics. The set of techniques that is used to examine your data in first instance is called descriptive statistics. Descriptive statistics are used to describe the basic features of your data set and provide simple summaries about your data. Together with simple visual analysis, they form the basis of virtually every quantitative data analysis.

2.2.1 Accessing RStudio Server, uploading and importing data

Make sure to download the data set for Week 2 from HERE if you have not already done so from the Welcome Page. The data sets are:

  1. For the tutorial practicals: London historical population dataset.csv
  2. For the seminar tasks and questions: Ambulance and Assault Incidents data.csv

Let’s us first sign into RStudio Server:

  • Log into the RStudio Server: https://rstudio.data-science.rc.ucl.ac.uk/. Log in with your usual UCL username and password.
  • Create a new folder called Week 2
  • Upload the data London historical population dataset.csv into the folder of Week 2
  • Set the directory to the folder location of Week 2 to where the CSV file was uploaded by clicking on the More >> Set As Working Directory.
  • Finally, import the data into workspace using the read.csv() function.
# Load data into RStudio. The spreadsheet is stored in the object called 'London.Pop'
London.Pop <- read.csv("London historical population dataset.csv")

If you struggle with the above steps, as well as setting up your working directory. Then here is a reminder, have a look at how we did this last week!

Use the head(), or View() functions to examine the structure of the London population data frame. The head() functions allows the users to see the first 5 rows of the data frame in the R-Console. The View() allows the user to see the entire spreadsheet in a data viewer.

2.2.2 Mean, Median and Mode

We are going to use the following functions: mean() and median(), to compute the mean and median value respectively - these two estimates are one of the three measures of central tendency.

We can apply these measures on our London.Pop data set. Let’s do this by focusing on London’s population in 2011. We going to keep the 1st, 2nd and 24th columns in the London.Pop data frame which corresponds to the columns called Area Code, Area Name and Person-2011, respectively.

# select the relevant columns
London.Pop2011 <- London.Pop[,c(1:2,24)]

Now we can calculate our measures of central tendency using R’s built-in functions for the median and the mean.

# calculate the median of the 2011 population variable
median(London.Pop2011$Persons.2011)
## [1] 254096
# calculate the mean of the 2011 population variable
mean(London.Pop2011$Persons.2011)
## [1] 247695.2

Question(s)

  1. How do you explain that the median is larger than the mean for this variable? [HINT: What is distorting the mean value such that its brought the value down?]

R does not have a standard in-built function to calculate the mode. As we still want to show the mode, we create a user function to calculate the mode of our data set. This function takes a numeric vector as input and gives the mode value as output.

Note
You do not have to worry about creating your own functions, so just copy and paste the code below to create the get_mode() function.

# create a function to calculate the mode
get_mode <- function(x) {
  
  # get unique values of the input vector
  uniqv <- unique(x)
  
  # select the values with the highest number of occurrences
  uniqv[which.max(tabulate(match(x, uniqv)))]
}
# calculate the mode of the 2011 population variable
get_mode(London.Pop2011$Persons.2011)
## [1] 7375

Question(s)

  1. What is the level of measurement of our Persons.2011 variable? Nominal, ordinal, or interval/ratio?
  2. Even though we went through all the trouble to create our own function to calculate the mode, do you think it is a good choice to calculate the mode for this variable? Why? Why not?

Although R does most of the hard work for us, especially with the mean() and the median() function, it is a good idea to once go through the calculations of these two central tendency measures ourselves. Let’s calculate the mean step-by-step and then verify our results with the results of R’s mean() function.

# get the sum of all values
Persons.2011.Sum <- sum(London.Pop2011$Persons.2011)
# inspect the result
Persons.2011.Sum
## [1] 8173941
# get the total number of observations
Persons.2011.Obs <- length(London.Pop2011$Persons.2011)
# inspect the result 
Persons.2011.Obs
## [1] 33
# calculate the mean
Persons.2011.Mean <- Persons.2011.Sum / Persons.2011.Obs
# inspect the result
Persons.2011.Mean
## [1] 247695.2
# compare our result with R's built-in function
mean(London.Pop2011$Persons.2011) == Persons.2011.Mean
## [1] TRUE

Great. Our own calculation of the mean is identical to R’s built-in function. Now let’s do the same for the median.

# get the total number of observations
Persons.2011.Obs <- length(London.Pop2011$Persons.2011)
# inspect the result
Persons.2011.Obs
## [1] 33
# order our data from lowest to highest
Persons.2011.Ordered <- sort(London.Pop2011$Persons.2011, decreasing=FALSE)
# inspect the result
Persons.2011.Ordered
##  [1]   7375 158649 160060 182493 185911 186990 190146 199693 206125 219396
## [11] 220338 231997 237232 239056 246270 253957 254096 254557 254926 258249
## [21] 273936 275885 278970 288283 303086 306995 307984 309392 311215 312466
## [31] 338449 356386 363378
# get the number of the observation that contains the median value
Persons.Median.Obs <- (Persons.2011.Obs + 1)/2
# inspect the result
Persons.Median.Obs
## [1] 17
# get the median
Persons.2011.Median <- Persons.2011.Ordered[Persons.Median.Obs]
# inspect the result
Persons.2011.Median
## [1] 254096
# compare our result with R's built-in function
median(London.Pop2011$Persons.2011) == Persons.2011.Median
## [1] TRUE

2.3 Simple plots

Before moving on to the second set of descriptive statistics, the measures of dispersion, this is a good moment to note that simple data visualisations are also an extremely powerful tool to explore your data. In fact, tools to create high quality plots have become one of R’s greatest assets. This is a relatively recent development since the software has traditionally been focused on the statistics rather than visualisation. The standard installation of R has base graphic functionality built in to produce very simple plots. For example we can plot the relationship between the London population in 1811 and 1911.

Note
Next week we will be diving deeper into data visualisation and making plots and graphs in R, but for now it is a good idea to already take a sneak peek at how to create some basic plots.

# make a quick plot of two variables of the London population data set
plot(London.Pop$Persons.1811,London.Pop$Persons.1911)

Experimenting with plot()

  1. What happens if you change the order of the variables you put in the plot() function? Why?
  2. Instead of using the $ to select the columns of our data set, how else can we get the same results?

The result of calling the plot() function, is a very simple scatter graph. The plot() function offers a huge number of options for customisation. You can see them using the ?plot help pages and also the ?par help pages (par in this case is short for parameters). There are some examples below (note how the parameters come after specifying the x and y columns).

# add a title, change point colour, change point size
plot(London.Pop$Persons.1811, London.Pop$Persons.1911, main='Quick Plot in R', col='blue', cex=2)

# add a title, change point colour, change point symbol
plot(London.Pop$Persons.1811, London.Pop$Persons.1911, main="Another Quick Plot in R", col='magenta', pch=22)

Also, you can apply titles to the axes using the xlab ="" and ylab="" argument after the main="" in the plot().

# add a axis titles, rotate number labels
plot(London.Pop$Persons.1811, London.Pop$Persons.1911, main="Another Quick Plot in R", xlab = "Population in 1811", ylab = "Population in 1911", col='magenta', pch=22)

You can prevent R from printing the scientific notation (e.g., 2e+05, 4e+05 etc.,) on the y-axis in the graphical output. You can turn it off by typing the following code before running the plot() function

# turn-off horrible scientific notation
options(scipen = 999)
# add a axis titles, rotate number labels
plot(London.Pop$Persons.1811, London.Pop$Persons.1911, main="Another Quick Plot in R", xlab = "Population in 1811", ylab = "Population in 1911", col='magenta', pch=22)

Note
For more information on the plot parameters (some have obscure names) have a look here: http://www.statmethods.net/advgraphs/parameters.html

Recap

In this section you have learnt how to:

  1. Calculate the mode, median, and mean of a variable in R.
  2. Make some simple scatter plots (in preparation for next week) to visualise your data.

2.4 Measures of dispersion

When exploring your data, measures of central tendency alone are not enough as they only tell you what a ‘typical’ value looks like but they do not tell you anything about all other values. Therefore we also need to look at some measures of dispersion. Measures of dispersion describe the spread of data around a central value (e.g. the mean, the median, or the mode). The most commonly used measure of dispersion is the standard deviation. The standard deviation is a measure to summarise the spread of your data around the mean. The short video below will introduce you to the standard deviation as well as to three other measures of dispersion: the range, the interquartile range, and the variance.

For the rest of this tutorial we will change our data set to one containing the number of assault incidents that ambulances have been called to in London between 2009 and 2011. You will need to download a prepared version of this file called: Ambulance and Assault Incidents data.csv and upload it to your working directory. It is in the same data format (csv) as our London population file so we use the read.csv() command again.

# Load data into RStudio. The spreadsheet is stored in the object called 'London.Ambulance'
London.Ambulance <- read.csv('Ambulance and Assault Incidents data.csv')
# inspect the results 
head(London.Ambulance)
##   BorCode     WardName WardCode                WardType Assault_09_11
## 1    00AA   Aldersgate   00AAFA Prospering Metropolitan            10
## 2    00AA      Aldgate   00AAFB Prospering Metropolitan             0
## 3    00AA    Bassishaw   00AAFC Prospering Metropolitan             0
## 4    00AA Billingsgate   00AAFD Prospering Metropolitan             0
## 5    00AA  Bishopsgate   00AAFE Prospering Metropolitan           188
## 6    00AA Bread Street   00AAFF Prospering Metropolitan             0
# inspect the number of rows and columns of the data set
dim(London.Ambulance)
## [1] 649   5

You will notice that the data table has 5 columns and 649 rows. The column headings are abbreviations of the following:

Column heading Full name Description
BorCode Borough Code London has 32 Boroughs (such as Camden, Islington, Westminster, etc.) plus the City of London at the centre. These codes are used as a quick way of referring to them from official data sources.
WardName Ward Name Boroughs can be broken into much smaller areas known as Wards. These are electoral districts and have existed in London for centuries.
WardCode Ward Code A statistical code for the wards above.
WardType Ward Type A classification that groups wards based on similar characteristics.
Assault_09_11 Assault Incidents The number of assault incidents requiring an ambulance between 2009 and 2011 for each ward in London.

Let’s start by calculating two measures of central tendency by using the median() and mean() functions.

# calculate the median of the assault incident variable
median(London.Ambulance$Assault_09_11)
## [1] 146
# calculate the mean of the assault incident variable
mean(London.Ambulance$Assault_09_11)
## [1] 173.4669

Questions

  1. How do you explain that the mean is larger than the median for this variable?

Great. Let’s now calculate some measures of dispersion for our data: the range, the interquartile range, and the standard deviation. The calculation of the range is very straightforward as we only need to subtract the the minimum value from the the maximum value. We can find these values by using the built-in min() and max() functions.

# get the minimum value of the assault incident variable
min(London.Ambulance$Assault_09_11)
## [1] 0
# get the maximum value of the assault incident variable
max(London.Ambulance$Assault_09_11)
## [1] 1582
# calculate the range
1582 - 0
## [1] 1582
# or in one go
max(London.Ambulance$Assault_09_11) - min(London.Ambulance$Assault_09_11)
## [1] 1582

Questions

  1. What does this range mean?

The interquartile range requires a little bit more work to be done as we now need to work out the values of the 25th and 75th percentile.

Note
A percentile is a score at or below which a given percentage of your data points fall. For example, the 50th percentile (also known as the median!) is the score at or below which 50% of the scores in the distribution may be found.

# get the total number of observations
London.Ambulance.Obs <- length(London.Ambulance$Assault_09_11)
# inspect the result
London.Ambulance.Obs
## [1] 649
# order our data from lowest to highest
London.Ambulance.Ordered <- sort(London.Ambulance$Assault_09_11, decreasing=FALSE)
# inspect the result
London.Ambulance.Ordered
##   [1]    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
##  [16]    0   10   18   19   21   22   25   28   28   28   29   30   34   36   36
##  [31]   38   40   41   41   42   42   42   43   43   44   44   45   46   46   47
##  [46]   47   48   48   48   49   49   49   49   49   50   50   50   51   51   51
##  [61]   51   51   51   52   53   53   54   54   54   55   55   55   56   56   56
##  [76]   56   57   57   57   57   57   58   59   59   60   61   62   63   63   64
##  [91]   64   64   64   64   64   64   65   65   65   66   66   66   67   67   67
## [106]   68   69   69   69   69   70   70   71   71   71   71   72   73   74   74
## [121]   74   74   75   75   75   76   76   76   76   76   76   76   77   77   77
## [136]   77   77   77   78   78   78   79   79   79   80   80   80   80   81   81
## [151]   82   82   82   83   83   84   85   85   85   86   86   86   86   87   87
## [166]   87   87   87   88   88   88   89   89   89   89   89   91   91   93   93
## [181]   94   94   96   97   97   97   98   98   98   99   99  100  100  101  102
## [196]  102  102  102  102  103  104  105  105  105  105  105  106  107  107  107
## [211]  108  108  108  108  109  109  109  109  110  110  110  110  111  111  112
## [226]  112  113  113  113  113  114  114  114  114  114  114  115  115  116  116
## [241]  117  117  117  117  118  118  119  119  119  119  119  120  120  120  121
## [256]  121  122  122  122  123  124  124  125  125  125  125  125  125  126  127
## [271]  127  127  127  128  128  128  129  129  130  130  130  131  131  131  132
## [286]  132  132  132  132  133  133  134  135  135  135  137  137  138  138  138
## [301]  138  138  139  139  140  141  142  142  142  142  143  143  143  143  144
## [316]  144  144  144  145  145  145  146  146  146  146  146  146  148  148  148
## [331]  149  150  150  150  151  151  152  152  152  153  153  153  153  154  155
## [346]  155  155  156  156  157  157  157  157  158  158  158  158  160  161  161
## [361]  161  161  162  162  163  163  164  164  165  165  166  167  169  169  170
## [376]  170  170  171  171  172  172  175  175  175  176  178  178  180  181  181
## [391]  181  182  182  182  183  183  184  184  184  185  185  185  186  188  190
## [406]  190  190  191  192  192  193  193  193  194  194  194  194  194  194  195
## [421]  195  195  195  196  196  196  197  198  198  199  200  200  201  201  201
## [436]  202  203  204  205  206  206  208  208  208  208  208  209  209  209  210
## [451]  210  210  210  212  212  212  213  213  213  213  214  215  215  216  217
## [466]  220  220  221  221  222  222  222  224  225  226  226  227  229  229  230
## [481]  230  231  232  232  232  232  233  233  234  234  235  235  236  237  237
## [496]  238  238  240  243  243  244  245  245  245  247  247  247  248  249  251
## [511]  251  251  251  252  252  253  253  253  254  255  257  257  258  258  258
## [526]  258  259  259  259  260  260  260  261  264  265  265  266  267  267  268
## [541]  270  271  272  272  273  273  274  274  274  278  278  278  279  281  283
## [556]  283  283  284  285  287  288  290  291  292  292  293  293  294  295  296
## [571]  296  297  300  300  302  303  304  305  306  306  313  314  315  315  315
## [586]  315  318  319  321  324  325  326  326  328  329  331  334  334  339  339
## [601]  339  341  342  347  350  352  352  355  358  359  361  363  365  365  366
## [616]  369  371  371  374  376  378  383  405  407  409  420  429  429  435  437
## [631]  437  442  457  458  464  467  473  482  484  495  502  535  542  573  579
## [646]  602  765 1305 1582
# get the number ('index value') of the observation that contains the 25th percentile
London.Ambulance.Q1 <- (London.Ambulance.Obs + 1)/4
# inspect the result
London.Ambulance.Q1
## [1] 162.5
# get the number ('index value') of the observation that contains the 75th percentile
London.Ambulance.Q3 <- 3*(London.Ambulance.Obs + 1)/4
# inspect the result
London.Ambulance.Q3
## [1] 487.5
# get the 25th percentile
London.Ambulance.Ordered[163]
## [1] 86
# get the 75the percentile
London.Ambulance.Ordered[488]
## [1] 233
# get the interquartile range
233 - 86
## [1] 147

As explained at the very end of this lecture video, we can also visually represent our range, median, and interquartile range using a box and whisker plot:

# make a quick boxplot of our assault incident variable
boxplot(London.Ambulance$Assault_09_11, horizontal=TRUE)

Questions

  1. There is a large difference between the range that we calculated and the interquartile range that we calculated. What does this mean?
  2. The 25th and 75th percentile in the example do not return integer but a fraction (i.e. 162.5 and 487.5). Why do we use 163 and 488 to extract our percentile values and not 162 and 487?

Now, let’s move to the standard deviation. Remember: this is one of the most important measures of dispersion and is widely used in all kinds of statistics. The calculation involves the following steps as extensively explained in this lecture video:

  1. Calculate the mean.
  2. Subtract the mean from each observation to get a residual.
  3. Square each residual.
  4. Sum all residuals.
  5. Divide by \(n-1\).
  6. Take the square root of the final number.
# calculate the mean
London.Ambulance.Mean <- mean(London.Ambulance$Assault_09_11)
# subtract the mean from each observation
London.Ambulance.Res <- London.Ambulance$Assault_09_11 - London.Ambulance.Mean
# square each residual
London.Ambulance.Res.Sq <- London.Ambulance.Res**2
# sum all squared residuals
London.Ambulance.Res.Sum <- sum(London.Ambulance.Res.Sq)
# divide the sum of all sqaured residuals by n - 1
London.Ambulance.Variance <- London.Ambulance.Res.Sum / (length(London.Ambulance$Assault_09_11) - 1)
# take the square root of the final number
London.Ambulance.Sd <- sqrt(London.Ambulance.Variance)
# standard deviation
London.Ambulance.Sd
## [1] 130.3482

There we go. We got our standard deviation! You probably already saw this coming, but R does have some built-in functions to actually calculate these descriptive statistics for us: range(), IQR(), and sd() will do all the hard work for us!

# range
range(London.Ambulance$Assault_09_11)
## [1]    0 1582
# interquartile range
IQR(London.Ambulance$Assault_09_11)
## [1] 147
# standard deviation
sd(London.Ambulance$Assault_09_11)
## [1] 130.3482

Note
Please be aware that the IQR() function may give slighlty different results in some cases when compared to a manual calculation. This is because the forumula that the IQR() function uses is slightly different than the formula that we have used in our manual calculation. It is noted in the documentation of the IQR() function that: “Note that this function computes the quartiles using the quantile function rather than following Tukey’s recommendations, i.e., IQR(x) = quantile(x, 3/4) - quantile(x, 1/4).”

Questions

  1. What does it mean that we have a standard deviation of 130.3482?
  2. Given the context of the data, do you think this is a low or a high standard deviation?

To make things even easier, R also has a summary() function that calculates a number of these routine statistics simultaneously. After running the summary() function on our assault incident variable, you should see you get the minimum (Min.) and maximum (Max.) values of the assault_09_11 column; its first (1st Qu.) and third (3rd Qu.) quartiles that comprise the interquartile range; its the mean and the median. The built-in R summary() function does not calculate the standard deviation. There are functions in other libraries that calculate more detailed descriptive statistics,

# calculate the most common descriptive statistics for the assault incident variable
summary(London.Ambulance$Assault_09_11)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    86.0   146.0   173.5   233.0  1582.0

Recap

In this section you have learnt how to:

  1. Calculate the range, interquartile range, and standard deviation in R.
  2. Create a simple boxplot using the boxplot() function.
  3. Quickly get common descriptive statistics using the summary() function.

2.5 Seminar task & questions

Seminar Task: Use the seminar, you will be continuing with the data set Ambulance and Assault Incidents data.csv. Still work with the data frame object named as London.Ambulance.

  • Create a new object / data set that only contains data for ward type Suburbs and Small Towns. Hint: Try to subset the data by filtering it based on WardType.
  • Calculate the mode, median, mean, range, interquartile range, and standard deviation for the Assault_09_11 variable for Suburbs and Small Towns
  • Produce a boxplot() that provides a visual description of it’s distribution

Seminar questions

Compare the results of the descriptive statistics you have calculated for your Suburbs and Small Towns object with the results of the descriptive statistics you have calculated for you Prospering Metropolitan object / data set.

  • What do these differences tell us about the levels of violent assaults within these separate environments?
  • Create a dual boxplot to show a visual representation of this comparison.

Important Note: The solution codes will be released next week on Monday. Oh yeah - before you leave, do save your R script by pressing the Save button in the script window. You survived - that is it for this week!