Week 2: Introduction II

1 Handling Data Structures in RStudio

1.1 Learning objectives

By the end of this tutorial, you will be able to:

  1. Set your working directory on Mac and Windows
  2. Import a .csv file into RStudio
  3. Understand the structure of a data frame (rows and columns)
  4. Filter data using both numeric and categorical variables
  5. Combine multiple conditions with logical operators to perform further filtering of data
  6. Export data back into a .csv file

These objectives, combined with those from last week, serve as a gateway to learning RStudio and building a strong foundation. Let us begin!

1.2 Setting the Working Directory to Datasets (Length: 00:05:24)

[Watch on YouTube]

Warning

Before we do anything - make sure to have downloaded the dataset for this computer session by clicking [HERE]. It contains the file Primary Schools in Ealing.csv - this comma separated values (CSV) file contains the data needed to follow today’s tutorial.

Instructions In your computer, do the following:

  1. Create a new folder on your desktop and rename the folder GEOG0186
  2. Next, create a new sub-folder within GEOG0186 and rename it as Week 2.
  3. From the downloaded folder Dataset for Week 2, make sure to unzip and transfer ALL the datasets directly to the Week 2 folder.

This part is probably the most important section of this tutorial. We are going to learn how to set the Work Directory. This basically refers to us connecting the RStudio to the folder containing our dataset that we want to import and analyse.

Doing this allows the user to tell RStudio to open data from a folder located somewhere in our computer using something called the Path Location.

The Path Location specifies the whereabouts of the data file(s) stored within a computer. Setting your directory in RStudio, in code, beforehand makes life incredibly easier (than the usual point-and-click approach) in terms of finding, importing, exporting and saving data in and out of RStudio.

To illustrate what a Path Location is – suppose on my desktop’s dashboard on a Mac/Windows there is a folder called GEOG0186, and within that folder, exists another folder called Week 2. Finally, suppose a comma separated value (.csv) data file called Primary Schools in Ealing.csv is store in this folder i.e., Week 2. If via RStudio you want to open this CSV data file located in within the Week 2 folder. You will need to first set the path to Week 2 in RStudio using the setwd() function.

1.2.1 Setting the Path Location on MAC (Length: 00:06:44)

[Watch on YouTube]

For MAC users, the path location would be written as follows, "/Users/accountName/Desktop/GEOG0186/Week 2". You can access this piece of information simply by:

  1. Open the folder GEOG0186
  2. Right-clicking on the folder Week 2 in which the files are stored. A drop-down scroll menu will appear (see image below).

  1. Hold the Option () button on your keyboard down, and click Copy “Week 2” as Pathname
  2. Paste the copied path name into the function setwd() and run the code

For Mac, the setwd() is as follows:

# set work directory in macs
setwd("/Users/accountName/Desktop/GEOG0186/Week 2")

1.2.2 Setting the Path Location on Windows (Length: 00:05:07)

[Watch on YouTube]

For Windows user (on UCL PCs), its marginally different - the path location to this folder on a Windows machine would be written as follows, "N:/Desktop/GEOG00186/Week 2". You can access this piece of information simply by:

  1. Open the GEOG0186 folder to reveal the Week 2 folder.
  2. Open the Week 2 folder where your downloaded data files have been stored.
  3. Now, click on the browser bar at the top which shows GEOG0186 > Week 2. This should highlight and show "N:\Desktop\GEOG0186\Week 2" (see image below).

  1. Now, copy "N:\Desktop\GEOG0186\Week 2" and paste the path name into the setwd() function in your R script.
  2. Lastly, change all the back slashes \ in the path name to forward slashes / and run the code. It should look like this: setwd("N:/Desktop/GEOG0186/Week 2").

For Windows, the setwd() is as follows:

# set work directory in windows (using UCL PC)
setwd("N:/Desktop/GEOG0186/Week 2")

# note that on usual Windows OS on a personal computer - the code looks something like:
setwd("C:/Users/accountName/Desktop/GEOG0186/Week 2")

If you type the code getwd() and quickly run it through console - if it returns the inputted path location - then you have done this correctly.

Now, let us learn how to import a CSV data into RStudio.

1.3 How to Import a Dataset into RStudio (Length: 00:06:21)

[Watch on YouTube]

A CSV file (short for Comma-Separated Values) is one of the most common formats for storing data. Each row is a single record or observation. For example, it can be details about a person, household unit, a city, or even disease events; and while each column is a variable (for example, population counts, temperature readings, or region name etc.).

For this exercise, we will be using the Primary Schools in Ealing.csv dataset to build intuition on understanding the basics of data structure.

Since, we have already set the work directory to folder containing this dataset. Importing it into RStudio would be a piece of cake.

To read a CSV file into R, we can use the read.csv() function import it as a data frame object named school_data using the assignment operator (<-):

# Load data into RStudio. The spreadsheet is stored in the object called 'school_data'
school_data <- read.csv(file = "Primary Schools in Ealing.csv")

The loaded dataset contains up to 58 primary schools with the following variables:

  • SchoolName: Name of school in Ealing (character)
  • Type: School classified as a “Primary” school (character)
  • NumberBoys: Total number of boys in a primary school (integer/)
  • NumberGirls: Total number of girls in a primary school (integer)
  • TotalStudents: Total number of students in a primary school (integer)
  • OfstedGrade: Overall school performance where 1 = “excellent”, 2 = “good”, 3 = “requires improvement”, and 4 = “inadequate” (factor/categorical)
Important

We have covered a lot of the basics in RStudio - i.e., setting the work directory and importing a spreadsheet that is CSV format. The things shown in this section will be used frequently in all future tutorials. So, get used to using setwd() and read.csv().

1.4 Data Structure

1.4.1 Inspecting the Size and Structure (Length: 00:08:16)

[Watch on YouTube]

One can examine the structure of the imported data with the following basic functions.

  • str(): tells the user which columns in the data frame are character or numeric variables, and so on.
  • names(): prints the entire names of the columns present in the data frame
  • head(): allows the user to see the first top 10 rows of the data frame
  • tail(): allows the user to see the last bottom 10 rows of the data frame
  • ncol(): tells the user the total number of columns present in the data frame
  • nrow(): tells the user the total number of observations (or rows) present in the data frame
  • dim(): prints both the total number of observations (or rows) and columns present in the data frame.

Using the following code str() will display the structure of school_data data frame object and some of its contents:

str(school_data)
'data.frame':   58 obs. of  6 variables:
 $ SchoolName   : chr  "Berrymede Junior School" "East Acton Primary School" "Oldfield Primary School" "North Ealing Primary School" ...
 $ Type         : chr  "Primary" "Primary" "Primary" "Primary" ...
 $ NumberBoys   : int  180 160 225 340 265 230 155 285 185 150 ...
 $ NumberGirls  : int  200 165 240 355 205 205 180 290 190 125 ...
 $ TotalStudents: int  377 329 465 697 471 435 335 573 373 273 ...
 $ OfstedGrade  : int  2 2 2 2 2 2 2 3 2 2 ...

Using the following code head() or tail() will display the first or last 10 observations, respectively, in the school_data data frame object:

head(school_data)
                   SchoolName    Type NumberBoys NumberGirls TotalStudents
1     Berrymede Junior School Primary        180         200           377
2   East Acton Primary School Primary        160         165           329
3     Oldfield Primary School Primary        225         240           465
4 North Ealing Primary School Primary        340         355           697
5    St John's Primary School Primary        265         205           471
6    St Mark's Primary School Primary        230         205           435
  OfstedGrade
1           2
2           2
3           2
4           2
5           2
6           2
tail(school_data)
                                       SchoolName    Type NumberBoys
53           St Vincent's Catholic Primary School Primary        240
54 Edward Betham Church of England Primary School Primary        230
55                    Dormers Wells Junior School Primary        205
56                          Grange Primary School Primary        415
57                      Petts Hill Primary School Primary        125
58                       Khalsa VA Primary School Primary        235
   NumberGirls TotalStudents OfstedGrade
53         260           498           2
54         240           471           2
55         200           409           3
56         385           800           2
57         130           254           2
58         180           412           2

You can inspect the data frame object in a Data viewer by using the View() syntax. This should cause the Data viewer window to open showing the full dataset:

View(school_data)

The dataset has 58 rows (primary school observations) and 6 columns (variables). You can use nrow, ncol and dim() functions separately to report the size of your data frame:

# total number of rows
nrow(school_data)
[1] 58
# total number of columns
ncol(school_data)
[1] 6
# full size of data frame i.e., total rows and columns
dim(school_data)
[1] 58  6

1.4.2 Basic Manipulation of Data Frame

1.4.2.1 Subsetting using Rows and Columns (Length: 00:17:06)

[Watch on YouTube]

One can subset or restrict the data frame by specifying which row(s) and column(s) to keep or discard using this standard square bracket syntax dataframe[Row, Column].

Breakdown of dataframe[Row, Column]code:

  1. The bit in the code that corresponds to dataframe from dataframe[Row, Column] represents the name data frame object.
  2. Where it states Row within the square brackets specifies the subsetting, or filter action to be carried out based on rows.
  3. Where it states Column within the square brackets specifies the subsetting, or filter action to be carried out based on columns.

Let’s take our school_data data frame as an example - if we select the first row (row number 1) and the first column (SchoolName), we have essentially filter out only the school name Berrymede Junior School Primary:

data11 <- school_data[1, 1]
data11
[1] "Berrymede Junior School"

If you select only the first row (row number 1) and do not specific any numbers for the columns, you will be filtering out the entire row for row number 1:

row1 <- school_data[1, ]
row1
               SchoolName    Type NumberBoys NumberGirls TotalStudents
1 Berrymede Junior School Primary        180         200           377
  OfstedGrade
1           2

In the same vein, if you select only the column and do not specific any numbers for the rows, you will be filtering out the entire column (i.e., SchoolName variable) for column number 1 which is the list of all primary school names:

column1 <- school_data[, 1]
column1
 [1] "Berrymede Junior School"                           
 [2] "East Acton Primary School"                         
 [3] "Oldfield Primary School"                           
 [4] "North Ealing Primary School"                       
 [5] "St John's Primary School"                          
 [6] "St Mark's Primary School"                          
 [7] "West Twyford Primary School"                       
 [8] "West Acton Primary School"                         
 [9] "Mayfield Primary School"                           
[10] "Beaconsfield Primary and Nursery School"           
[11] "Coston Primary School"                             
[12] "Downe Manor Primary School"                        
[13] "Drayton Green Primary School"                      
[14] "North Primary School"                              
[15] "Ravenor Primary School"                            
[16] "Selborne Primary School"                           
[17] "Hambrough Primary School"                          
[18] "Hobbayne Primary School"                           
[19] "John Perryn Primary School"                        
[20] "Southfield Primary School"                         
[21] "Allenby Primary School"                            
[22] "Blair Peach Primary School"                        
[23] "Clifton Primary School"                            
[24] "Dairy Meadow Primary School"                       
[25] "Derwentwater Primary School"                       
[26] "Durdans Park Primary School"                       
[27] "Fielding Primary School"                           
[28] "Gifford Primary School"                            
[29] "Greenwood Primary School"                          
[30] "Havelock Primary School and Nursery"               
[31] "Horsenden Primary School"                          
[32] "Willow Tree Primary School"                        
[33] "Lady Margaret Primary School"                      
[34] "Little Ealing Primary School"                      
[35] "Oaklands Primary School"                           
[36] "Perivale Primary School"                           
[37] "Stanhope Primary School"                           
[38] "Viking Primary School"                             
[39] "Wolf Fields Primary School"                        
[40] "Featherstone Primary and Nursery School"           
[41] "Three Bridges Primary School"                      
[42] "Montpelier Primary School"                         
[43] "Tudor Primary School"                              
[44] "Hathaway Primary School"                           
[45] "Vicar's Green Primary School"                      
[46] "Mount Carmel Catholic Primary School"              
[47] "Our Lady of the Visitation Catholic Primary School"
[48] "St John Fisher Catholic Primary School"            
[49] "St Anselm's Catholic Primary School"               
[50] "St Gregory's Catholic Primary School"              
[51] "St Joseph's Catholic Primary School"               
[52] "St Raphael's Catholic Primary School"              
[53] "St Vincent's Catholic Primary School"              
[54] "Edward Betham Church of England Primary School"    
[55] "Dormers Wells Junior School"                       
[56] "Grange Primary School"                             
[57] "Petts Hill Primary School"                         
[58] "Khalsa VA Primary School"                          

What if we wanted to filter the following rows numbered 2, 7 and 19 from the school_data data frame object as highlighted in the image below?

We can use the combine function i.e., c() to list those numbers in the square brackets:

rows_filter <- school_data[c(2, 7, 19), ]
rows_filter
                    SchoolName    Type NumberBoys NumberGirls TotalStudents
2    East Acton Primary School Primary        160         165           329
7  West Twyford Primary School Primary        155         180           335
19  John Perryn Primary School Primary        230         230           460
   OfstedGrade
2            2
7            2
19           2

Likewise, what if we wanted to filter the following columns numbered 1, 5 and 6 from the school_data data frame object as highlighted in the image below?

We can also use the combine function i.e., c() to list those numbers for the columns within that square brackets:

columns_filter <- school_data[ , c(1, 5, 6)]
columns_filter
                                           SchoolName TotalStudents OfstedGrade
1                             Berrymede Junior School           377           2
2                           East Acton Primary School           329           2
3                             Oldfield Primary School           465           2
4                         North Ealing Primary School           697           2
5                            St John's Primary School           471           2
6                            St Mark's Primary School           435           2
7                         West Twyford Primary School           335           2
8                           West Acton Primary School           573           3
9                             Mayfield Primary School           373           2
10            Beaconsfield Primary and Nursery School           273           2
11                              Coston Primary School           450           2
12                         Downe Manor Primary School           471           2
13                       Drayton Green Primary School           354           2
14                               North Primary School           419           1
15                             Ravenor Primary School           610           2
16                            Selborne Primary School           554           2
17                           Hambrough Primary School           520           2
18                            Hobbayne Primary School           641           2
19                         John Perryn Primary School           460           2
20                          Southfield Primary School           518           2
21                             Allenby Primary School           263           2
22                         Blair Peach Primary School           506           2
23                             Clifton Primary School           400           2
24                        Dairy Meadow Primary School           469           2
25                        Derwentwater Primary School           708           2
26                        Durdans Park Primary School           528           2
27                            Fielding Primary School           891           2
28                             Gifford Primary School           866           1
29                           Greenwood Primary School           555           2
30                Havelock Primary School and Nursery           434           1
31                           Horsenden Primary School           872           2
32                         Willow Tree Primary School           759           2
33                       Lady Margaret Primary School           671           2
34                       Little Ealing Primary School           669           2
35                            Oaklands Primary School           552           2
36                            Perivale Primary School           467           2
37                            Stanhope Primary School           598           2
38                              Viking Primary School           275           2
39                         Wolf Fields Primary School           421           2
40            Featherstone Primary and Nursery School           722           2
41                       Three Bridges Primary School           455           2
42                          Montpelier Primary School           682           1
43                               Tudor Primary School           453           2
44                            Hathaway Primary School           346           4
45                       Vicar's Green Primary School           342           1
46               Mount Carmel Catholic Primary School           457           2
47 Our Lady of the Visitation Catholic Primary School           471           3
48             St John Fisher Catholic Primary School           467           1
49                St Anselm's Catholic Primary School           239           3
50               St Gregory's Catholic Primary School           619           2
51                St Joseph's Catholic Primary School           558           2
52               St Raphael's Catholic Primary School           589           2
53               St Vincent's Catholic Primary School           498           2
54     Edward Betham Church of England Primary School           471           2
55                        Dormers Wells Junior School           409           3
56                              Grange Primary School           800           2
57                          Petts Hill Primary School           254           2
58                           Khalsa VA Primary School           412           2

What if we wanted to filter on rows numbered 2, 7 and 19, as well as columns numbered 1, 5 and 6 from that school_data data frame object as highlighted in the image below?

full_filter <- school_data[c(2, 7, 19) , c(1, 5, 6)]
full_filter
                    SchoolName TotalStudents OfstedGrade
2    East Acton Primary School           329           2
7  West Twyford Primary School           335           2
19  John Perryn Primary School           460           2

Not too shabby! So far, you have been shown how to do some create subset of data by filtering rows and columns. Let’s take it up a notch on filtering based on row Logical Operators to create conditions.

1.4.2.2 Subsetting using Logical Operators to Combine Conditions (Length: 00:15:38)

[Watch on YouTube]

Often, we want to filter based on more than one condition — for example, “cities in England and with a population over 600000”. RStudio uses logical operators to combine these conditions:

Operator Meaning Example Description
== Equal to country == "England" True if the country is England
!= Not Equal to country != "England" True if for all except England
> Greater than population > 600000 True if population is above 600000
< Less than population < 600000 True if population is under 600000
>= Greater than or eqaul to population >= 600000 True if population is equal to 600000 or more
<= Less than or equal to population <= 600000 True if population is equal to 600000 or less
& AND country == "England" & population <= 600000 Both conditions are true
| OR country == "England" | country == "Wales" True for either England or Wales

Let’s illustrate with the school_data object. We are interested in primary schools with a total of 500 hundred or more students. This operation is always row-based. We will need to use the $ to call the column TotalStudents and the logical operator for this filter is >=:

schools_500plus <- school_data[school_data$TotalStudents >= 500, ]
schools_500plus
                                SchoolName    Type NumberBoys NumberGirls
4              North Ealing Primary School Primary        340         355
8                West Acton Primary School Primary        285         290
15                  Ravenor Primary School Primary        315         295
16                 Selborne Primary School Primary        275         280
17                Hambrough Primary School Primary        275         245
18                 Hobbayne Primary School Primary        340         300
20               Southfield Primary School Primary        275         245
22              Blair Peach Primary School Primary        265         245
25             Derwentwater Primary School Primary        335         370
26             Durdans Park Primary School Primary        275         250
27                 Fielding Primary School Primary        480         410
28                  Gifford Primary School Primary        435         435
29                Greenwood Primary School Primary        290         265
31                Horsenden Primary School Primary        440         435
32              Willow Tree Primary School Primary        385         370
33            Lady Margaret Primary School Primary        360         310
34            Little Ealing Primary School Primary        345         325
35                 Oaklands Primary School Primary        290         260
37                 Stanhope Primary School Primary        305         295
40 Featherstone Primary and Nursery School Primary        365         360
42               Montpelier Primary School Primary        350         330
50    St Gregory's Catholic Primary School Primary        305         310
51     St Joseph's Catholic Primary School Primary        295         265
52    St Raphael's Catholic Primary School Primary        325         265
56                   Grange Primary School Primary        415         385
   TotalStudents OfstedGrade
4            697           2
8            573           3
15           610           2
16           554           2
17           520           2
18           641           2
20           518           2
22           506           2
25           708           2
26           528           2
27           891           2
28           866           1
29           555           2
31           872           2
32           759           2
33           671           2
34           669           2
35           552           2
37           598           2
40           722           2
42           682           1
50           619           2
51           558           2
52           589           2
56           800           2

What if were interested in primary schools with an OFSTED score of 1 (Excellent). Again, this operation is row-based and so we will need to use the $ to call the column OfstedGrade and the logical operator for this filter is ==:

schools_ofsted1 <- school_data[school_data$OfstedGrade == 1, ]
schools_ofsted1
                               SchoolName    Type NumberBoys NumberGirls
14                   North Primary School Primary        215         205
28                 Gifford Primary School Primary        435         435
30    Havelock Primary School and Nursery Primary        245         190
42              Montpelier Primary School Primary        350         330
45           Vicar's Green Primary School Primary        165         180
48 St John Fisher Catholic Primary School Primary        225         240
   TotalStudents OfstedGrade
14           419           1
28           866           1
30           434           1
42           682           1
45           342           1
48           467           1

Here is an example of combining such logical operators - select schools that have more than 650 students and with an OFSTED score of 2.

# logical operators is >, &, ==
schools_650_ofsted2 <- school_data[school_data$TotalStudents > 650 & school_data$OfstedGrade == 2, ]
schools_650_ofsted2
                                SchoolName    Type NumberBoys NumberGirls
4              North Ealing Primary School Primary        340         355
25             Derwentwater Primary School Primary        335         370
27                 Fielding Primary School Primary        480         410
31                Horsenden Primary School Primary        440         435
32              Willow Tree Primary School Primary        385         370
33            Lady Margaret Primary School Primary        360         310
34            Little Ealing Primary School Primary        345         325
40 Featherstone Primary and Nursery School Primary        365         360
56                   Grange Primary School Primary        415         385
   TotalStudents OfstedGrade
4            697           2
25           708           2
27           891           2
31           872           2
32           759           2
33           671           2
34           669           2
40           722           2
56           800           2

I am sure you now get the gist of what is happening here with these logical statements for subsetting, or filtering data accordingly. We are now in the final stretch - let us learn how to save and export a dataset as a CSV spreadsheet.

1.5 Saving your dataset (Length: 00:11:13)

[Watch on YouTube]

This operation is very easy to perform. Once you have created your filtered data, you can use the following function write.csv() to save it as a new CSV file:

write.csv(schools_650_ofsted2, file = "Filtered_example_data.csv", row.names = FALSE)

The above syntax should save a new CSV file named as Filtered_example_data.csv. Do not forget to save your R-script!

1.6 Summary

You have learned the following:

  1. Set your working directory on Mac and Windows
  2. Import a .csv file into RStudio
  3. Understand the structure of a data frame (rows and columns)
  4. Filter data using both numeric and categorical variables
  5. Combine multiple conditions with logical operators to perform further filtering of data
  6. Export data back into a .csv file

2 Worksheet One: Workflow and Generating Data in RStudio

In this worksheet, we will put into practice some of the basics we learned in Introduction I and II. The aim of this worksheet is to transcribe the image data containing stormwater drain infrastructure and sanitation information using the following R code syntax:

  • Assignment operator <-
  • Combine function c()
  • Data frame function data.frame()
  • Display data structure of output str()
  • Using logical operators
  • Save function write.csv()

Scenario: Stormwater Drainage Survey Data

In Accra, between June and July 2024, we conducted a stormwater drainage survey to identify exposed drainage points along street segments that were clogged with solid waste (e.g., plastics and rubbish) and stagnant water, with the aim of reducing the burden of poor environmental sanitation and the risk of mosquito-borne infestation.

Following information has been captured in annotated images:

Use the information in the white text box of each image to construct a data frame in RStudio.

  1. Use the c( ) function with assignment operator (<-) to create the following vector objects with the names:
  • GoProID: it contains the image ID numbers
  • Latitude: it contains the y-coordinate of surveyed location
  • Longitude: it contains the x-coordinate of surveyed location
  • Sanitation: it contains information about the sanitation state of drain
  • SolidWaste: it describes presence or absence of solid waste materials in drain
  • Structure: it describes whether the structure of the drain was damaged (or not) or it being a “run off” drain
  • Stagnation: refers to flow obstruction in drain resulting in it be stagnated
  • Mosquitoes: refers to evidence of mosquito breeding in drain due to condition
  1. Use the data.frame() with the assignment operator (<-) to create the data frame object drainage_data.
  2. Use the str() to describe the structure of drainage_data data frame.
  3. Use the following row conditions i.e., drains classified as having Poor sanitary state AND it is a Breeding Spot for mosquitoes at the same time to perform a filter on the drainage_data object. At the same time, limit the filtered data to the following columns: GoProID, Latitude, Longitude, Sanitation and Mosquitoes
  4. Use the write.csv() to save your new filtered dataset.
Important

Please use the answer sheet HERE to insert a screenshot of the dataset, paste the full code syntax, and provide an interpretation of the dataset and workflow. Note that this Worksheet 1, in particular, along with Worksheets 2 and 3, will only be assessed in person, and feedback will also be given in person. Whereas Worksheets 4 and 5 - London Air Quality - will be formatively marked accordingly. You need to submit at least 2 (out of 5) QSkills worksheets in your final portfolio.