5  Data wrangling

“Without clean data, or clean enough data, your data science is worthless.”

–Michael Stonebraker

Data wrangling refers to the process of transforming and preparing raw data into a clean and structured format suitable for analysis. This involves various steps, such as data cleaning, reshaping, merging, and filtering, to ensure the dataset is ready for statistical analysis or visualisation.

In most cases, data wrangling can occupy a significant portion of the time required for data analysis, often more than other stages. In this chapter, we will explore common operations in data wrangling. You will learn how to perform these operations using either base R functions, tidyverse functions, or both.

5.1 Load packages

The following packages will be used in this chapter. Please run these lines of code before proceeding to other sections.

library(tidyverse)

5.2 Indexing

Indexing involves selecting specific elements within data structures, which can be done using [].

5.2.1 Vector

First, let’s create a vector.

vec_data <- 1:10
vec_data
 [1]  1  2  3  4  5  6  7  8  9 10

To select specific elements:

# Select a single element
vec_data[2]
[1] 2
# Select 3rd and 6th element
vec_data[c(3,6)]
[1] 3 6

5.2.2 Data frame

We will use iris, a built-in dataset in R, to demonstrate indexing within a data frame. Detailed information about this dataset can be accessed by typing ?iris in the Console.

?iris

Below is a summary of this dataset.

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

We will use [], where the general syntax is [row, column].

# Selecting the 1st row
iris[1, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
# Selecting the 1st and 2nd row
iris[c(1, 2), ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

Now let’s see how to index columns.

# Selecting the 1st row
iris[, 1]

# Selecting the 1st and 2nd row
iris[, c(1, 2)]
[1] 5.1 4.9 4.7 4.6 5.0 5.4
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

By default, this shows the entire column, but here we’ll limit the output to the first six items for clarity.

Instead of numbers, we can also use column names:

# Selecting the 1st row
iris[, "Sepal.Length"]

# Selecting the 1st and 2nd row
iris[, c("Sepal.Length", "Sepal.Width")]
[1] 5.1 4.9 4.7 4.6 5.0 5.4
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

To select both a specific row and column, we can combine what we’ve learned. For example, to select the first row and the first column:

# Approach 1
iris[1, 1]
[1] 5.1
# Approach 2
iris[1, "Sepal.Length"]
[1] 5.1

To select the first five rows and the first two columns:

# Approach 1
iris[1:5, 1:2]
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
# Approach 2
iris[1:5, c("Sepal.Length", "Sepal.Width")]
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6

For selecting a single column, an easier approach is to use $:

iris$Petal.Length
[1] 1.4 1.4 1.3 1.5 1.4 1.7

5.2.3 Selecting and slicing

In R, there are many ways to perform tasks. Rather than using [], dplyr provides select() and slice(), which are often preferred for their readability. The dplyr package is part of the tidyverse.

The select() function is used to choose specific columns.

# Select a single column
iris %>% 
  select(Sepal.Length)
  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0
6          5.4
# Select several columns
iris %>% 
  select(Sepal.Length, Sepal.Width)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
4          4.6         3.1
5          5.0         3.6
6          5.4         3.9

Similarly, slice() is used to extract specific rows.

# Select a single row
iris %>% 
  slice(100)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.7         2.8          4.1         1.3 versicolor
# Select several rows
iris %>% 
  slice(2, 5, 100)
  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          4.9         3.0          1.4         0.2     setosa
2          5.0         3.6          1.4         0.2     setosa
3          5.7         2.8          4.1         1.3 versicolor

By combining both select() and slice(), we can access specific rows and columns.

iris %>% 
  select(Sepal.Length) %>% 
  slice(1:5)
  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0

5.3 Filtering

Filtering allows us to select rows based on a condition. For example, if we want to filter the Species column for the value "setosa" in the iris dataset, we start by creating an index:

ind <- iris$Species == "setosa"

Next, we apply the index to the dataset.

iris[ind, ]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The == symbol is a logical operator. Table 5.1 presents the most common logical operators in R.

Table 5.1: Common logical operators in R.
Operators Description
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to

Alternatively, we can use filter() from dplyr for the same result.

iris %>% 
  filter(Species == "setosa")
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

filter() is often more readable, especially for beginners, though both methods yield the same output.

Additionally, we can combine multiple conditions using | (or) and & (and). For instance, to filter the iris dataset for rows where:

  1. Species is "setosa", and
  2. Sepal.Length is greater than 5.6 cm:
# Make an index
ind2 <- iris$Species == "setosa" & iris$Sepal.Length > 5.6

# Apply the index to the dataset
iris[ind2, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
19          5.7         3.8          1.7         0.3  setosa

Or, with filter():

iris %>% 
  filter(Species == "setosa" & Sepal.Length > 5.6)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.8         4.0          1.2         0.2  setosa
2          5.7         4.4          1.5         0.4  setosa
3          5.7         3.8          1.7         0.3  setosa

5.4 Sorting

Sorting arranges the rows of a data frame by specific column values.

To demonstrate, let’s limit iris to its first ten rows for easier display. Using head() will give the top ten rows, while tail() provides the last rows.

iris_top10 <- iris %>% head(10)
iris_top10
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

To sort by a single column, use sort(). Setting decreasing = FALSE sorts in ascending order.

sort(iris_top10$Sepal.Length, decreasing = FALSE)
 [1] 4.4 4.6 4.6 4.7 4.9 4.9 5.0 5.0 5.1 5.4

Another useful function is order(), which returns the indices and can arrange the entire dataset.

# Make an index
ind3 <- order(iris_top10$Sepal.Length, decreasing = FALSE)

# Apply the index to the dataset
iris_top10[ind3, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
9           4.4         2.9          1.4         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
7           4.6         3.4          1.4         0.3  setosa
3           4.7         3.2          1.3         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
5           5.0         3.6          1.4         0.2  setosa
8           5.0         3.4          1.5         0.2  setosa
1           5.1         3.5          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa

With tidyverse, we have arrange(), equivalent to order() in base R:

iris_top10 %>% 
  arrange(Sepal.Length)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           4.4         2.9          1.4         0.2  setosa
2           4.6         3.1          1.5         0.2  setosa
3           4.6         3.4          1.4         0.3  setosa
4           4.7         3.2          1.3         0.2  setosa
5           4.9         3.0          1.4         0.2  setosa
6           4.9         3.1          1.5         0.1  setosa
7           5.0         3.6          1.4         0.2  setosa
8           5.0         3.4          1.5         0.2  setosa
9           5.1         3.5          1.4         0.2  setosa
10          5.4         3.9          1.7         0.4  setosa

To sort in descending order, use desc():

iris_top10 %>% 
  arrange(desc(Sepal.Length))
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.4         3.9          1.7         0.4  setosa
2           5.1         3.5          1.4         0.2  setosa
3           5.0         3.6          1.4         0.2  setosa
4           5.0         3.4          1.5         0.2  setosa
5           4.9         3.0          1.4         0.2  setosa
6           4.9         3.1          1.5         0.1  setosa
7           4.7         3.2          1.3         0.2  setosa
8           4.6         3.1          1.5         0.2  setosa
9           4.6         3.4          1.4         0.3  setosa
10          4.4         2.9          1.4         0.2  setosa

5.5 Rename

First, let’s check the column names.

colnames(iris_top10)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

Suppose we want to rename the species column to type.

colnames(iris_top10)[5] <- "type"

Now, if we check the column names again:

colnames(iris_top10)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "type"        

Alternatively, we can use the rename() function from dplyr. Here, we will rename Sepal.Length to LengthofSepal.

# Change the column name
iris_top10 <- 
  iris_top10 %>% 
  rename(LengthofSepal = "Sepal.Length")

# Check the column names
colnames(iris_top10)
[1] "LengthofSepal" "Sepal.Width"   "Petal.Length"  "Petal.Width"  
[5] "type"         

5.6 Create new column

Let’s create a smaller iris dataset first.

iris_bottom10 <- tail(iris, 10)

Using base R functions, we can create a new column as follows:

iris_bottom10$Sepal.Lengthx2 <- iris_bottom10$Sepal.Length * 2
head(iris_bottom10)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Sepal.Lengthx2
141          6.7         3.1          5.6         2.4 virginica           13.4
142          6.9         3.1          5.1         2.3 virginica           13.8
143          5.8         2.7          5.1         1.9 virginica           11.6
144          6.8         3.2          5.9         2.3 virginica           13.6
145          6.7         3.3          5.7         2.5 virginica           13.4
146          6.7         3.0          5.2         2.3 virginica           13.4

Here, we create a new variable Sepal.Lengthx2 by multiplying Sepal.Length by 2. Using tidyverse, we can achieve this with mutate().

iris_bottom10 <- 
  iris_bottom10 %>% 
  mutate(Sepal.Widthx2 = Sepal.Width * 2)
head(iris_bottom10)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Sepal.Lengthx2
141          6.7         3.1          5.6         2.4 virginica           13.4
142          6.9         3.1          5.1         2.3 virginica           13.8
143          5.8         2.7          5.1         1.9 virginica           11.6
144          6.8         3.2          5.9         2.3 virginica           13.6
145          6.7         3.3          5.7         2.5 virginica           13.4
146          6.7         3.0          5.2         2.3 virginica           13.4
    Sepal.Widthx2
141           6.2
142           6.2
143           5.4
144           6.4
145           6.6
146           6.0

5.7 Change data format

Data can be structured in two primary formats:

  • Long format: Each row represents a single observation.
  • Wide format: Each subject has one row, with variables across columns.

In long format, each row represents a single observation. This format is commonly used for data manipulation and analysis. Let’s create an example of a long-format dataset. Here, five participants were given a dietary supplement to reduce weight, and the data contains:

  1. id: participant ID
  2. time: either pre- or post-supplement
  3. weight: participant’s weight
# Set seed for reproducibility
set.seed(123) 

# Create a long format data
data_long <- data.frame(
  id = rep(1:5, each = 2),        
  time = rep(c("Pre", "Post"), 5),   
  weight =  sample(x = 60:80, size = 10, replace = TRUE)              
  )

# View the data
data_long
   id time weight
1   1  Pre     74
2   1 Post     78
3   2  Pre     73
4   2 Post     62
5   3  Pre     69
6   3 Post     77
7   4  Pre     70
8   4 Post     64
9   5  Pre     79
10  5 Post     73

In wide format, each subject has a single row, with each measurement in a separate column. Here is data_long converted to wide format.

# Set seed for reproducibility
set.seed(123)

# Create a wide format data
data_wide <- data.frame(
  id = 1:5,
  Pre = sample(60:80, 5, replace = TRUE),
  Post = sample(60:80, 5, replace = TRUE)
  )

# View the data
data_wide
  id Pre Post
1  1  74   77
2  2  78   70
3  3  73   64
4  4  62   79
5  5  69   73

In the wide format, data is often easier to interpret. Converting between long and wide formats is simple in R. To transform data_wide into long format, use pivot_longer():

data_long2 <- 
  data_wide %>% 
  pivot_longer(cols = 2:3, names_to = "time", values_to = "weight")
data_long2
# A tibble: 10 × 3
      id time  weight
   <int> <chr>  <int>
 1     1 Pre       74
 2     1 Post      77
 3     2 Pre       78
 4     2 Post      70
 5     3 Pre       73
 6     3 Post      64
 7     4 Pre       62
 8     4 Post      79
 9     5 Pre       69
10     5 Post      73

For pivot_longer(), we need to supply three arguments:

  • cols: columns to be changed into a long format excluding the id column
  • names_to: name of a new column which consist of column names from a wide format data
  • values_to: name of a new column which consist of values from cols

To convert data_long2 back to wide format, use pivot_wider():

data_wide2 <- 
  data_long2 %>% 
  pivot_wider(id_cols = "id", names_from = "time", values_from = "weight")
data_wide2
# A tibble: 5 × 3
     id   Pre  Post
  <int> <int> <int>
1     1    74    77
2     2    78    70
3     3    73    64
4     4    62    79
5     5    69    73

To use pivot_wider(), the three basic arguments needed are:

  • id_cols: ID column
  • names_from: name of a column to get the column names for the wide data
  • values_from: name of a column to get the values from

Both functions have additional arguments detailed in the Help pane (use ? in front of the function name).

5.8 Change variable type

Here are the main variable types in R:

Table 5.2: Example of each of variable type in R.
Variables Examples
Integer 100, 77
Numeric 100.2, 77.8
Character “hello”, “ahmad”
Logical TRUE, FALSE
Factor “male”, “female”
Date 9/7/2024, 9 July 2024

The most common types in data are numeric, factor, and date. When importing data from software such as SPSS, STATA, or Excel, R may not always recognise the correct types. Let’s create a sample dataset and explore handling these issues.

# Create a data frame with mixed-up variable types
data_messed_up <- data.frame(
  id = as.character(1:6),                
  score = as.character(sample(1:100, 6)),    
  gender = rep(c("Male", "Female"), length.out = 6)
  )

# View the variable types
str(data_messed_up)
'data.frame':   6 obs. of  3 variables:
 $ id    : chr  "1" "2" "3" "4" ...
 $ score : chr  "25" "90" "91" "69" ...
 $ gender: chr  "Male" "Female" "Male" "Female" ...
# View the data
data_messed_up
  id score gender
1  1    25   Male
2  2    90 Female
3  3    91   Male
4  4    69 Female
5  5    98   Male
6  6    57 Female

We can see that score should be numeric and gender a factor. To convert these types, use as.numeric() and as.factor(). Let’s create a copy of data_messed_up for demonstration purposes.

data_messed_up2 <- data_messed_up

Using base R functions, we can adjust the types:

# Change score column to numeric
data_messed_up$score <- as.numeric(data_messed_up$score)

# Change gender column to factor
data_messed_up$gender <- as.factor(data_messed_up$gender)

# View variable type
str(data_messed_up)
'data.frame':   6 obs. of  3 variables:
 $ id    : chr  "1" "2" "3" "4" ...
 $ score : num  25 90 91 69 98 57
 $ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1

Using tidyverse, we can achieve the same result with mutate():

# Change variable types for score and gender
data_messed_up2 <- 
  data_messed_up2 %>% 
  mutate(score = as.numeric(score),
         gender = as.factor(gender))

# View variable type
str(data_messed_up2)
'data.frame':   6 obs. of  3 variables:
 $ id    : chr  "1" "2" "3" "4" ...
 $ score : num  25 90 91 69 98 57
 $ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1

5.8.1 Handling date

Dates can be tricky. For date variables, the standard format is YYYY-MM-DD. Using lubridate, we can work with various date formats easily. Let’s look at a few examples:

# Using base R 
date_data <- as.Date("2024-11-30")
str(date_data)
 Date[1:1], format: "2024-11-30"
# Using lubridate
date_data2 <- as_date("2024-11-30")
str(date_data2)
 Date[1:1], format: "2024-11-30"

For non-standard formats, use lubridate functions ymd(), dmy(), and mdy().

# DD-MM-YYYY
date_data3 <- dmy("30-11-2024")
str(date_data3)
 Date[1:1], format: "2024-11-30"
# MM-DD-YY
date_data4 <- mdy("11-30-2024")
str(date_data4)
 Date[1:1], format: "2024-11-30"

Correctly formatted date variables allow for operations such as:

  • Date calculation: adding days to a date.

    date_data4 + 7 
    [1] "2024-12-07"
  • Extracting date components usch as month and year.

    # Extract month
    month(date_data4)
    [1] 11
    # Extract year
    year(date_data4)
    [1] 2024

To demonstrate dates in a dataset, we’ll recreate data_messed_up with a date column.

# Create a data frame with mixed-up variable types
data_messed_up <- data.frame(
  id = as.character(1:6),                
  score = as.character(sample(1:100, 6)),    
  gender = rep(c("Male", "Female"), length.out = 6),
  date = as.character(Sys.Date() - 1:6)
  )

# View variable type
str(data_messed_up)
'data.frame':   6 obs. of  4 variables:
 $ id    : chr  "1" "2" "3" "4" ...
 $ score : chr  "92" "9" "93" "72" ...
 $ gender: chr  "Male" "Female" "Male" "Female" ...
 $ date  : chr  "2024-12-07" "2024-12-06" "2024-12-05" "2024-12-04" ...

The date column is recognised as a character. To convert date to a date format, use as_date().

# Change variable types for score, gender, and date
data_messed_up <- 
  data_messed_up %>% 
  mutate(score = as.numeric(score),
         gender = as.factor(gender),
         date = as_date(date))

# View variable type
str(data_messed_up)
'data.frame':   6 obs. of  4 variables:
 $ id    : chr  "1" "2" "3" "4" ...
 $ score : num  92 9 93 72 26 7
 $ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
 $ date  : Date, format: "2024-12-07" "2024-12-06" ...

The simplest solution for the date issue is to make sure we properly input the date according to the right format (YYYY-MM-DD) during the data collection process properly.

5.9 Merge datasets

If you collect data from different sources, you may need to combine datasets by row or column. rbind() combines datasets by row, while cbind() combines them by column.

Let us see how to combine two iris datasets.

# Data collected from area A
data1 <- iris

# Data collected from area B
data2 <- iris

# Combine both datasets by a row
data_combined_row <- rbind(data1, data2)

Now, we can check the dimensions of the data. The first element represents a row and the second element represents the column.

# Dimension of data1
dim(data1)
[1] 150   5
# Dimension of data2
dim(data2)
[1] 150   5
# Dimension of the combined data
dim(data_combined_row)
[1] 300   5

We can see that the rows of data1 and data2 each are 150. Combining both data by a row gives us 300 rows. It is to be noted that to use rbind(), both data should have the same column numbers and names. Additionally, rbind() is not limited to two data only.

Next, let us see the cbind(). Using the same data, we can combine both data by a column.

# Combine both datasets by a column
data_combined_col <- cbind(data1, data2)

Thus, by further checking the dimension, we can see that the total column of data_combined_col is 10, which is the sum of 5 columns in each of the data1 and data2, while the row remained the same.

# Dimension of data1
dim(data1)
[1] 150   5
# Dimension of data2
dim(data2)
[1] 150   5
# Dimension of the combined data
dim(data_combined_col)
[1] 150  10

However, similar to rbind(), to use the cbind(), the rows of both data should be identical. If participant A is in the first row of data1, the, in data2, participant A should also be in the first row.

There is another set of functions that is more efficient than cbind(), in which we can combine two or more datasets according to the id. Let us create two datasets that are related and have the same ID.

# Set seed for reproducibility
set.seed(123)

# Create the first half of the data
data_half1 <- data.frame(
  name = c("Ahmad", "Ali", "Cheng", "Rama", "Wei"),
  height_cm = sample(160:180, 5, replace = FALSE)
  )

# Create the second half of the data
data_half2 <- data.frame(
  name = c("Ahmad", "Ali", "Cheng", "Rama", "Karim"),
  weight_kg = sample(70:90, 5, replace = FALSE)
  )

# The first dataset
data_half1
   name height_cm
1 Ahmad       174
2   Ali       178
3 Cheng       173
4  Rama       162
5   Wei       169
# The second dataset
data_half2
   name weight_kg
1 Ahmad        87
2   Ali        80
3 Cheng        74
4  Rama        83
5 Karim        88

Notice that the last row of both datasets is not similar. To combine both datasets, we can use either left_join() or right_join(). Both produce the same result. If there is a mismatch between the two datasets, left_join() will keep the first dataset as a reference, and any of the id of the second dataset that does not match the first one will be removed. right_join() works similar to the left_join() but in the opposite.

# Combine the data
data_full_left <- 
  data_half1 %>% #first datasets
  left_join(data_half2, by = "name") #second datasets

# View the combined data
data_full_left
   name height_cm weight_kg
1 Ahmad       174        87
2   Ali       178        80
3 Cheng       173        74
4  Rama       162        83
5   Wei       169        NA

We can see that karim in the second dataset is excluded. Let’s see what will happen if we use right_join().

# Combine the data
data_full_right <- 
  data_half1 %>% #first datasets
  right_join(data_half2, by = "name") #second datasets

# View the combined data
data_full_right
   name height_cm weight_kg
1 Ahmad       174        87
2   Ali       178        80
3 Cheng       173        74
4  Rama       162        83
5 Karim        NA        88

We can see that Wei in the first dataset is excluded. To include all the participants, despite the mismatch of the id is by using full_join().

# Combine the data
data_full <- 
  data_half1 %>% 
  full_join(data_half2, by = "name") 

# View the combined data
data_full
   name height_cm weight_kg
1 Ahmad       174        87
2   Ali       178        80
3 Cheng       173        74
4  Rama       162        83
5   Wei       169        NA
6 Karim        NA        88

Both Wei and Karim are kept in the combined dataset. Additionally, there is inner_join(), which will exclude both Wei and Karim. This function keeps the data that the name is present in both datasets only.

# Combine the data
data_full_inner <- 
  data_half1 %>% 
  inner_join(data_half2, by = "name") 

# View the combined data
data_full_inner
   name height_cm weight_kg
1 Ahmad       174        87
2   Ali       178        80
3 Cheng       173        74
4  Rama       162        83

5.10 Chapter summary

In this chapter, we covered the most common and basic operations in data wrangling. More operations were not covered in this chapter as this book is intended for beginners.

To summarise, we have covered:

  • How to select a column and a row
  • How to filter a dataset based on a condition applied to columns
  • How to rename and create a column
  • How to manage variable types
  • How to combine several datasets into one

5.11 Revision

  1. Load mtcars dataset in R.

    head(mtcars)
                       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
  2. Read aboutmtcars.

    ?mtcars
  3. How many cars have mpg > 25?

  4. How many cars have mpg > 25 and a number of carburetors of 2?

  5. How many cars have a V-shaped engine?

  6. Change vs and am into a factor.

  7. Change the wide format data below into a long or tidy format.

    # Set seed for reproducibility
    set.seed(123)  
    
    # Create a wide format data
    data_wide <- data.frame(
      id = 1:10,
      time1 = sample(1:100, 10, replace = TRUE),
      time2 = sample(1:100, 10, replace = TRUE),
      time3 = sample(1:100, 10, replace = TRUE), 
      age = sample(18:80, 10, replace = TRUE)
      )
    
    # View the data frame
    head(data_wide)
      id time1 time2 time3 age
    1  1    31    90    26  32
    2  2    79    91     7  49
    3  3    51    69    42  59
    4  4    14    91     9  62
    5  5    67    57    83  24
    6  6    42    92    36  26

    The result of the long format data should appear like this.

    id age time_points time_minute
    1 32 time1 31
    1 32 time2 90
    1 32 time3 26
    2 49 time1 79
    2 49 time2 91
    2 49 time3 7
    3 59 time1 51
    3 59 time2 69
    3 59 time3 42
    4 62 time1 14
    4 62 time2 91
    4 62 time3 9
    5 24 time1 67
    5 24 time2 57
    5 24 time3 83
    6 26 time1 42
    6 26 time2 92
    6 26 time3 36
    7 58 time1 50
    7 58 time2 9
    7 58 time3 78
    8 27 time1 43
    8 27 time2 93
    8 27 time3 81
    9 40 time1 14
    9 40 time2 99
    9 40 time3 43
    10 44 time1 25
    10 44 time2 72
    10 44 time3 76