library(tidyverse)
5 Data wrangling
“Without clean data, or clean enough data, your data science is worthless.”
–Michael Stonebraker
Data wrangling refers to the process of transforming and preparing raw data into a clean and structured format suitable for analysis. This involves various steps, such as data cleaning, reshaping, merging, and filtering, to ensure the dataset is ready for statistical analysis or visualisation.
In most cases, data wrangling can occupy a significant portion of the time required for data analysis, often more than other stages. In this chapter, we will explore common operations in data wrangling. You will learn how to perform these operations using either base R functions, tidyverse
functions, or both.
5.1 Load packages
The following packages will be used in this chapter. Please run these lines of code before proceeding to other sections.
5.2 Indexing
Indexing involves selecting specific elements within data structures, which can be done using []
.
5.2.1 Vector
First, let’s create a vector.
<- 1:10
vec_data vec_data
[1] 1 2 3 4 5 6 7 8 9 10
To select specific elements:
# Select a single element
2] vec_data[
[1] 2
# Select 3rd and 6th element
c(3,6)] vec_data[
[1] 3 6
5.2.2 Data frame
We will use iris
, a built-in dataset in R, to demonstrate indexing within a data frame. Detailed information about this dataset can be accessed by typing ?iris
in the Console.
?iris
Below is a summary of this dataset.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
We will use []
, where the general syntax is [row, column]
.
# Selecting the 1st row
1, ] iris[
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
# Selecting the 1st and 2nd row
c(1, 2), ] iris[
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
Now let’s see how to index columns.
# Selecting the 1st row
1]
iris[,
# Selecting the 1st and 2nd row
c(1, 2)] iris[,
[1] 5.1 4.9 4.7 4.6 5.0 5.4
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
By default, this shows the entire column, but here we’ll limit the output to the first six items for clarity.
Instead of numbers, we can also use column names:
# Selecting the 1st row
"Sepal.Length"]
iris[,
# Selecting the 1st and 2nd row
c("Sepal.Length", "Sepal.Width")] iris[,
[1] 5.1 4.9 4.7 4.6 5.0 5.4
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
To select both a specific row and column, we can combine what we’ve learned. For example, to select the first row and the first column:
# Approach 1
1, 1] iris[
[1] 5.1
# Approach 2
1, "Sepal.Length"] iris[
[1] 5.1
To select the first five rows and the first two columns:
# Approach 1
1:5, 1:2] iris[
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
# Approach 2
1:5, c("Sepal.Length", "Sepal.Width")] iris[
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
For selecting a single column, an easier approach is to use $
:
$Petal.Length iris
[1] 1.4 1.4 1.3 1.5 1.4 1.7
5.2.3 Selecting and slicing
In R, there are many ways to perform tasks. Rather than using []
, dplyr
provides select()
and slice()
, which are often preferred for their readability. The dplyr
package is part of the tidyverse
.
The select()
function is used to choose specific columns.
# Select a single column
%>%
iris select(Sepal.Length)
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
# Select several columns
%>%
iris select(Sepal.Length, Sepal.Width)
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
Similarly, slice()
is used to extract specific rows.
# Select a single row
%>%
iris slice(100)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 2.8 4.1 1.3 versicolor
# Select several rows
%>%
iris slice(2, 5, 100)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 5.0 3.6 1.4 0.2 setosa
3 5.7 2.8 4.1 1.3 versicolor
By combining both select()
and slice()
, we can access specific rows and columns.
%>%
iris select(Sepal.Length) %>%
slice(1:5)
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
5.3 Filtering
Filtering allows us to select rows based on a condition. For example, if we want to filter the Species
column for the value "setosa"
in the iris
dataset, we start by creating an index:
<- iris$Species == "setosa" ind
Next, we apply the index to the dataset.
iris[ind, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The ==
symbol is a logical operator. Table 5.1 presents the most common logical operators in R.
Operators | Description |
---|---|
< | Less than |
> | Greater than |
<= | Less than or equal to |
>= | Greater than or equal to |
== | Equal to |
!= | Not equal to |
Alternatively, we can use filter()
from dplyr
for the same result.
%>%
iris filter(Species == "setosa")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
filter()
is often more readable, especially for beginners, though both methods yield the same output.
Additionally, we can combine multiple conditions using |
(or) and &
(and). For instance, to filter the iris
dataset for rows where:
Species
is"setosa"
, andSepal.Length
is greater than 5.6 cm:
# Make an index
<- iris$Species == "setosa" & iris$Sepal.Length > 5.6
ind2
# Apply the index to the dataset
iris[ind2, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
19 5.7 3.8 1.7 0.3 setosa
Or, with filter()
:
%>%
iris filter(Species == "setosa" & Sepal.Length > 5.6)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.7 4.4 1.5 0.4 setosa
3 5.7 3.8 1.7 0.3 setosa
5.4 Sorting
Sorting arranges the rows of a data frame by specific column values.
To demonstrate, let’s limit iris
to its first ten rows for easier display. Using head()
will give the top ten rows, while tail()
provides the last rows.
<- iris %>% head(10)
iris_top10 iris_top10
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
To sort by a single column, use sort()
. Setting decreasing = FALSE
sorts in ascending order.
sort(iris_top10$Sepal.Length, decreasing = FALSE)
[1] 4.4 4.6 4.6 4.7 4.9 4.9 5.0 5.0 5.1 5.4
Another useful function is order()
, which returns the indices and can arrange the entire dataset.
# Make an index
<- order(iris_top10$Sepal.Length, decreasing = FALSE)
ind3
# Apply the index to the dataset
iris_top10[ind3, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
9 4.4 2.9 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
3 4.7 3.2 1.3 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
1 5.1 3.5 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
With tidyverse
, we have arrange()
, equivalent to order()
in base R:
%>%
iris_top10 arrange(Sepal.Length)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.4 2.9 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.6 3.4 1.4 0.3 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
9 5.1 3.5 1.4 0.2 setosa
10 5.4 3.9 1.7 0.4 setosa
To sort in descending order, use desc()
:
%>%
iris_top10 arrange(desc(Sepal.Length))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.1 3.5 1.4 0.2 setosa
3 5.0 3.6 1.4 0.2 setosa
4 5.0 3.4 1.5 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 4.7 3.2 1.3 0.2 setosa
8 4.6 3.1 1.5 0.2 setosa
9 4.6 3.4 1.4 0.3 setosa
10 4.4 2.9 1.4 0.2 setosa
5.5 Rename
First, let’s check the column names.
colnames(iris_top10)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Suppose we want to rename the species
column to type
.
colnames(iris_top10)[5] <- "type"
Now, if we check the column names again:
colnames(iris_top10)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "type"
Alternatively, we can use the rename()
function from dplyr
. Here, we will rename Sepal.Length
to LengthofSepal
.
# Change the column name
<-
iris_top10 %>%
iris_top10 rename(LengthofSepal = "Sepal.Length")
# Check the column names
colnames(iris_top10)
[1] "LengthofSepal" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "type"
5.6 Create new column
Let’s create a smaller iris dataset first.
<- tail(iris, 10) iris_bottom10
Using base R functions, we can create a new column as follows:
$Sepal.Lengthx2 <- iris_bottom10$Sepal.Length * 2
iris_bottom10head(iris_bottom10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Lengthx2
141 6.7 3.1 5.6 2.4 virginica 13.4
142 6.9 3.1 5.1 2.3 virginica 13.8
143 5.8 2.7 5.1 1.9 virginica 11.6
144 6.8 3.2 5.9 2.3 virginica 13.6
145 6.7 3.3 5.7 2.5 virginica 13.4
146 6.7 3.0 5.2 2.3 virginica 13.4
Here, we create a new variable Sepal.Lengthx2
by multiplying Sepal.Length
by 2. Using tidyverse
, we can achieve this with mutate()
.
<-
iris_bottom10 %>%
iris_bottom10 mutate(Sepal.Widthx2 = Sepal.Width * 2)
head(iris_bottom10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Lengthx2
141 6.7 3.1 5.6 2.4 virginica 13.4
142 6.9 3.1 5.1 2.3 virginica 13.8
143 5.8 2.7 5.1 1.9 virginica 11.6
144 6.8 3.2 5.9 2.3 virginica 13.6
145 6.7 3.3 5.7 2.5 virginica 13.4
146 6.7 3.0 5.2 2.3 virginica 13.4
Sepal.Widthx2
141 6.2
142 6.2
143 5.4
144 6.4
145 6.6
146 6.0
5.7 Change data format
Data can be structured in two primary formats:
- Long format: Each row represents a single observation.
- Wide format: Each subject has one row, with variables across columns.
In long format, each row represents a single observation. This format is commonly used for data manipulation and analysis. Let’s create an example of a long-format dataset. Here, five participants were given a dietary supplement to reduce weight, and the data contains:
id
: participant IDtime
: either pre- or post-supplementweight
: participant’s weight
# Set seed for reproducibility
set.seed(123)
# Create a long format data
<- data.frame(
data_long id = rep(1:5, each = 2),
time = rep(c("Pre", "Post"), 5),
weight = sample(x = 60:80, size = 10, replace = TRUE)
)
# View the data
data_long
id time weight
1 1 Pre 74
2 1 Post 78
3 2 Pre 73
4 2 Post 62
5 3 Pre 69
6 3 Post 77
7 4 Pre 70
8 4 Post 64
9 5 Pre 79
10 5 Post 73
In wide format, each subject has a single row, with each measurement in a separate column. Here is data_long
converted to wide format.
# Set seed for reproducibility
set.seed(123)
# Create a wide format data
<- data.frame(
data_wide id = 1:5,
Pre = sample(60:80, 5, replace = TRUE),
Post = sample(60:80, 5, replace = TRUE)
)
# View the data
data_wide
id Pre Post
1 1 74 77
2 2 78 70
3 3 73 64
4 4 62 79
5 5 69 73
In the wide format, data is often easier to interpret. Converting between long and wide formats is simple in R. To transform data_wide
into long format, use pivot_longer()
:
<-
data_long2 %>%
data_wide pivot_longer(cols = 2:3, names_to = "time", values_to = "weight")
data_long2
# A tibble: 10 × 3
id time weight
<int> <chr> <int>
1 1 Pre 74
2 1 Post 77
3 2 Pre 78
4 2 Post 70
5 3 Pre 73
6 3 Post 64
7 4 Pre 62
8 4 Post 79
9 5 Pre 69
10 5 Post 73
For pivot_longer()
, we need to supply three arguments:
cols
: columns to be changed into a long format excluding the id columnnames_to
: name of a new column which consist of column names from a wide format datavalues_to
: name of a new column which consist of values fromcols
To convert data_long2
back to wide format, use pivot_wider()
:
<-
data_wide2 %>%
data_long2 pivot_wider(id_cols = "id", names_from = "time", values_from = "weight")
data_wide2
# A tibble: 5 × 3
id Pre Post
<int> <int> <int>
1 1 74 77
2 2 78 70
3 3 73 64
4 4 62 79
5 5 69 73
To use pivot_wider()
, the three basic arguments needed are:
id_cols
: ID columnnames_from
: name of a column to get the column names for the wide datavalues_from
: name of a column to get the values from
Both functions have additional arguments detailed in the Help
pane (use ?
in front of the function name).
5.8 Change variable type
Here are the main variable types in R:
Variables | Examples |
---|---|
Integer | 100, 77 |
Numeric | 100.2, 77.8 |
Character | “hello”, “ahmad” |
Logical | TRUE, FALSE |
Factor | “male”, “female” |
Date | 9/7/2024, 9 July 2024 |
The most common types in data are numeric, factor, and date. When importing data from software such as SPSS, STATA, or Excel, R may not always recognise the correct types. Let’s create a sample dataset and explore handling these issues.
# Create a data frame with mixed-up variable types
<- data.frame(
data_messed_up id = as.character(1:6),
score = as.character(sample(1:100, 6)),
gender = rep(c("Male", "Female"), length.out = 6)
)
# View the variable types
str(data_messed_up)
'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : chr "25" "90" "91" "69" ...
$ gender: chr "Male" "Female" "Male" "Female" ...
# View the data
data_messed_up
id score gender
1 1 25 Male
2 2 90 Female
3 3 91 Male
4 4 69 Female
5 5 98 Male
6 6 57 Female
We can see that score
should be numeric and gender
a factor. To convert these types, use as.numeric()
and as.factor()
. Let’s create a copy of data_messed_up
for demonstration purposes.
<- data_messed_up data_messed_up2
Using base R functions, we can adjust the types:
# Change score column to numeric
$score <- as.numeric(data_messed_up$score)
data_messed_up
# Change gender column to factor
$gender <- as.factor(data_messed_up$gender)
data_messed_up
# View variable type
str(data_messed_up)
'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 25 90 91 69 98 57
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
Using tidyverse
, we can achieve the same result with mutate()
:
# Change variable types for score and gender
<-
data_messed_up2 %>%
data_messed_up2 mutate(score = as.numeric(score),
gender = as.factor(gender))
# View variable type
str(data_messed_up2)
'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 25 90 91 69 98 57
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
5.8.1 Handling date
Dates can be tricky. For date variables, the standard format is YYYY-MM-DD
. Using lubridate
, we can work with various date formats easily. Let’s look at a few examples:
# Using base R
<- as.Date("2024-11-30")
date_data str(date_data)
Date[1:1], format: "2024-11-30"
# Using lubridate
<- as_date("2024-11-30")
date_data2 str(date_data2)
Date[1:1], format: "2024-11-30"
For non-standard formats, use lubridate
functions ymd()
, dmy()
, and mdy()
.
# DD-MM-YYYY
<- dmy("30-11-2024")
date_data3 str(date_data3)
Date[1:1], format: "2024-11-30"
# MM-DD-YY
<- mdy("11-30-2024")
date_data4 str(date_data4)
Date[1:1], format: "2024-11-30"
Correctly formatted date variables allow for operations such as:
Date calculation: adding days to a date.
+ 7 date_data4
[1] "2024-12-07"
Extracting date components usch as month and year.
# Extract month month(date_data4)
[1] 11
# Extract year year(date_data4)
[1] 2024
To demonstrate dates in a dataset, we’ll recreate data_messed_up
with a date column.
# Create a data frame with mixed-up variable types
<- data.frame(
data_messed_up id = as.character(1:6),
score = as.character(sample(1:100, 6)),
gender = rep(c("Male", "Female"), length.out = 6),
date = as.character(Sys.Date() - 1:6)
)
# View variable type
str(data_messed_up)
'data.frame': 6 obs. of 4 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : chr "92" "9" "93" "72" ...
$ gender: chr "Male" "Female" "Male" "Female" ...
$ date : chr "2024-12-07" "2024-12-06" "2024-12-05" "2024-12-04" ...
The date
column is recognised as a character. To convert date
to a date format, use as_date()
.
# Change variable types for score, gender, and date
<-
data_messed_up %>%
data_messed_up mutate(score = as.numeric(score),
gender = as.factor(gender),
date = as_date(date))
# View variable type
str(data_messed_up)
'data.frame': 6 obs. of 4 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 92 9 93 72 26 7
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
$ date : Date, format: "2024-12-07" "2024-12-06" ...
The simplest solution for the date issue is to make sure we properly input the date according to the right format (YYYY-MM-DD
) during the data collection process properly.
5.9 Merge datasets
If you collect data from different sources, you may need to combine datasets by row or column. rbind()
combines datasets by row, while cbind()
combines them by column.
Let us see how to combine two iris datasets.
# Data collected from area A
<- iris
data1
# Data collected from area B
<- iris
data2
# Combine both datasets by a row
<- rbind(data1, data2) data_combined_row
Now, we can check the dimensions of the data. The first element represents a row and the second element represents the column.
# Dimension of data1
dim(data1)
[1] 150 5
# Dimension of data2
dim(data2)
[1] 150 5
# Dimension of the combined data
dim(data_combined_row)
[1] 300 5
We can see that the rows of data1
and data2
each are 150. Combining both data by a row gives us 300 rows. It is to be noted that to use rbind()
, both data should have the same column numbers and names. Additionally, rbind()
is not limited to two data only.
Next, let us see the cbind()
. Using the same data, we can combine both data by a column.
# Combine both datasets by a column
<- cbind(data1, data2) data_combined_col
Thus, by further checking the dimension, we can see that the total column of data_combined_col
is 10, which is the sum of 5 columns in each of the data1
and data2
, while the row remained the same.
# Dimension of data1
dim(data1)
[1] 150 5
# Dimension of data2
dim(data2)
[1] 150 5
# Dimension of the combined data
dim(data_combined_col)
[1] 150 10
However, similar to rbind()
, to use the cbind()
, the rows of both data should be identical. If participant A is in the first row of data1
, the, in data2
, participant A should also be in the first row.
There is another set of functions that is more efficient than cbind()
, in which we can combine two or more datasets according to the id. Let us create two datasets that are related and have the same ID.
# Set seed for reproducibility
set.seed(123)
# Create the first half of the data
<- data.frame(
data_half1 name = c("Ahmad", "Ali", "Cheng", "Rama", "Wei"),
height_cm = sample(160:180, 5, replace = FALSE)
)
# Create the second half of the data
<- data.frame(
data_half2 name = c("Ahmad", "Ali", "Cheng", "Rama", "Karim"),
weight_kg = sample(70:90, 5, replace = FALSE)
)
# The first dataset
data_half1
name height_cm
1 Ahmad 174
2 Ali 178
3 Cheng 173
4 Rama 162
5 Wei 169
# The second dataset
data_half2
name weight_kg
1 Ahmad 87
2 Ali 80
3 Cheng 74
4 Rama 83
5 Karim 88
Notice that the last row of both datasets is not similar. To combine both datasets, we can use either left_join()
or right_join()
. Both produce the same result. If there is a mismatch between the two datasets, left_join()
will keep the first dataset as a reference, and any of the id of the second dataset that does not match the first one will be removed. right_join()
works similar to the left_join()
but in the opposite.
# Combine the data
<-
data_full_left %>% #first datasets
data_half1 left_join(data_half2, by = "name") #second datasets
# View the combined data
data_full_left
name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Wei 169 NA
We can see that karim
in the second dataset is excluded. Let’s see what will happen if we use right_join()
.
# Combine the data
<-
data_full_right %>% #first datasets
data_half1 right_join(data_half2, by = "name") #second datasets
# View the combined data
data_full_right
name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Karim NA 88
We can see that Wei
in the first dataset is excluded. To include all the participants, despite the mismatch of the id is by using full_join()
.
# Combine the data
<-
data_full %>%
data_half1 full_join(data_half2, by = "name")
# View the combined data
data_full
name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Wei 169 NA
6 Karim NA 88
Both Wei
and Karim
are kept in the combined dataset. Additionally, there is inner_join()
, which will exclude both Wei
and Karim
. This function keeps the data that the name
is present in both datasets only.
# Combine the data
<-
data_full_inner %>%
data_half1 inner_join(data_half2, by = "name")
# View the combined data
data_full_inner
name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5.10 Chapter summary
In this chapter, we covered the most common and basic operations in data wrangling. More operations were not covered in this chapter as this book is intended for beginners.
To summarise, we have covered:
- How to select a column and a row
- How to filter a dataset based on a condition applied to columns
- How to rename and create a column
- How to manage variable types
- How to combine several datasets into one
5.11 Revision
Load
mtcars
dataset in R.head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Read about
mtcars
.?mtcars
How many cars have
mpg
> 25?How many cars have
mpg
> 25 and a number of carburetors of 2?How many cars have a V-shaped engine?
Change
vs
andam
into a factor.Change the wide format data below into a long or tidy format.
# Set seed for reproducibility set.seed(123) # Create a wide format data <- data.frame( data_wide id = 1:10, time1 = sample(1:100, 10, replace = TRUE), time2 = sample(1:100, 10, replace = TRUE), time3 = sample(1:100, 10, replace = TRUE), age = sample(18:80, 10, replace = TRUE) ) # View the data frame head(data_wide)
id time1 time2 time3 age 1 1 31 90 26 32 2 2 79 91 7 49 3 3 51 69 42 59 4 4 14 91 9 62 5 5 67 57 83 24 6 6 42 92 36 26
The result of the long format data should appear like this.
id age time_points time_minute 1 32 time1 31 1 32 time2 90 1 32 time3 26 2 49 time1 79 2 49 time2 91 2 49 time3 7 3 59 time1 51 3 59 time2 69 3 59 time3 42 4 62 time1 14 4 62 time2 91 4 62 time3 9 5 24 time1 67 5 24 time2 57 5 24 time3 83 6 26 time1 42 6 26 time2 92 6 26 time3 36 7 58 time1 50 7 58 time2 9 7 58 time3 78 8 27 time1 43 8 27 time2 93 8 27 time3 81 9 40 time1 14 9 40 time2 99 9 40 time3 43 10 44 time1 25 10 44 time2 72 10 44 time3 76