library(tidyverse)5 Data wrangling
“Without clean data, or clean enough data, your data science is worthless.”
–Michael Stonebraker
Data wrangling refers to the process of transforming and preparing raw data into a clean and structured format suitable for analysis. This involves various steps, such as data cleaning, reshaping, merging, and filtering, to ensure the dataset is ready for statistical analysis or visualisation.
In most cases, data wrangling can occupy a significant portion of the time required for data analysis, often more than other stages. In this chapter, we will explore common operations in data wrangling. You will learn how to perform these operations using either base R functions, tidyverse functions, or both.
5.1 Load packages
The following packages will be used in this chapter. Please run these lines of code before proceeding to other sections.
5.2 Indexing
Indexing involves selecting specific elements within data structures, which can be done using [].
5.2.1 Vector
First, let’s create a vector.
vec_data <- 1:10
vec_data [1] 1 2 3 4 5 6 7 8 9 10
To select specific elements:
# Select a single element
vec_data[2][1] 2
# Select 3rd and 6th element
vec_data[c(3,6)][1] 3 6
5.2.2 Data frame
We will use iris, a built-in dataset in R, to demonstrate indexing within a data frame. Detailed information about this dataset can be accessed by typing ?iris in the Console.
?irisBelow is a summary of this dataset.
summary(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
We will use [], where the general syntax is [row, column].
# Selecting the 1st row
iris[1, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
# Selecting the 1st and 2nd row
iris[c(1, 2), ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
Now let’s see how to index columns.
# Selecting the 1st row
iris[, 1]
# Selecting the 1st and 2nd row
iris[, c(1, 2)][1] 5.1 4.9 4.7 4.6 5.0 5.4
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
By default, this shows the entire column, but here we’ll limit the output to the first six items for clarity.
Instead of numbers, we can also use column names:
# Selecting the 1st row
iris[, "Sepal.Length"]
# Selecting the 1st and 2nd row
iris[, c("Sepal.Length", "Sepal.Width")][1] 5.1 4.9 4.7 4.6 5.0 5.4
Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
To select both a specific row and column, we can combine what we’ve learned. For example, to select the first row and the first column:
# Approach 1
iris[1, 1][1] 5.1
# Approach 2
iris[1, "Sepal.Length"][1] 5.1
To select the first five rows and the first two columns:
# Approach 1
iris[1:5, 1:2] Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
# Approach 2
iris[1:5, c("Sepal.Length", "Sepal.Width")] Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
For selecting a single column, an easier approach is to use $:
iris$Petal.Length[1] 1.4 1.4 1.3 1.5 1.4 1.7
5.2.3 Selecting and slicing
In R, there are many ways to perform tasks. Rather than using [], dplyr provides select() and slice(), which are often preferred for their readability. The dplyr package is part of the tidyverse.
The select() function is used to choose specific columns.
# Select a single column
iris %>%
select(Sepal.Length) Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
# Select several columns
iris %>%
select(Sepal.Length, Sepal.Width) Sepal.Length Sepal.Width
1 5.1 3.5
2 4.9 3.0
3 4.7 3.2
4 4.6 3.1
5 5.0 3.6
6 5.4 3.9
Similarly, slice() is used to extract specific rows.
# Select a single row
iris %>%
slice(100) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 2.8 4.1 1.3 versicolor
# Select several rows
iris %>%
slice(2, 5, 100) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 setosa
2 5.0 3.6 1.4 0.2 setosa
3 5.7 2.8 4.1 1.3 versicolor
By combining both select() and slice(), we can access specific rows and columns.
iris %>%
select(Sepal.Length) %>%
slice(1:5) Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
5.3 Filtering
Filtering allows us to select rows based on a condition. For example, if we want to filter the Species column for the value "setosa" in the iris dataset, we start by creating an index:
ind <- iris$Species == "setosa"Next, we apply the index to the dataset.
iris[ind, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The == symbol is a logical operator. Table 5.1 presents the most common logical operators in R.
| Operators | Description |
|---|---|
| < | Less than |
| > | Greater than |
| <= | Less than or equal to |
| >= | Greater than or equal to |
| == | Equal to |
| != | Not equal to |
Alternatively, we can use filter() from dplyr for the same result.
iris %>%
filter(Species == "setosa") Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
filter() is often more readable, especially for beginners, though both methods yield the same output.
Additionally, we can combine multiple conditions using | (or) and & (and). For instance, to filter the iris dataset for rows where:
Speciesis"setosa", andSepal.Lengthis greater than 5.6 cm:
# Make an index
ind2 <- iris$Species == "setosa" & iris$Sepal.Length > 5.6
# Apply the index to the dataset
iris[ind2, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
19 5.7 3.8 1.7 0.3 setosa
Or, with filter():
iris %>%
filter(Species == "setosa" & Sepal.Length > 5.6) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.7 4.4 1.5 0.4 setosa
3 5.7 3.8 1.7 0.3 setosa
5.4 Sorting
Sorting arranges the rows of a data frame by specific column values.
To demonstrate, let’s limit iris to its first ten rows for easier display. Using head() will give the top ten rows, while tail() provides the last rows.
iris_top10 <- iris %>% head(10)
iris_top10 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
To sort by a single column, use sort(). Setting decreasing = FALSE sorts in ascending order.
sort(iris_top10$Sepal.Length, decreasing = FALSE) [1] 4.4 4.6 4.6 4.7 4.9 4.9 5.0 5.0 5.1 5.4
Another useful function is order(), which returns the indices and can arrange the entire dataset.
# Make an index
ind3 <- order(iris_top10$Sepal.Length, decreasing = FALSE)
# Apply the index to the dataset
iris_top10[ind3, ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species
9 4.4 2.9 1.4 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
7 4.6 3.4 1.4 0.3 setosa
3 4.7 3.2 1.3 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
5 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
1 5.1 3.5 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
With tidyverse, we have arrange(), equivalent to order() in base R:
iris_top10 %>%
arrange(Sepal.Length) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.4 2.9 1.4 0.2 setosa
2 4.6 3.1 1.5 0.2 setosa
3 4.6 3.4 1.4 0.3 setosa
4 4.7 3.2 1.3 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 5.0 3.6 1.4 0.2 setosa
8 5.0 3.4 1.5 0.2 setosa
9 5.1 3.5 1.4 0.2 setosa
10 5.4 3.9 1.7 0.4 setosa
To sort in descending order, use desc():
iris_top10 %>%
arrange(desc(Sepal.Length)) Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.4 3.9 1.7 0.4 setosa
2 5.1 3.5 1.4 0.2 setosa
3 5.0 3.6 1.4 0.2 setosa
4 5.0 3.4 1.5 0.2 setosa
5 4.9 3.0 1.4 0.2 setosa
6 4.9 3.1 1.5 0.1 setosa
7 4.7 3.2 1.3 0.2 setosa
8 4.6 3.1 1.5 0.2 setosa
9 4.6 3.4 1.4 0.3 setosa
10 4.4 2.9 1.4 0.2 setosa
5.5 Rename
First, let’s check the column names.
colnames(iris_top10)[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Suppose we want to rename the species column to type.
colnames(iris_top10)[5] <- "type"Now, if we check the column names again:
colnames(iris_top10)[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "type"
Alternatively, we can use the rename() function from dplyr. Here, we will rename Sepal.Length to LengthofSepal.
# Change the column name
iris_top10 <-
iris_top10 %>%
rename(LengthofSepal = "Sepal.Length")
# Check the column names
colnames(iris_top10)[1] "LengthofSepal" "Sepal.Width" "Petal.Length" "Petal.Width"
[5] "type"
5.6 Create new column
Let’s create a smaller iris dataset first.
iris_bottom10 <- tail(iris, 10)Using base R functions, we can create a new column as follows:
iris_bottom10$Sepal.Lengthx2 <- iris_bottom10$Sepal.Length * 2
head(iris_bottom10) Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Lengthx2
141 6.7 3.1 5.6 2.4 virginica 13.4
142 6.9 3.1 5.1 2.3 virginica 13.8
143 5.8 2.7 5.1 1.9 virginica 11.6
144 6.8 3.2 5.9 2.3 virginica 13.6
145 6.7 3.3 5.7 2.5 virginica 13.4
146 6.7 3.0 5.2 2.3 virginica 13.4
Here, we create a new variable Sepal.Lengthx2 by multiplying Sepal.Length by 2. Using tidyverse, we can achieve this with mutate().
iris_bottom10 <-
iris_bottom10 %>%
mutate(Sepal.Widthx2 = Sepal.Width * 2)
head(iris_bottom10) Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Lengthx2
141 6.7 3.1 5.6 2.4 virginica 13.4
142 6.9 3.1 5.1 2.3 virginica 13.8
143 5.8 2.7 5.1 1.9 virginica 11.6
144 6.8 3.2 5.9 2.3 virginica 13.6
145 6.7 3.3 5.7 2.5 virginica 13.4
146 6.7 3.0 5.2 2.3 virginica 13.4
Sepal.Widthx2
141 6.2
142 6.2
143 5.4
144 6.4
145 6.6
146 6.0
5.7 Change data format
Data can be structured in two primary formats:
- Long format: Each row represents a single observation.
- Wide format: Each subject has one row, with variables across columns.
In long format, each row represents a single observation. This format is commonly used for data manipulation and analysis. Let’s create an example of a long-format dataset. Here, five participants were given a dietary supplement to reduce weight, and the data contains:
id: participant IDtime: either pre- or post-supplementweight: participant’s weight
# Set seed for reproducibility
set.seed(123)
# Create a long format data
data_long <- data.frame(
id = rep(1:5, each = 2),
time = rep(c("Pre", "Post"), 5),
weight = sample(x = 60:80, size = 10, replace = TRUE)
)
# View the data
data_long id time weight
1 1 Pre 74
2 1 Post 78
3 2 Pre 73
4 2 Post 62
5 3 Pre 69
6 3 Post 77
7 4 Pre 70
8 4 Post 64
9 5 Pre 79
10 5 Post 73
In wide format, each subject has a single row, with each measurement in a separate column. Here is data_long converted to wide format.
# Set seed for reproducibility
set.seed(123)
# Create a wide format data
data_wide <- data.frame(
id = 1:5,
Pre = sample(60:80, 5, replace = TRUE),
Post = sample(60:80, 5, replace = TRUE)
)
# View the data
data_wide id Pre Post
1 1 74 77
2 2 78 70
3 3 73 64
4 4 62 79
5 5 69 73
In the wide format, data is often easier to interpret. Converting between long and wide formats is simple in R. To transform data_wide into long format, use pivot_longer():
data_long2 <-
data_wide %>%
pivot_longer(cols = 2:3, names_to = "time", values_to = "weight")
data_long2# A tibble: 10 × 3
id time weight
<int> <chr> <int>
1 1 Pre 74
2 1 Post 77
3 2 Pre 78
4 2 Post 70
5 3 Pre 73
6 3 Post 64
7 4 Pre 62
8 4 Post 79
9 5 Pre 69
10 5 Post 73
For pivot_longer(), we need to supply three arguments:
cols: columns to be changed into a long format excluding the id columnnames_to: name of a new column which consist of column names from a wide format datavalues_to: name of a new column which consist of values fromcols
To convert data_long2 back to wide format, use pivot_wider():
data_wide2 <-
data_long2 %>%
pivot_wider(id_cols = "id", names_from = "time", values_from = "weight")
data_wide2# A tibble: 5 × 3
id Pre Post
<int> <int> <int>
1 1 74 77
2 2 78 70
3 3 73 64
4 4 62 79
5 5 69 73
To use pivot_wider(), the three basic arguments needed are:
id_cols: ID columnnames_from: name of a column to get the column names for the wide datavalues_from: name of a column to get the values from
Both functions have additional arguments detailed in the Help pane (use ? in front of the function name).
5.8 Change variable type
Here are the main variable types in R:
| Variables | Examples |
|---|---|
| Integer | 100, 77 |
| Numeric | 100.2, 77.8 |
| Character | “hello”, “ahmad” |
| Logical | TRUE, FALSE |
| Factor | “male”, “female” |
| Date | 9/7/2024, 9 July 2024 |
The most common types in data are numeric, factor, and date. When importing data from software such as SPSS, STATA, or Excel, R may not always recognise the correct types. Let’s create a sample dataset and explore handling these issues.
# Create a data frame with mixed-up variable types
data_messed_up <- data.frame(
id = as.character(1:6),
score = as.character(sample(1:100, 6)),
gender = rep(c("Male", "Female"), length.out = 6)
)
# View the variable types
str(data_messed_up)'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : chr "25" "90" "91" "69" ...
$ gender: chr "Male" "Female" "Male" "Female" ...
# View the data
data_messed_up id score gender
1 1 25 Male
2 2 90 Female
3 3 91 Male
4 4 69 Female
5 5 98 Male
6 6 57 Female
We can see that score should be numeric and gender a factor. To convert these types, use as.numeric() and as.factor(). Let’s create a copy of data_messed_up for demonstration purposes.
data_messed_up2 <- data_messed_upUsing base R functions, we can adjust the types:
# Change score column to numeric
data_messed_up$score <- as.numeric(data_messed_up$score)
# Change gender column to factor
data_messed_up$gender <- as.factor(data_messed_up$gender)
# View variable type
str(data_messed_up)'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 25 90 91 69 98 57
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
Using tidyverse, we can achieve the same result with mutate():
# Change variable types for score and gender
data_messed_up2 <-
data_messed_up2 %>%
mutate(score = as.numeric(score),
gender = as.factor(gender))
# View variable type
str(data_messed_up2)'data.frame': 6 obs. of 3 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 25 90 91 69 98 57
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
5.8.1 Handling date
Dates can be tricky. For date variables, the standard format is YYYY-MM-DD. Using lubridate, we can work with various date formats easily. Let’s look at a few examples:
# Using base R
date_data <- as.Date("2024-11-30")
str(date_data) Date[1:1], format: "2024-11-30"
# Using lubridate
date_data2 <- as_date("2024-11-30")
str(date_data2) Date[1:1], format: "2024-11-30"
For non-standard formats, use lubridate functions ymd(), dmy(), and mdy().
# DD-MM-YYYY
date_data3 <- dmy("30-11-2024")
str(date_data3) Date[1:1], format: "2024-11-30"
# MM-DD-YY
date_data4 <- mdy("11-30-2024")
str(date_data4) Date[1:1], format: "2024-11-30"
Correctly formatted date variables allow for operations such as:
Date calculation: adding days to a date.
date_data4 + 7[1] "2024-12-07"Extracting date components usch as month and year.
# Extract month month(date_data4)[1] 11# Extract year year(date_data4)[1] 2024
To demonstrate dates in a dataset, we’ll recreate data_messed_up with a date column.
# Create a data frame with mixed-up variable types
data_messed_up <- data.frame(
id = as.character(1:6),
score = as.character(sample(1:100, 6)),
gender = rep(c("Male", "Female"), length.out = 6),
date = as.character(Sys.Date() - 1:6)
)
# View variable type
str(data_messed_up)'data.frame': 6 obs. of 4 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : chr "92" "9" "93" "72" ...
$ gender: chr "Male" "Female" "Male" "Female" ...
$ date : chr "2025-11-23" "2025-11-22" "2025-11-21" "2025-11-20" ...
The date column is recognised as a character. To convert date to a date format, use as_date().
# Change variable types for score, gender, and date
data_messed_up <-
data_messed_up %>%
mutate(score = as.numeric(score),
gender = as.factor(gender),
date = as_date(date))
# View variable type
str(data_messed_up)'data.frame': 6 obs. of 4 variables:
$ id : chr "1" "2" "3" "4" ...
$ score : num 92 9 93 72 26 7
$ gender: Factor w/ 2 levels "Female","Male": 2 1 2 1 2 1
$ date : Date, format: "2025-11-23" "2025-11-22" ...
The simplest solution for the date issue is to make sure we properly input the date according to the right format (YYYY-MM-DD) during the data collection process properly.
5.9 Merge datasets
If you collect data from different sources, you may need to combine datasets by row or column. rbind() combines datasets by row, while cbind() combines them by column.
Let us see how to combine two iris datasets.
# Data collected from area A
data1 <- iris
# Data collected from area B
data2 <- iris
# Combine both datasets by a row
data_combined_row <- rbind(data1, data2)Now, we can check the dimensions of the data. The first element represents a row and the second element represents the column.
# Dimension of data1
dim(data1)[1] 150 5
# Dimension of data2
dim(data2)[1] 150 5
# Dimension of the combined data
dim(data_combined_row)[1] 300 5
We can see that the rows of data1 and data2 each are 150. Combining both data by a row gives us 300 rows. It is to be noted that to use rbind(), both data should have the same column numbers and names. Additionally, rbind() is not limited to two data only.
Next, let us see the cbind(). Using the same data, we can combine both data by a column.
# Combine both datasets by a column
data_combined_col <- cbind(data1, data2)Thus, by further checking the dimension, we can see that the total column of data_combined_col is 10, which is the sum of 5 columns in each of the data1 and data2, while the row remained the same.
# Dimension of data1
dim(data1)[1] 150 5
# Dimension of data2
dim(data2)[1] 150 5
# Dimension of the combined data
dim(data_combined_col)[1] 150 10
However, similar to rbind(), to use the cbind(), the rows of both data should be identical. If participant A is in the first row of data1, the, in data2, participant A should also be in the first row.
There is another set of functions that is more efficient than cbind(), in which we can combine two or more datasets according to the id. Let us create two datasets that are related and have the same ID.
# Set seed for reproducibility
set.seed(123)
# Create the first half of the data
data_half1 <- data.frame(
name = c("Ahmad", "Ali", "Cheng", "Rama", "Wei"),
height_cm = sample(160:180, 5, replace = FALSE)
)
# Create the second half of the data
data_half2 <- data.frame(
name = c("Ahmad", "Ali", "Cheng", "Rama", "Karim"),
weight_kg = sample(70:90, 5, replace = FALSE)
)
# The first dataset
data_half1 name height_cm
1 Ahmad 174
2 Ali 178
3 Cheng 173
4 Rama 162
5 Wei 169
# The second dataset
data_half2 name weight_kg
1 Ahmad 87
2 Ali 80
3 Cheng 74
4 Rama 83
5 Karim 88
Notice that the last row of both datasets is not similar. To combine both datasets, we can use either left_join() or right_join(). Both produce the same result. If there is a mismatch between the two datasets, left_join() will keep the first dataset as a reference, and any of the id of the second dataset that does not match the first one will be removed. right_join() works similar to the left_join() but in the opposite.
# Combine the data
data_full_left <-
data_half1 %>% #first datasets
left_join(data_half2, by = "name") #second datasets
# View the combined data
data_full_left name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Wei 169 NA
We can see that karim in the second dataset is excluded. Let’s see what will happen if we use right_join().
# Combine the data
data_full_right <-
data_half1 %>% #first datasets
right_join(data_half2, by = "name") #second datasets
# View the combined data
data_full_right name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Karim NA 88
We can see that Wei in the first dataset is excluded. To include all the participants, despite the mismatch of the id is by using full_join().
# Combine the data
data_full <-
data_half1 %>%
full_join(data_half2, by = "name")
# View the combined data
data_full name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5 Wei 169 NA
6 Karim NA 88
Both Wei and Karim are kept in the combined dataset. Additionally, there is inner_join(), which will exclude both Wei and Karim. This function keeps the data that the name is present in both datasets only.
# Combine the data
data_full_inner <-
data_half1 %>%
inner_join(data_half2, by = "name")
# View the combined data
data_full_inner name height_cm weight_kg
1 Ahmad 174 87
2 Ali 178 80
3 Cheng 173 74
4 Rama 162 83
5.10 Chapter summary
In this chapter, we covered the most common and basic operations in data wrangling. More operations were not covered in this chapter as this book is intended for beginners.
To summarise, we have covered:
- How to select a column and a row
- How to filter a dataset based on a condition applied to columns
- How to rename and create a column
- How to manage variable types
- How to combine several datasets into one
5.11 Revision
Load
mtcarsdataset in R.head(mtcars)mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1Read about
mtcars.?mtcarsHow many cars have
mpg> 25?How many cars have
mpg> 25 and a number of carburetors of 2?How many cars have a V-shaped engine?
Change
vsandaminto a factor.Change the wide format data below into a long or tidy format.
# Set seed for reproducibility set.seed(123) # Create a wide format data data_wide <- data.frame( id = 1:10, time1 = sample(1:100, 10, replace = TRUE), time2 = sample(1:100, 10, replace = TRUE), time3 = sample(1:100, 10, replace = TRUE), age = sample(18:80, 10, replace = TRUE) ) # View the data frame head(data_wide)id time1 time2 time3 age 1 1 31 90 26 32 2 2 79 91 7 49 3 3 51 69 42 59 4 4 14 91 9 62 5 5 67 57 83 24 6 6 42 92 36 26The result of the long format data should appear like this.
id age time_points time_minute 1 32 time1 31 1 32 time2 90 1 32 time3 26 2 49 time1 79 2 49 time2 91 2 49 time3 7 3 59 time1 51 3 59 time2 69 3 59 time3 42 4 62 time1 14 4 62 time2 91 4 62 time3 9 5 24 time1 67 5 24 time2 57 5 24 time3 83 6 26 time1 42 6 26 time2 92 6 26 time3 36 7 58 time1 50 7 58 time2 9 7 58 time3 78 8 27 time1 43 8 27 time2 93 8 27 time3 81 9 40 time1 14 9 40 time2 99 9 40 time3 43 10 44 time1 25 10 44 time2 72 10 44 time3 76