03. Data manipulation
Description
Data manipulation using R refers to the process of transforming, cleaning, aggregating, and reorganizing data to extract valuable insights or prepare it for further analysis. This article mainly talk about several packages for data processing using R.
Base package
# take rows with 4 cylinders or less and select only columns "mpg" and "hp"
df_small <- df[df$cyl <= 4, c("mpg", "hp")]
head(df_small, 5) # check top 5 lines of the new data frame
# take rows with 4 cylinders or less and select only columns "mpg" and "hp"
df_small <- df[df$cyl <= 4, c("mpg", "hp")]
head(df_small, 5) # check top 5 lines of the new data frame
## mpg hp
## Datsun 710 22.8 93
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Fiat 128 32.4 66
## Honda Civic 30.4 52
# the same result can be achieved by function "subset()"
df_small <- subset(df, subset = df$cyl <= 4, select = c("mpg", "hp"))
head(df_small, 5) # check top 5 lines of the new data frame
## mpg hp
## Datsun 710 22.8 93
## Merc 240D 24.4 62
## Merc 230 22.8 95
## Fiat 128 32.4 66
## Honda Civic 30.4 52
# adding new columns to the data frame
df$model <- rownames(df) # store rownames as a column
head(df, 5)
# select all rows but only some columns
x1 <- df[ , -c(1:8)] # select all columns except first 8 columns
x2 <- df[ , c("mpg", "hp", "cyl", "model")] # select four columns with given names
# combine or join together two data frames by a common variable
df_new <- merge(x1, x2, by = "model")
head(df_new, 5)
## model am gear carb mpg hp cyl
## 1 AMC Javelin 0 3 2 15.2 150 8
## 2 Cadillac Fleetwood 0 3 4 10.4 205 8
## 3 Camaro Z28 0 3 4 13.3 245 8
## 4 Chrysler Imperial 0 3 4 14.7 230 8
## 5 Datsun 710 1 4 1 22.8 93 4
# group by two variables and make aggregation for other two variables
aggregate(cbind(mpg, hp) ~ am + cyl, data = df_new, FUN = "mean")
## am cyl mpg hp
## 1 0 4 22.90000 84.66667
## 2 1 4 28.07500 81.87500
## 3 0 6 19.12500 115.25000
## 4 1 6 20.56667 131.66667
## 5 0 8 15.05000 194.16667
## 6 1 8 15.40000 299.50000
# if you need to summarise only one variable, even tapply will do the job
# of grouping and making aggregation
tapply(mtcars$mpg, mtcars[, c("cyl", "am")], mean)
## am
##cyl 0 1
## 4 22.900 28.07500
## 6 19.125 20.56667
## 8 15.050 15.40000
Package Tidyverse
The tidyverse
is a collection of packages that use the same approach to data manipulation in R and allows to do many things in a better and more efficient way. You can install a package tidyverse
and get a set of 8 core packages, or you can install and then load each package individually when you need it.
You had already seen one package from tidyverse
collection - it was ready for reading and writing csv files. Later, you will get ggplot2 - the best tool for data visualisation, stringr for working with text and some others. This lecture topic is mostly about data frames transformation and package dplyr
.
dplyr
First is first, a website for dplyr package - https://dplyr.tidyverse.org/ You can find there an introduction into dplyr
and (the most important!) a cheat sheet with its main functions.
Main functions
# load the library before you can use it
# and suppress messages as there are too many of them
suppressMessages(library(dplyr))
# to pick rows by given criteria
# take only rows with number of cylinders equal 4
df_small <- filter(mtcars, cyl == 4)
head(df_small)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# to reorder rows by the values in one or multiple variables
# arrange rows by the value of mpg in the ascending order (from smallest to largest)
arrange(df_small, mpg)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# arrange rows by the value of mpg in the descending order (from largest to smallest)
# with the help of function desc()
arrange(df_small, desc(mpg))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
# to pick variables/columns by their names
# select only some columns (cyl, hp, mpg) from the original data frame
df_narrow <- select(mtcars, cyl, hp, mpg) # columns appear in this order
head(df_narrow)
## cyl hp mpg
## Mazda RX4 6 110 21.0
## Mazda RX4 Wag 6 110 21.0
## Datsun 710 4 93 22.8
## Hornet 4 Drive 6 110 21.4
## Hornet Sportabout 8 175 18.7
## Valiant 6 105 18.1
# to create new variables
# create a new column and set its values to rows names
# then create another column with miles per gallon per one horse power
df <- mutate(mtcars, model = rownames(mtcars), mpghp = mpg / hp)
# check the results and see two new columns
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb model mpghp
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 0.1909091
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 0.1909091
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 0.2451613
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive 0.1945455
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout 0.1068571
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant 0.1723810
# to aggregate multiple values to a single summary
# aggreagate all rows in to a single statistical summary
# n is a number of observations
# mu_mpg and sd_mpg are mean and standard deviations of miles per gallon
# median_hp is a median of horse power variable
# there can be any other summaries for any other variable
summarise(df, n = n(), mu_mpg = mean(mpg), sd_mpg = sd(mpg), median_hp = median(hp))
## n mu_mpg sd_mpg median_hp
## 1 32 20.09062 6.026948 123
# to change the scope of the above functions from the full data set into multiple groups organised by values in one or multiple variables.
# All function above worked with the full data set. However, you might want to split
# the data frame into some groups and then run the function for each group separately.
# Function group_by() would be handy here
# group the data frame by two variables at the same time - "cyl" and "am"
df <- group_by(mtcars, cyl, am)
# aggreagate all rows in the each group [!] in to a single statistical summary
# hence there is a row with a summary for each group
summarise(df, n = n(), mu_mpg = mean(mpg), sd_mpg = sd(mpg), median_hp = median(hp))
## # A tibble: 6 × 6
## # Groups: cyl [3]
## cyl am n mu_mpg sd_mpg median_hp
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 4 0 3 22.9 1.45 95
## 2 4 1 8 28.1 4.48 78.5
## 3 6 0 4 19.1 1.63 116.
## 4 6 1 3 20.6 0.751 110
## 5 8 0 12 15.0 2.77 180
## 6 8 1 2 15.4 0.566 300.
comprehensive use
# group the data frame in three groups by a number of cylinders
df <- group_by(mtcars, cyl)
# create a new column with miles per gallon
# per the average [!] horse power in the corresponding group
df <- mutate(df, mpghp = mpg / mean(hp))
# check top 5 rows of the resulted data frame
head(df, 5)
## # A tibble: 5 × 12
## # Groups: cyl [3]
## mpg cyl disp hp drat wt qsec vs am gear carb mpghp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 0.172
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 0.172
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 0.276
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 0.175
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 0.0894
df <- select(mtcars, cyl, am, mpg, hp) # select only 4 columns to make more compact example
df <- group_by(df, cyl, am) # group a data frame by two variables
df <- arrange(df, mpg, .by_group = TRUE) # sort every group by value of "mpg"
head(df, 12)
## # A tibble: 12 × 4
## # Groups: cyl, am [3]
## cyl am mpg hp
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 21.5 97
## 2 4 0 22.8 95
## 3 4 0 24.4 62
## 4 4 1 21.4 109
## 5 4 1 22.8 93
## 6 4 1 26 91
## 7 4 1 27.3 66
## 8 4 1 30.4 52
## 9 4 1 30.4 113
## 10 4 1 32.4 66
## 11 4 1 33.9 65
## 12 6 0 17.8 123
summarise(df, mpg_mu = mean(mpg), hp_median = median(hp)) # summary for each group
## # A tibble: 6 × 4
## # Groups: cyl [3]
## cyl am mpg_mu hp_median
## <dbl> <dbl> <dbl> <dbl>
## 1 4 0 22.9 95
## 2 4 1 28.1 78.5
## 3 6 0 19.1 116.
## 4 6 1 20.6 110
## 5 8 0 15.0 180
## 6 8 1 15.4 300.
summarise(ungroup(df), # data frame is ungrouped
mpg_mu = mean(mpg), hp_median = median(hp)) # summary for a full data frame
## # A tibble: 1 × 2
## mpg_mu hp_median
## <dbl> <dbl>
## 1 20.1 123
# Functions group_by() and summarise() are two most common functions
# to go together and they make a replacement for the function aggregate()
# presented in the beginning but with much better performance for large
# data sets and much higher flexibility.
# group by two variable and make aggregation for two other variables
aggregate(cbind(mpg, hp) ~ am + cyl, data = mtcars, FUN = "mean")
# exact replication of the above example with aggregate()
temp <- group_by(mtcars, cyl, am)
summarise(temp, mpg = mean(mpg), hp = mean(hp))
The syntax of all functions is the same. The first element is a data frame. The subsequent arguments are instructions on what to do with the data using variable names without quotation marks. The result of these functions is a data frame too.
Fine print note: in reality, the result of all functions in dplyr is not always a data frame, it can be an object of class tibble introduced by the package with the same name from the tidyverse collection. Tibble is an enhanced version of a data frame and from the practical point you can think about it and treat it as a data frame. Also, for some functions the result might be of a different data type. However, in the most situations dplyr is about data frames getting in and out.
Piping
The data frame which used to be the first parameter of all dplyr
functions disappeared. The reason is that whatever object you pipe into a function, becomes the first parameter to take variables from and to apply functions.
The original data frame can be assessed by every function even if it is not mentioned explicitly. You can do it by operator (.) - just dot. In the example below, there is a new variable created with the number of rows in the original data frame.
mtcars %>% # take the data frame for piping
select(mpg, hp) %>% # select two variables - just to make an output shorter
mutate(nn = nrow(.)) %>% # create a new column with the value of rows count
head(5) # see top 5 rows of the resulted data frame
## mpg hp nn
## Mazda RX4 21.0 110 32
## Mazda RX4 Wag 21.0 110 32
## Datsun 710 22.8 93 32
## Hornet 4 Drive 21.4 110 32
## Hornet Sportabout 18.7 175 32
# Function nrow() is a part of the base package and it would not know
# how to work in dplyr environment. Operator (.) makes a bridge between
# base and dplyr and allows to assess the data frame used as the “original”
# first attribute in the function.
Piping is available in many programming languages and it was introduced in R by packages magrittr
and pipeR
. Package dplyr
has only one piping operator %>%
. If you check the other two packages, you find several other piping operators. For example, * %>>%
claimed to be “somewhat” quicker and has multiple “alternative” ways to use piping. Check the help file by help("%>>%")
after loading package pipeR
. * %T>%
Tee operator to split the data flow and put the same data frame into two different functions. * %$%
Exposition operator to pipe not a full data set but allow the following function to access (or expose) individual variables from it.
library(magrittr)
mtcars %>% # data frame to start with
select(mpg, hp) %T>% # select two columns of interest only and pipe them
plot(.) %>% # in both: function plot() for plotting and
colMeans() # function colMeans() to get averages
# "expose" columns from the data frame as separate variables
# to use with a basic function
mtcars %$% cor(mpg, hp)
# similar task with dplyr functionality results in a matrx
# as cor() takes all possible correlations in the data frame
mtcars %>%
select(mpg, hp) %>%
cor
library(pipeR)
mtcars$mpg %>>% # take variable "mpg" only
sample(size = 10000, replace = TRUE) %>>% # make a random draw with replacement
density(kernel = "gaussian") %>>% # calculate density function
plot(main = "density of mpg (bootstrap)") # plot the resu
library(pipeR)
apply(
1:10ˆ6, # below function will be repeated 1,000,000 times
function(i) {
sample(LETTERS, 4, replace = F) %>>% # draw 4 letters
paste(collapse = "") %>>% # concatenate them
"=="("COMP")
}) %>>% # check if result is COMP
sum # check how many times there is an equality
Joining data
Package base has a function merge()
, dplyr
brings a set of join functions.
band_members # data frame 1
band_instruments # data frame 2
# provide first and second data frames to join
# or other names - left and right data frames
inner_join(band_members, band_instruments, by = "name")
## # A tibble: 2 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
# an alternative is to use piping
# left data frame is piped into a function with right data frame provided as a parameter
band_members %>% inner_join(band_instruments, by="name")
## # A tibble: 2 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
# outer join - keep all rows from both data frames and fill missing cells by NA
band_members %>% full_join(band_instruments, by="name")
## # A tibble: 4 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones NA
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith NA guitar
# left join - keep all rows from the left data frame and only common from the right one
band_members %>% left_join(band_instruments, by="name")
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones NA
## 2 John Beatles guitar
## 3 Paul Beatles bass
# right join - keep all rows from the right data frame and only common from the left one
band_members %>% right_join(band_instruments, by="name")
## # A tibble: 3 × 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith NA guitar
# prepare data for the example
df1 <- starwars %>% select(name, height, gender, homeworld, species)
df2 <- starwars %>% select(mass, height, gender, homeworld, species)
df3 <- starwars %>% select(name, hair_color, skin_color, eye_color)
#### try to join that data back
# we can not use a single variable to join as their values are not unique
# result of the joining would be not what you expected, e.g.
df1 %>% inner_join(df2, by="gender") %>% head(5)
# however, using multiple common variables might be more productive
df1 %>% inner_join(df2, by=c("height", "gender", "homeworld", "species")) %>% head(5)
# joining three data frames by piping result of the first join into the second one
df1 %>%
inner_join(df2, by=c("height", "gender", "homeworld", "species")) %>%
inner_join(df3, by="name") %>%
head(5)
Other functions
mutate*
# create a small data set for the example
df <- mtcars %>% select(mpg, hp, cyl)
head(df)
## mpg hp cyl
## Mazda RX4 21.0 110 6
## Mazda RX4 Wag 21.0 110 6
## Datsun 710 22.8 93 4
## Hornet 4 Drive 21.4 110 6
## Hornet Sportabout 18.7 175 8
## Valiant 18.1 105 6
# do summation for each column and get new columns
df %>% mutate_all(list(mysum = ~ sum(.))) %>% head(5)
## mpg hp cyl mpg_mysum hp_mysum cyl_mysum
## Mazda RX4 21.0 110 6 642.9 4694 198
## Mazda RX4 Wag 21.0 110 6 642.9 4694 198
## Datsun 710 22.8 93 4 642.9 4694 198
## Hornet 4 Drive 21.4 110 6 642.9 4694 198
## Hornet Sportabout 18.7 175 8 642.9 4694 198
# normalise each column
df %>% mutate_all(list(norm = ~ scale(.))) %>% head(5)
## mpg hp cyl mpg_norm hp_norm cyl_norm
## Mazda RX4 21.0 110 6 0.1508848 -0.5350928 -0.1049878
## Mazda RX4 Wag 21.0 110 6 0.1508848 -0.5350928 -0.1049878
## Datsun 710 22.8 93 4 0.4495434 -0.7830405 -1.2248578
## Hornet 4 Drive 21.4 110 6 0.2172534 -0.5350928 -0.1049878
## Hornet Sportabout 18.7 175 8 -0.2307345 0.4129422 1.0148821
# mutate only selected columns, which names can be provided as character vector
my_cols <- c("hp", "mpg")
df %>% mutate_at(my_cols, list(norm = ~ scale(.))) %>% head(5)
## mpg hp cyl hp_norm mpg_norm
## Mazda RX4 21.0 110 6 -0.5350928 0.1508848
## Mazda RX4 Wag 21.0 110 6 -0.5350928 0.1508848
## Datsun 710 22.8 93 4 -0.7830405 0.4495434
## Hornet 4 Drive 21.4 110 6 -0.5350928 0.2172534
## Hornet Sportabout 18.7 175 8 0.4129422 -0.2307345
# mutate only columns that satisfy some criteria
# if column is type double - convert it to character
df %>% mutate_if(is.double, as.character) %>% head(5)
## mpg hp cyl
## Mazda RX4 21 110 6
## Mazda RX4 Wag 21 110 6
## Datsun 710 22.8 93 4
## Hornet 4 Drive 21.4 110 6
## Hornet Sportabout 18.7 175 8
####################################################
# mutate only columns according to "fancy" criteria
# and then use multiple functions for transformation
# prepare a custom function that return TRUE or FALSE for the column
my_test <- function(x) { any(x > 10) }
# use that "fancy" function to select columns for mutation
df %>% mutate_if(my_test, list(norm = ~ scale(.), sqrt = ~ sqrt(.))) %>% head(5)
## mpg hp cyl mpg_norm hp_norm mpg_sqrt hp_sqrt
## Mazda RX4 21.0 110 6 0.1508848 -0.5350928 4.582576 10.488088
## Mazda RX4 Wag 21.0 110 6 0.1508848 -0.5350928 4.582576 10.488088
## Datsun 710 22.8 93 4 0.4495434 -0.7830405 4.774935 9.643651
## Hornet 4 Drive 21.4 110 6 0.2172534 -0.5350928 4.626013 10.488088
## Hornet Sportabout 18.7 175 8 -0.2307345 0.4129422 4.324350 13.228757
transmute*
# simple alternative to mutate() - result is one column only
df %>% transmute(new_mpg = mpgˆ2) %>% head(5)
# error occurs
# create and keep only normalised versions of all variables
df %>% transmute_all(list(norm = ~ scale(.))) %>% head(5)
## mpg_norm hp_norm cyl_norm
## Mazda RX4 0.1508848 -0.5350928 -0.1049878
## Mazda RX4 Wag 0.1508848 -0.5350928 -0.1049878
## Datsun 710 0.4495434 -0.7830405 -1.2248578
## Hornet 4 Drive 0.2172534 -0.5350928 -0.1049878
## Hornet Sportabout -0.2307345 0.4129422 1.0148821
# create and keep normalised versions of selected variables only
my_cols <- c("hp", "mpg")
df %>% transmute_at(my_cols, list(norm = ~ scale(.))) %>% head(5)
## hp_norm mpg_norm
## Mazda RX4 -0.5350928 0.1508848
## Mazda RX4 Wag -0.5350928 0.1508848
## Datsun 710 -0.7830405 0.4495434
## Hornet 4 Drive -0.5350928 0.2172534
## Hornet Sportabout 0.4129422 -0.2307345
# use a custom function to select columns for mutation
# and keep only new variables
my_test <- function(x) { any(x > 10) }
df %>% transmute_if(my_test, list(norm = ~ scale(.), sqrt = ~ sqrt(.))) %>% head(5)
## mpg_norm hp_norm mpg_sqrt hp_sqrt
## Mazda RX4 0.1508848 -0.5350928 4.582576 10.488088
## Mazda RX4 Wag 0.1508848 -0.5350928 4.582576 10.488088
## Datsun 710 0.4495434 -0.7830405 4.774935 9.643651
## Hornet 4 Drive 0.2172534 -0.5350928 4.626013 10.488088
## Hornet Sportabout -0.2307345 0.4129422 4.324350 13.228757
select*
# pull values from the variable into a vector
mtcars %>% pull(mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
# You can find all of them in the help file by command ?tidyselect::select_helpers
# select only variables that starts with "d"
mtcars %>% select(starts_with("d")) %>% head(5)
## disp drat
## Mazda RX4 160 3.90
## Mazda RX4 Wag 160 3.90
## Datsun 710 108 3.85
## Hornet 4 Drive 258 3.08
## Hornet Sportabout 360 3.15
# select only variables that ends with "p"
mtcars %>% select(ends_with("p")) %>% head(5)
## disp hp
## Mazda RX4 160 110
## Mazda RX4 Wag 160 110
## Datsun 710 108 93
## Hornet 4 Drive 258 110
## Hornet Sportabout 360 175
# select only variables that contains "r"
mtcars %>% select(contains("r")) %>% head(5)
## drat gear carb
## Mazda RX4 3.90 4 4
## Mazda RX4 Wag 3.90 4 4
## Datsun 710 3.85 4 1
## Hornet 4 Drive 3.08 3 1
## Hornet Sportabout 3.15 3 2
# select only variables that match a regular expression
# variables containing "sp" or "ca"
mtcars %>% select(matches("sp|ca")) %>% head(5)
## disp carb
## Mazda RX4 160 4
## Mazda RX4 Wag 160 4
## Datsun 710 108 1
## Hornet 4 Drive 258 1
## Hornet Sportabout 360 2
# select variables listed in the character vector
my_vars <- c("mpg", "hp", "am")
mtcars %>% select(all_of(my_vars)) %>% head(5)
## mpg hp am
## Mazda RX4 21.0 110 1
## Mazda RX4 Wag 21.0 110 1
## Datsun 710 22.8 93 1
## Hornet 4 Drive 21.4 110 0
## Hornet Sportabout 18.7 175 0
Besides select_helpers functions, there are also alternatives with _all()
, _at()
and _if()
– similar to mutate()
function. You can select variables if they satisfy some criteria. You can use select_at()
with the vector of characters listing variables of interest.
Function select_all()
looks strange – the source data frame already has all variables, so there is no point to select them again. Well, select has some extra functionality – it can rename selected variables.
# select all variables and change their names for uppercase, then show top 5 rows only
mtcars %>% select_all(toupper) %>% head(5)
## MPG CYL DISP HP DRAT WT QSEC VS AM GEAR CARB
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
There is a wrapper of select()
function that makes the task of renaming more obvious and clear – rename()
. Both functions do a very similar job but the first one retains only selected variables while the second one keeps all variables.
my_cols <- c("hp", "mpg")
# select columns listed in the vector and change names to TitleCase
mtcars %>% select_at(my_cols, tools::toTitleCase) %>% head(5)
## Hp Mpg
## Mazda RX4 110 21.0
## Mazda RX4 Wag 110 21.0
## Datsun 710 93 22.8
## Hornet 4 Drive 110 21.4
## Hornet Sportabout 175 18.7
# change names of columns listed in the vector and keep everything else as is
mtcars %>% rename_at(my_cols, tools::toTitleCase) %>% head(5)
## Mpg cyl disp Hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
summarise*
To aggregate all data or data in each group you can use function summarise()
or summarize()
if you prefer American spelling. This function was presented in the section about main functions. Similar to mutate()
function there are scoped variants of summarise()
that allows to skip providing names for all variables you want to summarise/aggregate
.summarise_all()
applies a function to all variables.
# select only three columns to make example smaller
df <- mtcars %>% select(mpg, cyl, hp)
# get mean and median values for every variable/column in the data frame
df %>% summarise_all(list(m1 = mean, m2 = median))
## mpg_m1 cyl_m1 hp_m1 mpg_m2 cyl_m2 hp_m2
## 1 20.09062 6.1875 146.6875 19.2 6 123
summarise_at()
apply a function to variables selected with a character vector.
# prepare a character vector with variables to analyse
my_vars <- c("mpg", "hp", "cyl")
# get mean for each selected variable
mtcars %>% summarise_at(my_vars, mean)
## mpg hp cyl
## 1 20.09062 146.6875 6.1875
summarise_if()
apply a function to variables selected with a function that returns TRUE or FASLE for each variable
# get mean for variables that function is.numeric() returns TRUE
mtcars %>% summarise_if(is.numeric, mean)
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 3.6875 2.8125
Most aggregation R functions can be used with summarise():
- mean(), sd(), var(), sum()
- median(), IQR() , min(), max(), quantile()
There are some uniquely dplyr functions that might be handy:
n()
- count number of rowsn_distinct()
- count unique valuesfirst()
,last()
,nth()
– first, last and nth value in a variable/column. The result will depend on the order of rows.
# get a number of rows, max value in "mpg", number of unique values in "cyl"
mtcars %>% summarise(n_rows = n(), max_mpg = max(mpg), n_cyl = n_distinct(cyl))
## n_rows max_mpg n_cyl
## 1 32 33.9 3
Package tidyr
Package tidyr
is another package from tidyverse collection. It focuses on making data “tidy”, that is, nice and ready for analysis using other tools and dplyr in particular. Here is a web site – https://tydir.tidyverse.org/index.html – for the package with the introduction and cheat sheet
Conversion between wide and long tables
Data we get from our clients or other sources are not always ready for analysis. It is really important to understand what are variables, observations and values in the data. This is not an obvious question. For example, below is a data set with results of religion and income survey.
# load all packages at once - dplyr, tidyr, purrr, readr and some others
library(tidyverse)
head(relig_income, 10)
## # A tibble: 10 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k~ 15 14 15 11 10 35 21
## 6 Evangel~ 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori~ 228 244 236 238 197 223 131
## 9 Jehovah~ 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## # ... with 3 more variables: $100-150k <dbl>, >150k <dbl>,
## # Don't know/refused <dbl>
However, if you use your data understanding then you see that there are only three real variables: name of a religion, income group and number of respondents. This format of representing data (with 11 columns as above) is called “wide table”, while for a proper data analysis you need (in most cases) so-called “long table” format (with 3 columns only). Likely for you, there are multiple tools to do a conversion.
# convert wide table into a long table
# key is a name of the new variable to put original column names as values
# value is a name of the new variable to store all values from original columns
# -religion is a variable that should stay as is and not to be included in "gathering"
relig_income_long <- gather(relig_income, key = "Income", value = "Counts", -religion)
# check the new data frame size - just three columns but 180 rows
# it is truely a long table
dim(relig_income_long)
Long table is good for data analysis and data visualisation. Wide table is good for reporting and for a small number of data analysis techniques. So, there is a backward conversion available.
# convert long table into a wide table
# key variable is used to create new columns
# value variable provides values for these new columns
relig_income_wide <- spread(relig_income_long, key = "Income", value = "Counts")
# check the new data frame size - it is wide and short
dim(relig_income_wide)
# convert to long table
relig_income_long <- relig_income %>% pivot_longer(col = -religion, names_to = "Income", values_to = "Counts")
dim(relig_income_long) # check the size of the long table
# convert to wide table
relig_income_wide <- relig_income_long %>% pivot_wider(names_from = "Income", values_from = "Counts")
dim(relig_income_wide) # check the size of the wide table
Missing values treatment
Very useful functionality of tidyr working well with dplyr is a treatment for missing values You can remove rows with missing values or replace them with something. For example
# create a small data frame for the example
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "b"), stringsAsFactors = FALSE)
df # have a look on the result - two variable "x" and "y"
## x y
## 1 1 a
## 2 2 <NA>
## 3 NA b
df %>% drop_na() # drop all rows with NA values
## x y
## 1 1 a
## 2 2 <NA>
# variables "x" and "y" are of different type, so we replace NA differently
df %>% replace_na(list(x = 0, y = "unknown"))
## x y
## 1 1 a
## 2 2 unknown
## 3 0 b
# replacement for one variable only
df %>% replace_na(list(x = 0))
## x y
## 1 1 a
## 2 2 <NA>
## 3 0 b
# alternative approach is to combine tidyr and dplyr and use replacement for a vector
# as every variable in a data frame is a vector. You can drop "list" here.
df %>% mutate(xx = replace_na(x, 0))
## x y xx
## 1 1 a 1
## 2 2 <NA> 2
## 3 NA b 0
Package purrr: Great tool for text analysis
One more package to mention here is purrr. As usual, there is a web site for the package https://purrr.tidyverse.org/index.html with the introduction and cheat sheet. As dplyr is a tool of choice for any kinds of data frame transformations, package purrr is an excellent tool for transformation of lists and vectors. Hence, if you work with structured or semi-structured data, which happens most of the time, you use dplyr.
If you get into the area of unstructured data, for example, text analysis, then you need another tool for data transformations and purrr is great here. You will do this type of analysis later.
The main function in purrr is map() and it has a large number of variants. Function map() does the same job as functions sapply() or lapply() presented previously but map() does this job way more efficient. As purrr is a part of tidyverse collection, it supports piping.
1:5 %>% # vector as an input
map(rnorm, n = 3) # some function applied to each element of the input vector
## [[1]]
## [1] 2.1153059 0.7227579 1.8153231
##
## [[2]]
## [1] 3.286483 1.340960 1.557657
##
## [[3]]
## [1] 2.876652 3.066912 3.075800
##
## [[4]]
## [1] 4.612661 2.392644 3.399425
##
## [[5]]
## [1] 3.522628 5.100503 5.504073
By default, result of function map()
is a list. However, there are alternatives that allows to output different types of vectors and even a data frame.
# prepare an input list
favorite_desserts <- list(Sophia = "banana bread",
Eliott = "pancakes",
Karina = "chocolate cake")
# apply function paste() and output a character vector
favorite_desserts %>% map_chr(~ paste(.x, "rocks!"))
## Sophia Eliott Karina
## "banana bread rocks!" "pancakes rocks!" "chocolate cake rocks!"
favorite_desserts %>% map_chr(function(x) paste(x, "rocks!"))
## Sophia Eliott Karina
## "banana bread rocks!" "pancakes rocks!" "chocolate cake rocks!"
mtcars %>% # input is a data frame
split(.$cyl) %>% # split mtcars in three data frames
map(~ lm(mpg ~ wt, data = .x)) %>% # run linear model for each data frame
map(summary) %>% # get summary of fits for each model
map_dbl("r.squared") # extract values of R-squared and store then as a vector