Data Management with the `tidyverse`

Adapted from parts of Mine Çetinkaya-Rundel’s tidyverse course

Nicky Wakim

2023-01-10

A cartoon of a fuzzy round monster face showing 10 different emotions experienced during the process of debugging code. The progression goes from (1) “I got this” - looking determined and optimistic; (2) “Huh. Really thought that was it.” - looking a bit baffled; (3) “...” - looking up at the ceiling in thought; (4) “Fine. Restarting.” - looking a bit annoyed; (5) “OH WTF.” Looking very frazzled and frustrated; (6) “Zombie meltdown.” - looking like a full meltdown; (7) (blank) - sleeping; (8) “A NEW HOPE!” - a happy looking monster with a lightbulb above; (9) “insert awesome theme song” - looking determined and typing away; (10) “I love coding” - arms raised in victory with a big smile, with confetti falling.

Artwork by @allison_horst

Introduction to the `tidyverse`

What is the tidyverse?

The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

ggplot2 - data visualisation
dplyr - data manipulation
tidyr - tidy data
readr - read rectangular data
purrr - functional programming
tibble - modern data frames
stringr - string manipulation
forcats - factors
and many more …

Tidy data¹

tidy data diagram

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Pipe operator (`magrittr`)

The pipe operator (%>%) allows us to step through sequential functions in the same way we follow if-then statements or steps from instructions

I want to find my keys, then start my car, then drive to work, then park my car.

Nested

park(drive(start_car(find("keys")), 
           to = "work"))

Piped

find("keys") %>%
  start_car() %>%
  drive(to = "work") %>%
  park()

Recoding a binary variable with pipe operator

Let’s say I want a variable transmission to show the category names that are assigned to numeric values in the code. I want 0 to be coded as automatic and 1 to be coded as manual.

Base R:

mtcars$transmission <-
  ifelse(
    mtcars$am == 0,
    "automatic",
    "manual"
  )

Tidyverse:

mtcars <- mtcars %>%
  mutate(
    transmission = case_when(
      am == 0 ~ "automatic",
      am == 1 ~ "manual"
    )
  )

mutate() creates new columns that are functions of existing variables

Recoding a multi-level variable

Let’s say I want a variable gear to show the category names that are assigned to numeric values in the code. I want 3 to be coded as gear three, 4 to be coded as gear four, 5 to be coded as gear five.

Base R:

mtcars$gear_char <-
  ifelse(
    mtcars$gear == 3,
    "three",
    ifelse(
      mtcars$gear == 4,
      "four",
      "five"
    )
  )

Tidyverse:

mtcars <- mtcars %>%
  mutate(
    gear_char = case_when(
      gear == 3 ~ "three",
      gear == 4 ~ "four",
      gear == 5 ~ "five"
    )
  )

`ggplot2` revisited

`ggplot2` in tidyverse

We talked about this in our review notes
- I want to revisit it: always helps to have more examples!
- This example is closer to the multivariable work we’ll do in this class!

ggplot2 is tidyverse’s data visualization package

The gg in “ggplot2” stands for Grammar of Graphics

It is inspired by the book Grammar of Graphics by Leland Wilkinson

Tidyverse: Visualizing multiple variables

ggplot(
  mtcars,
  aes(x = disp, y = mpg, color = transmission)) +
  geom_point()

Poll Everywhere Question 1

Tidyverse: Visualizing even more variables

ggplot(
  mtcars,
  aes(x = disp, y = mpg, color = transmission)) +
  geom_point() +
  facet_wrap(~ cyl)

Base R: Visualizing even more variables

mtcars$trans_color <- ifelse(mtcars$transmission == "automatic", "green", "blue")
mtcars_cyl4 = mtcars[mtcars$cyl == 4, ]
mtcars_cyl6 = mtcars[mtcars$cyl == 6, ]
mtcars_cyl8 = mtcars[mtcars$cyl == 8, ]
par(mfrow = c(1, 3), mar = c(2.5, 2.5, 2, 0), mgp = c(1.5, 0.5, 0))
plot(mpg ~ disp, data = mtcars_cyl4, col = trans_color, main = "Cyl 4")
plot(mpg ~ disp, data = mtcars_cyl6, col = trans_color, main = "Cyl 6")
plot(mpg ~ disp, data = mtcars_cyl8, col = trans_color, main = "Cyl 8")
legend("topright", legend = c("automatic", "manual"), pch = 1, col = c("green", "blue"))

Functions for data management

Important functions for data management

Data manipulation

pivot_longer() and pivot_wider()
rename()
mutate()
filter()
select()

Summarizing data

tbl_summary()
group_by()
summarize()
across()

Data manipulation

Example for `pivot_longer()`: Instructional staff employment trends

The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. This report by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below.

Data

Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are percentage of hires of that type of faculty for each year.

(staff <- read_csv("data/instructional-staff.csv"))

# A tibble: 5 × 12
  faculty_type    `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007`
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Full-Time Tenu…   29     27.6   25     24.8   21.8   20.3   19.3   17.8   17.2
2 Full-Time Tenu…   16.1   11.4   10.2    9.6    8.9    9.2    8.8    8.2    8  
3 Full-Time Non-…   10.3   14.1   13.6   13.6   15.2   15.5   15     14.8   14.9
4 Part-Time Facu…   24     30.4   33.1   33.2   35.5   36     37     39.3   40.5
5 Graduate Stude…   20.5   16.5   18.1   18.8   18.7   19     20     19.9   19.5
# ℹ 2 more variables: `2009` <dbl>, `2011` <dbl>

Poll Everywhere Question 2

Recreate the visualization

In order to recreate this visualization we need to first reshape the data:
- one variable for faculty type
- one variable for year

Convert the data from the wide format to long format
- pivot_longer()

`pivot_*()` functions

Poll Everywhere Question 3

Pivot staff data and mutate percentage

(staff_long <- staff %>%
  pivot_longer(
    cols = -faculty_type,    # columns to pivot
    names_to = "year",       # name of new column for variable names
    values_to = "percentage" # name of new column for values
  ) %>%
  mutate(percentage = as.numeric(percentage))
)

# A tibble: 55 × 3
   faculty_type              year  percentage
   <chr>                     <chr>      <dbl>
 1 Full-Time Tenured Faculty 1975        29  
 2 Full-Time Tenured Faculty 1989        27.6
 3 Full-Time Tenured Faculty 1993        25  
 4 Full-Time Tenured Faculty 1995        24.8
 5 Full-Time Tenured Faculty 1999        21.8
 6 Full-Time Tenured Faculty 2001        20.3
 7 Full-Time Tenured Faculty 2003        19.3
 8 Full-Time Tenured Faculty 2005        17.8
 9 Full-Time Tenured Faculty 2007        17.2
10 Full-Time Tenured Faculty 2009        16.8
# ℹ 45 more rows

A “meh” plot over the years

ggplot(staff_long, aes(x = percentage, y = year, fill = faculty_type)) +
  geom_col()

More improvement

staff_long %>%
  mutate( 
    part_time = if_else(faculty_type == "Part-Time Faculty",
                        "Part-Time Faculty", "Other Faculty"),
    year = as.numeric(year)) %>% 
  ggplot(
    aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) +
  geom_line() +
  scale_color_manual(values = c("gray", "red")) + 
  scale_y_continuous(labels = label_percent(accuracy = 1)) + 
  theme_minimal() +
  labs(
    title = "Instructional staff employment trends",
    x = "Year", y = "Percentage", color = NULL) +
  theme(legend.position = "bottom")

All that just to show one helpful function

Now we can move onto the other functions mentioned:

Data manipulation

pivot_longer() and pivot_wider()
rename()
mutate()
filter()
select()

Summarizing data

tbl_summary()
group_by()
summarize()
across()

Let’s look back at the `dds.discr` dataset that I briefly used last class

We will load the data (This is a special case! dds.discr is a built-in R dataset)

data("dds.discr")

Now, let’s take a glimpse at the dataset:

glimpse(dds.discr)

Rows: 1,000
Columns: 6
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender       <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity    <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

`rename()`: one of the first things I usually do

I notice that two variables have values that don’t necessarily match the variable name
- Female and male are not genders
- “White not Hispanic” combines race and ethnicity into one category

I want to rename gender to SAB (sex assigned at birth) and rename ethnicity to R_E (race and ethnicity)

dds.discr1 = dds.discr %>% 
  rename(SAB = gender, 
         R_E = ethnicity)

glimpse(dds.discr1)

Rows: 1,000
Columns: 6
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB          <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ R_E          <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…

`mutate()`: constructing new variables from what you have

We’ve seen a couple examples for mutate() so far (mostly because its used so often!)
We haven’t seen an example where we make a new variable from two variables

I want to make a variable that is the ratio of expenditures over age

dds.discr2 = dds.discr1 %>%
  mutate(exp_to_age = expenditures/age)

glimpse(dds.discr2)

Rows: 1,000
Columns: 7
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB          <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ R_E          <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
$ exp_to_age   <dbl> 124.2941, 1133.0811, 484.6667, 336.8421, 339.3846, 304.40…

Poll Everywhere Question 4

`filter()`: keep rows that match a condition

What if I want to subset the data frame? (keep certain rows of observations)

I want to look at the data for people who between 50 and 60 years old

dds.discr3 = dds.discr2 %>%
  filter(age >= 50 & age <= 60)

glimpse(dds.discr3)

Rows: 23
Columns: 7
$ id           <int> 15970, 19412, 29506, 31658, 36123, 39287, 39672, 43455, 4…
$ age.cohort   <fct> 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51…
$ age          <int> 51, 60, 56, 60, 59, 59, 54, 57, 52, 57, 55, 52, 59, 54, 5…
$ SAB          <fct> Female, Female, Female, Female, Male, Female, Female, Mal…
$ expenditures <int> 54267, 57702, 48215, 46873, 42739, 44734, 52833, 48363, 5…
$ R_E          <fct> White not Hispanic, White not Hispanic, White not Hispani…
$ exp_to_age   <dbl> 1064.0588, 961.7000, 860.9821, 781.2167, 724.3898, 758.20…

`select()`: keep or drop columns using their names and types

What if I want to remove or keep certain variables?

I want to only have age and expenditure in my data frame

dds.discr4 = dds.discr2 %>%
  select(age, expenditures)

glimpse(dds.discr4)

Rows: 1,000
Columns: 2
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…

Summarizing Data

`tbl_summary()` : table summary (1/2)

What if I want one of those fancy summary tables that are at the top of most research articles? (lovingly called “Table 1”)

library(gtsummary)
tbl_summary(dds.discr2)

Characteristic	N = 1,000¹
id	55,385 (31,809, 76,135)
age.cohort
0-5	82 (8.2%)
6-12	175 (18%)
13-17	212 (21%)
18-21	199 (20%)
22-50	226 (23%)
51+	106 (11%)
age	18 (12, 26)
SAB
Female	503 (50%)
Male	497 (50%)
expenditures	7,026 (2,899, 37,713)
R_E
American Indian	4 (0.4%)
Asian	129 (13%)
Black	59 (5.9%)
Hispanic	376 (38%)
Multi Race	26 (2.6%)
Native Hawaiian	3 (0.3%)
Other	2 (0.2%)
White not Hispanic	401 (40%)
exp_to_age	462 (274, 938)
¹ Median (IQR); n (%)

`tbl_summary()` : table summary (2/2)

Let’s make this more presentable

dds.discr2 %>%
  select(-id, -age.cohort, -exp_to_age) %>%
  tbl_summary(label = c(age ~ "Age", 
                        R_E ~ "Race/Ethnicity", 
                        SAB ~ "Sex Assigned at Birth", 
                        expenditures ~ "Expenditures") ,
              statistic = list(all_continuous() ~ "{mean} ({sd})"))

Characteristic	N = 1,000¹
Age	23 (18)
Sex Assigned at Birth
Female	503 (50%)
Male	497 (50%)
Expenditures	18,066 (19,543)
Race/Ethnicity
American Indian	4 (0.4%)
Asian	129 (13%)
Black	59 (5.9%)
Hispanic	376 (38%)
Multi Race	26 (2.6%)
Native Hawaiian	3 (0.3%)
Other	2 (0.2%)
White not Hispanic	401 (40%)
¹ Mean (SD); n (%)

`group_by()`: group by one or more variables

What if I want to quickly look at group differences?
It will not change how the data look, but changes the actions of following functions

I want to group my data by sex assigned at birth.

dds.discr5 = dds.discr2 %>%
  group_by(SAB)
glimpse(dds.discr5)

Rows: 1,000
Columns: 7
Groups: SAB [2]
$ id           <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB          <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ R_E          <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
$ exp_to_age   <dbl> 124.2941, 1133.0811, 484.6667, 336.8421, 339.3846, 304.40…

Let’s see how the groups change something like the summarize() function in the next slide

`summarize()`: summarize your data or grouped data into one row

What if I want to calculate specific descriptive statistics for my variables?
This function is often best used with group_by()
If only presenting the summaries, functions like tbl_summary() is better
summarize() creates a new data frame, which means you can plot and manipulate the summarized data

Over whole sample:

dds.discr2 %>% 
  summarize(
    ave = mean(expenditures),
    SD = sd(expenditures),
    med = median(expenditures))

# A tibble: 1 × 3
     ave     SD   med
   <dbl>  <dbl> <dbl>
1 18066. 19543.  7026

Grouped by sex assigned at birth:

dds.discr2 %>% 
  group_by(SAB) %>% 
  summarize(
    ave = mean(expenditures),
    SD = sd(expenditures),
    med = median(expenditures))

# A tibble: 2 × 4
  SAB       ave     SD   med
  <fct>   <dbl>  <dbl> <int>
1 Female 18130. 20020.  6400
2 Male   18001. 19068.  7219

`across()`: apply a function across multiple columns

Like group_by(), this function is often paired with another transformation function

I want all my integer values to have two significant figures.

dds.discr6 = dds.discr2 %>%
  mutate(across(where(is.integer), signif, digits = 2))

glimpse(dds.discr6)

Rows: 1,000
Columns: 7
$ id           <dbl> 10000, 10000, 10000, 11000, 11000, 11000, 11000, 11000, 1…
$ age.cohort   <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age          <dbl> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB          <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <dbl> 2100, 42000, 1500, 6400, 4400, 4600, 3900, 3900, 5000, 29…
$ R_E          <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
$ exp_to_age   <dbl> 124.2941, 1133.0811, 484.6667, 336.8421, 339.3846, 304.40…

Resources

`dplyr` resources

More dpylr functions to reference!

Additional details and examples are available in the vignettes:

and the dplyr 1.0.0 release blog posts:

R programming class at OHSU!

You can check out Dr. Jessica Minnier’s R class page if you want more notes, videos, etc.

The larger tidy ecosystem

Just to name a few…

Credit to Mine Çetinkaya-Rundel

These notes were built from Mine’s notes
- Most pages and code were left as she made them
- I changed a few things to match our class
Please see her Github repository for the original notes

Data Management with the tidyverse

Introduction to the tidyverse