tidyverse
2024-11-11
tidyverse
The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Tidy data follows three rules:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
magrittr
The pipe (%>%) allows us to step through sequential functions in the same way we follow if-then statements or steps from instructions:
I want to find my keys, then start my car, then drive to work, then park my car.
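As a small sketch of the same idea in R (the vector x and the functions chosen here are illustrative assumptions, not taken from the original slides), the nested call and the piped version below should compute the same value. The piped version reads left to right, like the steps above.

```r
library(magrittr)  # provides the %>% pipe

# Illustrative data (an assumption for this sketch)
x <- c(1, 4, 9, 16)

# Nested calls read inside-out
nested <- round(mean(sqrt(x)), 1)

# Piped calls read in order: take x, then sqrt, then mean, then round
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```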
Data transformation
rename()
mutate()
pivot_longer() and pivot_wider()
Data subsetting
filter()
select()
The dds.discr dataset
In the US, individuals with developmental disabilities typically receive services and support from state governments.
dds.discr is a built-in R dataset. Here is a glimpse of it:
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ gender <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ ethnicity <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
rename(): one of the first things I usually do
I notice that two variables have values that don’t necessarily match the variable name:
Female and male are not genders (NIH page on sex and gender)
“White not Hispanic” combines race and ethnicity into one category (APA page on race and ethnicity)
I want to rename gender to SAB (sex assigned at birth, although I am not sure whether the data record sex assigned at birth or current sex) and rename ethnicity to R_E (race and ethnicity).
rename() can change the name of a column.
We use: data %>% rename(new_col_name = old_col_name)
Rows: 1,000
Columns: 6
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ R_E <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
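The renaming step above might look like the following sketch. It uses a three-row stand-in data frame (called toy here, a hypothetical name) built from the first values shown in the glimpse output, so that it runs without loading dds.discr.

```r
library(dplyr)

# Stand-in for dds.discr: first three rows as shown in the glimpse output
toy <- data.frame(
  id = c(10210, 10409, 10486),
  gender = c("Female", "Male", "Male"),
  ethnicity = c("White not Hispanic", "White not Hispanic", "Hispanic")
)

# Pattern: rename(new_col_name = old_col_name)
toy_renamed <- toy %>%
  rename(SAB = gender, R_E = ethnicity)

names(toy_renamed)
```

Several columns can be renamed in one call, and columns that are not mentioned keep their names and positions.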
mutate(): constructing new variables from what you have
We can create a new variable from other variables.
We often use it like: data %>% mutate(new_col_name = expression_using_existing_columns)
mutate(): create a new variable from two other variables
I want to make a variable that is the ratio of expenditures over age.
Rows: 1,000
Columns: 7
$ id <int> 10210, 10409, 10486, 10538, 10568, 10690, 10711, 10778, 1…
$ age.cohort <fct> 13-17, 22-50, 0-5, 18-21, 13-17, 13-17, 13-17, 13-17, 13-…
$ age <int> 17, 37, 3, 19, 13, 15, 13, 17, 14, 13, 13, 14, 15, 17, 20…
$ SAB <fct> Female, Male, Male, Female, Male, Female, Female, Male, F…
$ expenditures <int> 2113, 41924, 1454, 6400, 4412, 4566, 3915, 3873, 5021, 28…
$ R_E <fct> White not Hispanic, White not Hispanic, Hispanic, Hispani…
$ exp_to_age <dbl> 124.2941, 1133.0811, 484.6667, 336.8421, 339.3846, 304.40…
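A minimal sketch of the ratio calculation, again on a stand-in data frame (toy, a hypothetical name) holding the first three age and expenditures values from the glimpse output. The last line is an added illustration: in R, dividing a positive number by 0 gives Inf, which matters for any row where age is 0.

```r
library(dplyr)

# Stand-in for dds.discr: first three rows as shown in the glimpse output
toy <- data.frame(
  age = c(17, 37, 3),
  expenditures = c(2113, 41924, 1454)
)

# Add a new column computed from two existing columns
toy <- toy %>%
  mutate(exp_to_age = expenditures / age)

toy

# Dividing by an age of 0 yields Inf rather than an error
is.infinite(100 / 0)
```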
Can we recreate age.cohort using the age variable?
       id         age.cohort      age           SAB       expenditures
 Min.   :10210   0-5  : 82   Min.   : 0.0   Female:503   Min.   :  222
 1st Qu.:31809   6-12 :175   1st Qu.:12.0   Male  :497   1st Qu.: 2899
 Median :55384   13-17:212   Median :18.0                Median : 7026
 Mean   :54663   18-21:199   Mean   :22.8                Mean   :18066
 3rd Qu.:76135   22-50:226   3rd Qu.:26.0                3rd Qu.:37713
 Max.   :99898   51+  :106   Max.   :95.0                Max.   :75098

                 R_E        exp_to_age
 White not Hispanic:401   Min.   : 27.57
 Hispanic          :376   1st Qu.:273.88
 Asian             :129   Median :461.75
 Black             : 59   Mean   :   Inf
 Multi Race        : 26   3rd Qu.:938.12
 American Indian   :  4   Max.   :   Inf
 (Other)           :  5
mutate() and case_when()
case_when() is a helpful function for mapping values to a category, such as recreating age.cohort from age in dds.discr.
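One way this mapping could be written is sketched below. The cut points are read off the age.cohort levels shown in the summary output (0-5, 6-12, 13-17, 18-21, 22-50, 51+). The stand-in data frame toy and its rows are taken from values visible in the printed outputs, and the exact code used in the original slides is not shown, so treat this as an assumption.

```r
library(dplyr)

# Stand-in rows: ages and cohorts as they appear in the printed outputs
toy <- data.frame(
  age = c(17, 37, 3, 19, 13, 60),
  age.cohort = c("13-17", "22-50", "0-5", "18-21", "13-17", "51+")
)

# case_when() checks the conditions in order and uses the first one that is TRUE
toy2 <- toy %>%
  mutate(age.cohort2 = case_when(
    age <= 5  ~ "0-5",
    age <= 12 ~ "6-12",
    age <= 17 ~ "13-17",
    age <= 21 ~ "18-21",
    age <= 50 ~ "22-50",
    age >= 51 ~ "51+"
  ))

# Compare the recreated cohort with the original one
all(toy2$age.cohort2 == toy2$age.cohort)
```

Because the new column is built from character labels, it is a character vector rather than a factor. Rows with a missing age match no condition and get NA.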
filter(): keep rows that match a condition
I want to look at the data for people who are between 50 and 60 years old.
Rows: 23
Columns: 8
$ id <int> 15970, 19412, 29506, 31658, 36123, 39287, 39672, 43455, 4…
$ age.cohort <fct> 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51+, 51…
$ age <int> 51, 60, 56, 60, 59, 59, 54, 57, 52, 57, 55, 52, 59, 54, 5…
$ SAB <fct> Female, Female, Female, Female, Male, Female, Female, Mal…
$ expenditures <int> 54267, 57702, 48215, 46873, 42739, 44734, 52833, 48363, 5…
$ R_E <fct> White not Hispanic, White not Hispanic, White not Hispani…
$ exp_to_age <dbl> 1064.0588, 961.7000, 860.9821, 781.2167, 724.3898, 758.20…
$ age.cohort2 <chr> "51+", "51+", "51+", "51+", "51+", "51+", "51+", "51+", "…
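A sketch of a filter like the one above, on a stand-in data frame (toy, a hypothetical name) with rows taken from the printed outputs. Whether the original code included the endpoints 50 and 60 is not shown, so the inclusive comparison here is an assumption.

```r
library(dplyr)

# Stand-in rows: age and expenditures values as they appear in the printed outputs
toy <- data.frame(
  age = c(17, 37, 3, 51, 60, 56),
  expenditures = c(2113, 41924, 1454, 54267, 57702, 48215)
)

# Conditions separated by commas are combined with AND
toy_50s <- toy %>%
  filter(age >= 50, age <= 60)

toy_50s
```

The same rows could also be kept with filter(between(age, 50, 60)), since dplyr's between() is inclusive on both ends.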
select(): keep or drop columns using their names and types
I want to only have age and expenditures in my data frame.
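A minimal sketch of both ways of selecting, by name and by type, on a stand-in data frame (toy, a hypothetical name) built from the first rows of the glimpse output. The where() helper assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Stand-in for dds.discr: first three rows as shown in the glimpse output
toy <- data.frame(
  id = c(10210, 10409, 10486),
  age = c(17, 37, 3),
  gender = c("Female", "Male", "Male"),
  expenditures = c(2113, 41924, 1454)
)

# Keep columns by name
by_name <- toy %>% select(age, expenditures)

# Drop a column by name with a minus sign
no_id <- toy %>% select(-id)

# Keep columns by type
numeric_only <- toy %>% select(where(is.numeric))

names(by_name)
```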
dplyr resources
Additional details and examples are available in the dplyr vignettes and the dplyr 1.0.0 release blog posts.
You can check out Dr. Jessica Minnier’s R class page if you want more notes, videos, etc.
Just to name a few…
These notes were built from Mine’s notes
Most pages and code were left as she made them
I changed a few things to match our class
Please see her Github repository for the original notes
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
Age Group | Hypertension | No Hypertension |
---|---|---|
18-39 yrs | 8836 | 112206 |
40-59 yrs | 42109 | 88663 |
60+ yrs | 39917 | 21589 |
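The table above is not tidy, because the hypertension status is spread across two columns. As a sketch, it could be entered in R as a wide data frame like this. The name hyp_cont and the column names are taken from the pivot_longer() code and output elsewhere in these notes, and the age labels follow the "years" spelling used in that output.

```r
# Wide (untidy) version of the table: one row per age group,
# one column of counts per hypertension status
hyp_cont <- data.frame(
  Age_Group = c("18-39 years", "40-59 years", "60+ years"),
  Hypertension = c(8836, 42109, 39917),
  No_Hypertension = c(112206, 88663, 21589)
)

hyp_cont
```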
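pivot_*() functions
pivot_longer() to create tidy data (1/2)
Note that you won’t be required to use pivot_longer().
Here’s the original data frame:
We use data.frame() to make a data frame.
pivot_longer() to create tidy data (2/2)
We need to tell pivot_longer() which columns to pivot (cols), the name of the new column for the variable names (names_to), and the name of the new column for the values (values_to):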
library(tidyr)  # provides pivot_longer()

hyp_data1 = pivot_longer(
  data = hyp_cont,
  cols = -Age_Group,            # columns to pivot (all except Age_Group)
  names_to = "Hypertension",    # name of new column for variable names
  values_to = "Counts")         # name of new column for values
hyp_data1
# A tibble: 6 × 3
Age_Group Hypertension Counts
<chr> <chr> <dbl>
1 18-39 years Hypertension 8836
2 18-39 years No_Hypertension 112206
3 40-59 years Hypertension 42109
4 40-59 years No_Hypertension 88663
5 60+ years Hypertension 39917
6 60+ years No_Hypertension 21589
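pivot_wider() goes in the other direction. As a sketch (the name hyp_long is hypothetical, and the data are re-entered from the output above so the example runs on its own), the long table can be spread back into one column per hypertension status:

```r
library(tidyr)

# Long (tidy) data, copied from the pivot_longer() output above
hyp_long <- data.frame(
  Age_Group = rep(c("18-39 years", "40-59 years", "60+ years"), each = 2),
  Hypertension = rep(c("Hypertension", "No_Hypertension"), times = 3),
  Counts = c(8836, 112206, 42109, 88663, 39917, 21589)
)

# names_from: column whose values become the new column names
# values_from: column whose values fill those new columns
hyp_wide <- pivot_wider(
  data = hyp_long,
  names_from = Hypertension,
  values_from = Counts
)

hyp_wide
```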
# A tibble: 10 × 2
Age_Group Hypertension
<chr> <chr>
1 18-39 years Hypertension
2 18-39 years Hypertension
3 18-39 years Hypertension
4 18-39 years Hypertension
5 18-39 years Hypertension
6 18-39 years Hypertension
7 18-39 years Hypertension
8 18-39 years Hypertension
9 18-39 years Hypertension
10 18-39 years Hypertension