With an emphasis on prediction
2025-05-14
Understand the place of LASSO regression within association and prediction modeling for binary outcomes.
Recognize the process for tidymodels
Understand how penalized regression is a form of model/variable selection.
Perform LASSO regression on a dataset using R and the general process for classification methods.
Model selection: picking the “best” model from a set of possible models
Models will have the same outcome, but typically differ by the covariates that are included, their transformations, and their interactions
“Best” model is defined by the research question and by how you want to answer it!
Model selection strategies: a process or framework that helps us pick our “best” model
Recall from 512/612: MSE can be written as a function of the bias and variance
\[ MSE = \text{bias}\big(\widehat\beta\big)^2 + \text{variance}\big(\widehat\beta\big) \]
For the same data:
More covariates in the model: less bias, more variance
Fewer covariates in the model: more bias, less variance
From Data Science in a Box:
Association / Explanatory / One variable’s effect
Goal: Understand one variable’s (or a group of variables’) effect on the response after adjusting for other factors
Mainly interpret odds ratios of the variable that is the focus of the study
Prediction
Goal: to calculate the most precise prediction of the response variable
Interpreting coefficients is not important
Choose only the variables that are strong predictors of the response variable
Association / Explanatory / One variable’s effect
Pre-specification of multivariable model
Purposeful model selection
Change in Estimate (CIE) approaches
Prediction
We CAN use purposeful selection from last quarter in any type of generalized linear model (GLM)
The best documented information on purposeful selection is in the Hosmer-Lemeshow textbook on logistic regression
Purposeful selection starts on page 89 (or page 101 in the pdf)
I will not discuss purposeful selection in this course
Classification: process of predicting categorical responses/outcomes
Note: we’ve already done a lot of work around predicting probabilities within logistic regression
Common classification methods (a good site with a brief explanation of each)
Prediction depends on type of variable/model selection!
So the big question is: how do we select this model??
tidymodels
tidymodels is a great package when we are performing prediction
tidymodels syntax dictates that we need to define: a model specification, a recipe (the formula plus preprocessing), and a workflow that combines them
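As a minimal sketch of those pieces (the toy data and object names here are my own, not from the slides):

```r
library(tidymodels)

# 1. A model specification: the type of model and the engine that fits it
log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# 2. A recipe: the model formula plus any preprocessing steps
toy <- tibble(
  y = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  x = c(1, 2, 3, 4, 5, 6)
)
toy_rec <- recipe(y ~ x, data = toy)

# 3. A workflow bundling the model and recipe, which we then fit
toy_fit <- workflow() %>%
  add_model(log_spec) %>%
  add_recipe(toy_rec) %>%
  fit(data = toy)
```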
tidymodels with GLOW
To fit our logistic regression model with the interaction between age and prior fracture, we use:
step_dummy()
step_interact()
glm()
tidymodels with GLOW: Results
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | −1.376 | 0.134 | −10.270 | 0.000 | −1.646 | −1.120 |
| age_c | 0.063 | 0.015 | 4.043 | 0.000 | 0.032 | 0.093 |
| priorfrac_Yes | 1.002 | 0.240 | 4.184 | 0.000 | 0.530 | 1.471 |
| age_c_x_priorfrac_Yes | −0.057 | 0.025 | −2.294 | 0.022 | −0.107 | −0.008 |
tidy(glow_m3, conf.int = T) %>% gt() %>%
  tab_options(table.font.size = 35) %>%
  fmt_number(decimals = 3)

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | −1.376 | 0.134 | −10.270 | 0.000 | −1.646 | −1.120 |
| priorfracYes | 1.002 | 0.240 | 4.184 | 0.000 | 0.530 | 1.471 |
| age_c | 0.063 | 0.015 | 4.043 | 0.000 | 0.032 | 0.093 |
| priorfracYes:age_c | −0.057 | 0.025 | −2.294 | 0.022 | −0.107 | −0.008 |
Interaction model: \[\begin{aligned} \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = \widehat\beta_0 &+ &\widehat\beta_1\cdot I(\text{PF}) & + &\widehat\beta_2\cdot Age& + &\widehat\beta_3 \cdot I(\text{PF}) \cdot Age \\ \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = -1.376 &+ &1.002\cdot I(\text{PF})& + &0.063\cdot Age& -&0.057 \cdot I(\text{PF}) \cdot Age \end{aligned}\]
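To make the fitted equation concrete, we can hand-compute a prediction. As an example (my own, not from the slides): a woman with a prior fracture whose age is 10 years above the centering value (age_c = 10):

```r
# Coefficients from the fitted interaction model above
b0 <- -1.376   # intercept
b1 <-  1.002   # prior fracture, I(PF)
b2 <-  0.063   # centered age
b3 <- -0.057   # interaction I(PF) * age_c

pf <- 1; age_c <- 10   # hypothetical covariate values
logit_hat <- b0 + b1*pf + b2*age_c + b3*pf*age_c
p_hat <- plogis(logit_hat)   # inverse logit -> predicted probability
round(logit_hat, 3)  # -0.314
round(p_hat, 3)      # 0.422
```

Note how the negative interaction coefficient shrinks the prior-fracture effect as age increases.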
Basic idea: we are still running regression, but now we want to incentivize our model fit to have fewer predictors
We need a tuning parameter, called lambda (\(\lambda\)), that determines the amount of shrinkage
Main difference is the type of penalty used
Ridge regression
Penalty called the L2 norm: uses the squared values of the coefficients
Pros: handles correlated predictors well by shrinking them together
Cons: coefficients shrink toward zero but never reach exactly zero, so no variables are removed
Lasso regression
Penalty called the L1 norm: uses the absolute values of the coefficients
Pros: shrinks some coefficients to exactly zero, performing variable selection
Cons: with highly correlated predictors it tends to keep one and drop the rest arbitrarily
Elastic net regression
L1 and L2 penalties both used, best of both worlds
Pros: variable selection plus more stable handling of correlated predictors
Cons: two tuning parameters to choose instead of one
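In symbols (a sketch, writing \(\beta_j\) for the coefficients, \(\lambda\) for the shrinkage penalty, and \(\alpha\) for the mixing weight), the three penalties added to the negative log-likelihood are:

\[\begin{aligned} \text{Ridge (L2):} &\quad \lambda \sum_{j=1}^p \beta_j^2 \\ \text{Lasso (L1):} &\quad \lambda \sum_{j=1}^p |\beta_j| \\ \text{Elastic net:} &\quad \lambda \left[ (1-\alpha)\tfrac{1}{2}\sum_{j=1}^p \beta_j^2 + \alpha \sum_{j=1}^p |\beta_j| \right] \end{aligned}\]

In glmnet/tidymodels, \(\alpha\) corresponds to the mixture argument we set below.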
Perform our classification method on the training set
Risk factor/variable of interest: history of prior fracture (PRIORFRAC: 0 or 1)
Potential confounder or effect modifier: age (AGE, a continuous variable)
Crossed out because we are no longer attached to specific predictors and their association with fracture
Training: act of creating our prediction model based on our observed data
When we use data to create a prediction model, we want to test our prediction model on new data

Training set
Testing set
When splitting data, we need to be conscious of the proportions of our outcomes
Is there imbalance within our outcome?
We want to randomly select observations but make sure the proportions of No and Yes stay the same
We stratify by the outcome, meaning we pick Yes’s and No’s separately for the training set
Side note: took out bmi and weight because we have multicollinearity issues
From package rsample within tidyverse, we can use initial_split() to create training and testing data
strata to stratify by fracture
prop to set the proportion of training data
<Training/Testing/Total>
<400/100/500>
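A sketch of that split (the data-frame name glow and the seed are my assumptions; with the real GLOW data you would skip the simulated stand-in):

```r
library(rsample)
set.seed(123)  # hypothetical seed, for reproducibility

# Toy stand-in for the GLOW data: 500 rows, 25% fractures
glow <- data.frame(
  fracture = factor(rep(c("No", "Yes"), times = c(375, 125))),
  age_c = rnorm(500)
)

# 80% training, stratified so the Yes/No proportions are preserved
glow_split <- initial_split(glow, prop = 0.8, strata = fracture)
glow_train <- training(glow_split)
glow_test  <- testing(glow_split)
glow_split  # prints <Training/Testing/Total>, e.g. <400/100/500>
```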
Rows: 400
Columns: 10
$ priorfrac <fct> No, No, Yes, No, No, Yes, No, Yes, Yes, No, No, No, No, No, …
$ height <int> 158, 160, 157, 160, 152, 161, 150, 153, 156, 166, 153, 160, …
$ premeno <fct> No, No, No, No, No, No, No, No, No, No, No, Yes, No, No, No,…
$ momfrac <fct> No, No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No…
$ armassist <fct> No, No, Yes, No, No, No, No, No, No, No, No, No, Yes, No, No…
$ smoke <fct> No, No, No, No, No, Yes, No, No, No, No, Yes, No, No, No, No…
$ raterisk <fct> Same, Same, Less, Less, Same, Same, Less, Same, Same, Less, …
$ fracscore <int> 1, 2, 11, 5, 1, 4, 6, 7, 7, 0, 4, 1, 4, 2, 2, 7, 2, 1, 4, 5,…
$ fracture <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ age_c <dbl> -7, -4, 19, 13, -8, -2, 15, 13, 17, -11, -2, -5, -1, -2, 0, …
Rows: 100
Columns: 10
$ priorfrac <fct> No, No, No, No, No, No, No, No, Yes, Yes, No, No, No, No, No…
$ height <int> 167, 162, 165, 158, 153, 170, 154, 171, 142, 152, 166, 154, …
$ premeno <fct> No, No, No, Yes, No, Yes, Yes, Yes, Yes, No, No, No, No, No,…
$ momfrac <fct> No, No, No, No, No, Yes, No, No, Yes, No, No, No, No, No, No…
$ armassist <fct> Yes, No, Yes, No, Yes, No, Yes, No, No, No, No, No, No, No, …
$ smoke <fct> Yes, Yes, No, No, No, No, No, No, No, No, No, No, No, No, No…
$ raterisk <fct> Same, Less, Less, Greater, Same, Same, Same, Same, Same, Sam…
$ fracscore <int> 3, 1, 5, 1, 8, 3, 7, 1, 6, 7, 0, 2, 0, 0, 1, 2, 2, 8, 4, 3, …
$ fracture <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, …
$ age_c <dbl> -13, -10, 3, -8, 17, 0, 6, -5, 1, 17, -11, -6, -10, -12, -6,…
Using Lasso penalized regression!
We can simply set up a penalized regression model
glmnet takes the basic fitting of glm and adds penalties!
Within tidymodels we set an engine that will fit the model
The mixture option lets us pick the penalty:
mixture = 0 for Ridge regression
mixture = 1 for Lasso regression
0 < mixture < 1 for Elastic net regression
We include all predictors in the model formula (fracture ~ .)
step_dummy() so R identifies categorical variables
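Putting those pieces together, a hedged sketch of the fit (toy stand-in data with only two covariates; the object name glow_fit_main matches the slides, the rest are my choices):

```r
library(tidymodels)
set.seed(2025)

# Toy stand-in for the GLOW training data (the real data has more covariates)
glow_train <- tibble(
  fracture  = factor(sample(c("No", "Yes"), 400, replace = TRUE, prob = c(0.75, 0.25))),
  age_c     = rnorm(400, 0, 8),
  priorfrac = factor(sample(c("No", "Yes"), 400, replace = TRUE, prob = c(0.75, 0.25)))
)

# Model specification: Lasso (mixture = 1) at a fixed penalty, fit with glmnet
lasso_spec <- logistic_reg(penalty = 0.002, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Recipe: all predictors (fracture ~ .), with dummies for categorical variables
lasso_rec <- recipe(fracture ~ ., data = glow_train) %>%
  step_dummy(all_nominal_predictors())

glow_fit_main <- workflow() %>%
  add_model(lasso_spec) %>%
  add_recipe(lasso_rec) %>%
  fit(data = glow_train)

tidy(glow_fit_main)  # coefficients at penalty = 0.002
```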
library(vip)
vi_data_main = glow_fit_main %>%
  pull_workflow_fit() %>%
  vi(lambda = 0.002) %>% # vi: variable importance
  filter(Importance != 0)
vi_data_main

# A tibble: 9 × 3
Variable Importance Sign
<chr> <dbl> <chr>
1 raterisk_Greater 0.535 POS
2 momfrac_Yes 0.526 POS
3 priorfrac_Yes 0.485 POS
4 raterisk_Same 0.413 POS
5 smoke_Yes 0.344 NEG
6 premeno_Yes 0.267 POS
7 fracscore 0.196 POS
8 armassist_Yes 0.138 POS
9 height 0.0370 NEG
glow_fit_main %>% tidy() %>% gt() %>%
  tab_options(table.font.size = 35) %>%
  fmt_number(decimals = 3)

| term | estimate | penalty |
|---|---|---|
| (Intercept) | 3.417 | 0.002 |
| height | −0.037 | 0.002 |
| fracscore | 0.196 | 0.002 |
| age_c | 0.000 | 0.002 |
| priorfrac_Yes | 0.485 | 0.002 |
| premeno_Yes | 0.267 | 0.002 |
| momfrac_Yes | 0.526 | 0.002 |
| armassist_Yes | 0.138 | 0.002 |
| smoke_Yes | −0.344 | 0.002 |
| raterisk_Same | 0.413 | 0.002 |
| raterisk_Greater | 0.535 | 0.002 |
# A tibble: 1 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.672
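The AUC above comes from yardstick’s roc_auc(). A self-contained sketch of the call (the predicted probabilities here are toy numbers of my own, not the GLOW results):

```r
library(yardstick)
library(tibble)

# Toy test-set predictions: the truth plus the predicted probability of "No"
# (yardstick treats the first factor level as the event by default)
preds <- tibble(
  fracture = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  .pred_No = c(0.9, 0.8, 0.6, 0.3, 0.4, 0.7)
)
roc_auc(preds, truth = fracture, .pred_No)
```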
Why is this AUC worse than the one we saw with prior fracture, age, and their interaction?
Prevents overfitting to one set of training data
Split data into folds that train and validate model selection
Basically subsection of training and testing (called validating) before truly testing on our original testing set
Use a tuning parameter for our penalty
Basically, we need to figure out what the best penalty is for our model
We use the training set to determine the best penalty
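A hedged sketch of tuning \(\lambda\) with 10-fold cross-validation (toy stand-in data again; the grid size, seed, and object names are my choices):

```r
library(tidymodels)
set.seed(2025)

# Toy stand-in for the training data
glow_train <- tibble(
  fracture  = factor(sample(c("No", "Yes"), 400, replace = TRUE, prob = c(0.75, 0.25))),
  age_c     = rnorm(400, 0, 8),
  priorfrac = factor(sample(c("No", "Yes"), 400, replace = TRUE, prob = c(0.75, 0.25)))
)

# Folds stratified by the outcome, like the original train/test split
glow_folds <- vfold_cv(glow_train, v = 10, strata = fracture)

# Same Lasso specification, but the penalty is left to be tuned
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

tune_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_recipe(recipe(fracture ~ ., data = glow_train) %>%
               step_dummy(all_nominal_predictors()))

# Try 30 penalty values on a log scale; keep the one with the best CV AUC
penalty_grid <- grid_regular(penalty(), levels = 30)
tune_res <- tune_grid(tune_wf, resamples = glow_folds, grid = penalty_grid)
select_best(tune_res, metric = "roc_auc")
```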
Videos that include the tuning parameter with LASSO:
Performing cross-validation
For a complete video of machine learning with LASSO, cross-validation, and tuning parameters:
See “Unit 5 - Deck 4: Machine learning” on this Data Science in a Box page
You can use purposeful selection, like we did last quarter
If you want to focus on association modeling!
But you will need to include at least one interaction!!
A good way to practice this again if you struggled with it previously
You can try out LASSO regression
Lesson 14: Model Building