2025-02-24


Model Selection
Building a model
Selecting variables
Prediction vs interpretation
Comparing potential models
Model Fitting
Find best fit line
Using OLS in this class
Parameter estimation
Categorical covariates
Interactions
Model Evaluation
Model Use (Inference)
Model selection: picking the “best” model from a set of possible models
Model selection strategies: a process or framework that helps us pick our “best” model
First, let’s think about the number of observations in our dataset
Extreme example: In the Gapminder dataset, I can use an indicator for each country
Remember that each country is an observation
So we have a perfectly fit model - a covariate for each observation
But we cannot generalize this to any other countries
And we haven’t identified any meaningful relationships between life expectancy and other measured characteristics
More covariates in the model is not always better
Suppose we have \(p = 30\) covariates (in the true model) and n = 50 observations. We could consider the following two alternatives:
Recall mean square error is a function of SSE (sum of squared residuals)
\[ MSE = \dfrac{1}{n} \sum_{i=1}^{n} \big(Y_i - \widehat{Y}_i \big)^2 = \dfrac{1}{n} SSE \]
MSE can also be written as a function of the bias and variance
\[ MSE = \text{bias}\big(\widehat\beta\big)^2 + \text{variance}\big(\widehat\beta\big) \]
For the same data:
More covariates in model: less bias, more variance
Less covariates in model: more bias, less variance
Our goal: find a model with just the right amount of covariates to balance bias and vairance
“Because models always fall far short of the complex reality under study, there are no best or optimal strategies for modeling.”
Not all statistical texts provide practical advice on model development
Model development strategy should align with research goals
Association / Explanatory / One variable’s effect
Goal: Understand one variable’s (or a group of variable’s) effect on the response after adjusting for other factors
Mainly interpret the coefficient of the variable that is the focus of the study
Any variables not selected for the final model have still been adjusted for, since they had a chance to be in the model
Example: How is body mass of a penguin associated with flipper length?
Prediction
Goal: to calculate the most precise prediction of the response variable
Interpreting coefficients is not important
Choose only the variables that are strong predictors of the response variable
Example: What is the flipper length of a penguin with body mass of 3000 g (and all its other characteristics)?
More information on the two analysis goals:
If you ever get the chance, check out Dr. Kristin Sainani’s series on Statistics
Association / Explanatory / One variable’s effect
Selection of potential models is tied more with the research context with some incorporation of prediction scores
Pre-specification of multivariable model
Purposeful model selection
Change in Estimate (CIE) approaches
Prediction
Selection of potential models is fully dependent on prediction scores
Automated strategies
For categorical outcomes, there are more prediction model selection strategies (will learn more in BSTA 513)
In a clinical trial, we often have to write and finalize a statistical analysis plan (SAP) before the trial starts
If we wish to compare treatment effects adjusted for covariates, all covariates typically specified in advance
Example: Comparing effectiveness of 3-drug vs. 2-drug regimen for delaying AIDS onset or death. Covariates such as severity of HIV infection at baseline would have been specified in advance.
Variables such as study site, as well as any randomization stratification variables are common covariates.
Partly to make sure that we are not adding in variables that make our regimen significant
In these cases, only a limited number of multivariable models are fit and reported
Can use this type of model selection for any type of regression
Careful, well-thought out variable selection process
Often a reasonable strategy, especially in epidemiology and more exploratory clinical studies
CIE strategies select covariates on the basis of how much their control changes exposure effect estimates
What estimate?
What magnitude change is ”important”?
One must choose an effect measure to judge change importance, where “importance” needs to be evaluated along a contextually meaningful scale
Accurate assessment of confounding may require examining changes from removing entire sets of covariates
This is an incredibly common approach that statisticians use, often because it is an older and more recognized method
Major disadvantages to stepwise selection:
Predictors/covariates are added or removed one at time if they are below a certain threshold (usually p-value below 0.10 to 0.20)
I will introduce two of the approaches so that you understand the general process if a collaborator ever mentions stepwise selection
Forward selection:
For \(p\) covariates potential covariates, run all simple linear regressions:
Now run \(Y= \beta_0 + \beta_1 X_i + \beta_2 X_1 + \epsilon\) through \(Y= \beta_0 + \beta_1 X_i + \beta_2 X_{p} + \epsilon\) and enter the next \(X_j\) with the lowest p-value
Continue process until no more predictors come back with a p-value below the threshold
Backward selection:
I don’t see this approach very often
Quite literally making subsets of the data and using the “best” one
General steps:
Major disadvantages to best subset:
| LASSO (Least About Shrinkage and Selection Operator) | Ridge | Elastic Net | |
| Penalization | L-1 Norm, uses absolute value | L-2 Norm, uses squared value | Best of both worlds, L-1 and L-2 used |
| Pro’s | Reduces overfitting, will shrink coefficient to zero | Reduces overfitting, handles collinearity, can handle k>n | Reduces overfitting, handles collinearity, handles k>n, shrinks coefficients to zero |
| Con’s | Cannot handle k>n, doesn’t handle multicollinearity well | Does not shrink coefficients to zero, difficult to interpret | More difficult for R to do than the other two (but not really that bad) |
So far we have compared models using the F-test
The F-test is a great way to compare models that are nested
What if we want to compare models that are not nested?
There is a special group of fit statistics that can help us compare models
Note: these are sometimes used in the model building process (within one strategy)
Helpful if we want to compare selected models across strategies
Helpful if we have a few “final” models with different covariates that we want to compare
The following model fit statistics combine information about the SSE, the number of parameters in the model, and the sample size
For Mallow’s Cp, AIC, and BIC: smaller values indicate better model fit!
For Adjusted R-squared: larger values indicate better model fit!
| Fit statistic | Equation | R code |
|---|---|---|
| R-squared / Adjusted R-squared | \(Adj. R^2 = 1 - \frac{SSE/(n-p-1)}{SSY/(n-1)}\) | Within summary(model_name) |
| Mallow’s \(C_p\) | \(C_p = \Bigg[ \dfrac{\widehat\sigma^2_p}{\widehat\sigma^2_{max}} - 1 \Bigg](n-p) + p\) | ols_mallows_cp() |
| Akaike information criterion (AIC) | \(AIC = n\log(SSE) - n \log(n) + 2(p+1)\) | AIC(model_name) |
| Bayesian information criterion (BIC) | \(BIC = n\log(SSE) - n\log(n) + log(n)\cdot(p+1)\) | BIC(model_name) |
There is no hypothesis testing for these fit statistics
Only helpful if you are comparing models
Works for nested and non-nested models
Common to report all or some of them
All of the fit statistics will not necessarily reach a consensus about the best fitting model
https://www.researchgate.net/figure/Model-Fit-Statistics_tbl1_308844501
library(olsrr)
model1 = gapm_sub %>% lm(formula = LifeExpectancyYrs ~ FemaleLiteracyRate)
model2 = gapm_sub %>% lm(formula = LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD)
model3 = gapm_sub %>% lm(formula = LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD +
income_levels + four_regions)
fit_stats = data.frame(Model = c("Model 1: FLR only", "Model 2: FLR + FS", "Model 3: FLR + FS + Income + WR"),
adj_R_2 = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared,
summary(model3)$adj.r.squared),
Cp = c(ols_mallows_cp(model1, model3), ols_mallows_cp(model2, model3),
ols_mallows_cp(model3, model3)),
AIC = c(AIC(model1), AIC(model2), AIC(model3)),
BIC = c(BIC(model1), BIC(model2), BIC(model3)))
colnames(fit_stats) = c("Model", "Adj. R-squared", "Mallow's Cp", "AIC", "BIC")
fit_stats %>% gt() %>% tab_options(table.font.size = 40) %>% fmt_number(decimals = 3)| Model | Adj. R-squared | Mallow's Cp | AIC | BIC |
|---|---|---|---|---|
| Model 1: FLR only | 0.413 | 88.126 | 470.066 | 476.896 |
| Model 2: FLR + FS | 0.550 | 51.942 | 451.872 | 460.979 |
| Model 3: FLR + FS + Income + WR | 0.737 | 9.000 | 418.724 | 441.491 |
Lesson 13: Model selection