Muddy Points

Lesson 13: Model Selection

Modified

February 28, 2025

Muddy Points from 2025

1. I am a little stuck on model selection strategies. I just want clarification on which one is the “best one,” and why?

There is no “best one.” It really just depends on your research goals and preferences. Different situations may lead you to follow different model selection strategies. Each model selection strategy will guide you towards a “best model,” but that depends on what the model selection strategy prioritizes.

2. For fit statistics, what is the best range or number that is “accurate”?

There is no best range. The values of fit statistics depend on the sample you have, so a single value does not mean much on its own. Therefore, we only compare them relative to each other, between models that use the same data.

3. Just checking that with association, some common issues are too few variables, higher bias, and lower variance compared to prediction.

Yes! In association, we do not want to overcomplicate the model because that will mess with the interpretability of the model. Therefore, we lean towards fewer variables, which leads to more potential for bias and less variance in our coefficient estimates.

4. How do I actually choose which model to go with from the common model fit statistics?

I mostly go with the model with the best values (lower AIC and BIC, and higher \(R^2\)). A lot of the time, I use fit statistics when my model selection technique has landed me on a really complicated model. Maybe I did not expect a bunch of interactions in the model, but that was the final model. I might quickly compare it against the model without interactions (or with different interactions) to see how the model fit statistics differ.
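As a rough illustration of that quick comparison, here is a minimal sketch that computes AIC and BIC for two candidate linear models from their residual sums of squares. The sample size, the residual sums of squares, and the parameter counts are made-up numbers, and the formulas assume normal errors and drop a constant shared by both models, so the values will not match what your software prints (only the differences between models are comparable).

```python
import math

def gaussian_aic_bic(rss, n, k):
    """AIC and BIC for a linear model with normal errors, up to a shared constant.

    rss: residual sum of squares
    n:   sample size
    k:   number of estimated regression coefficients (including the intercept)
    """
    fit_term = n * math.log(rss / n)   # -2 * log-likelihood, up to a constant
    aic = fit_term + 2 * k             # AIC penalty: 2 per coefficient
    bic = fit_term + k * math.log(n)   # BIC penalty: log(n) per coefficient
    return aic, bic

# Hypothetical numbers for illustration only (not from any course dataset)
n = 200
candidates = {
    "main effects only": {"rss": 512.0, "k": 4},
    "with interactions": {"rss": 505.0, "k": 7},
}

results = {
    name: gaussian_aic_bic(m["rss"], n, m["k"]) for name, m in candidates.items()
}
for name, (aic, bic) in results.items():
    print(f"{name}: AIC = {aic:.1f}, BIC = {bic:.1f}")

# Lower is better for both statistics
best_by_aic = min(results, key=lambda name: results[name][0])
best_by_bic = min(results, key=lambda name: results[name][1])
print("Preferred by AIC:", best_by_aic)
print("Preferred by BIC:", best_by_bic)
```

With these made-up numbers, the interactions reduce the residual sum of squares only slightly, so both statistics prefer the simpler model once the penalty for extra coefficients is applied.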

Muddy Points 2024

1. In CIE, why does this assume a 10% difference indicates confounding and not effect measure modification?

A 10% difference can also indicate an interaction, but we would need to see the difference when we include the interaction. I think I was specifically talking about including variables, which typically means I am only including their main effects. With main effects, we can test for confounding, but not interactions. That’s all! We can separately test for interactions using the same 10% criterion.
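To make the 10% criterion concrete, here is a small sketch of the percent change calculation. The function names and coefficient values are hypothetical, and dividing by the estimate from the model that includes the candidate variable is an assumed convention (some references divide by the other estimate instead).

```python
def percent_change(beta_without, beta_with):
    """Percent change in a coefficient estimate when a candidate variable is added.

    beta_without: estimate from the model without the candidate variable
    beta_with:    estimate from the model with the candidate variable
    Assumed convention: the denominator is the estimate from the larger model.
    """
    return 100 * (beta_without - beta_with) / beta_with

def exceeds_cie_threshold(beta_without, beta_with, threshold=10.0):
    """True if the absolute percent change is larger than the threshold."""
    return abs(percent_change(beta_without, beta_with)) > threshold

# Hypothetical estimates for the coefficient of our main explanatory variable
change_a = percent_change(0.80, 0.68)   # candidate variable A added
change_b = percent_change(0.80, 0.78)   # candidate variable B added

print(f"Variable A: {change_a:.1f}% change, keep = {exceeds_cie_threshold(0.80, 0.68)}")
print(f"Variable B: {change_b:.1f}% change, keep = {exceeds_cie_threshold(0.80, 0.78)}")
```

In this made-up example, adding variable A shifts the estimate by more than 10%, so we would treat it as a confounder and keep it, while variable B barely moves the estimate. The same calculation can be repeated for a model with and without an interaction term.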

2. “Any variables not selected for the final model have still been adjusted for, since they had a chance to be in the model” How are they adjusted for when they aren’t?

I feel your confusion. These variables have not been explicitly adjusted for in the model; they have been adjusted for implicitly. If a variable is not selected, that means it does not predict our outcome well and/or it does not affect other variables’ relationship with the outcome. That means we tried to adjust for it, but it had no effect, so we have technically adjusted for something that does not change our model.

An example is probably best. Let’s say we’re looking at treatment effect. We measured the weather on the day that someone went in for treatment. We have a variable on whether it was cloudy or not. That variable was not selected in the model because the cloudiness has no effect on the treatment, but we allowed it to be a potential covariate. This is different than a variable that may not have an effect but was not measured. We cannot say we adjusted for something like shirt color during treatment because we haven’t actually tested it.

3. Difference between stepwise and change in estimate approach?

Stepwise is an automatic selection process that only requires us to put our dataset into a function, which returns the “optimal” model. It is also based only on the p-values of coefficients in the model. In CIE, we are manually including and excluding variables, and checking for a change in a coefficient estimate, instead of a significant p-value. A big change in a coefficient estimate is not necessarily accompanied by a significant p-value.

4. How does having fewer covariates cause a more biased estimate? (and what does it mean for \(\widehat\beta\) to be biased for \(\beta\)?)

Fewer covariates in our model means we are likely not capturing the complex relationship between our outcome and our variables. If we leave out a variable that is an important predictor of the outcome, then the coefficients of all the variables that made it into the model will be a little biased (because we are not capturing the true, underlying model).

For example, let’s say I am analyzing data for a study on dementia. Dementia is my outcome and I include a few variables in my model, such as whether or not you live with someone, depression, and physical activity. However, I leave out age, which is known to have high association with dementia. I have left out an important variable that may be a confounder or effect modifier of the variables in the model. Thus, the estimates of coefficients in the model will be biased.

The fewer variables in the model, the more likely we are leaving out a variable that would help predict our outcome. We can counter this by trying to select the best model!

Second part of the question: saying that \(\widehat\beta\) is a biased estimate for \(\beta\) means that the estimated value, \(\widehat\beta\), is systematically off from the true, underlying \(\beta\). If we could repeat the study many times, the estimates would not average out to the true value. We work under the assumption that there is some true relationship between our covariates and our outcome, and we are trying to uncover the true value by estimating it. However, our estimate may not be close to the true value. We can try to get it as close as possible given our research aims and model.
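A small simulation can make both parts of this answer concrete. The sketch below is an illustration under made-up assumptions, not an analysis of real data: `z` plays the role of an important variable like age, `x` is a covariate related to it, and the true coefficient for `x` is set to 1. We repeat the "study" many times and compare the average estimate for `x` when `z` is left out versus included.

```python
import random
import statistics

def simulate_once(rng, n=200):
    """Simulate one study; return (slope for x omitting z, slope for x adjusting for z)."""
    z = [rng.gauss(0, 1) for _ in range(n)]       # important variable (think: age)
    x = [zi + rng.gauss(0, 1) for zi in z]        # covariate of interest, related to z
    # True model (an assumption of this sketch): coefficient 1 for x and 2 for z
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

    def centered(values):
        mean = statistics.fmean(values)
        return [v - mean for v in values]

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    xc, zc, yc = centered(x), centered(z), centered(y)
    sxx, szz, sxz = dot(xc, xc), dot(zc, zc), dot(xc, zc)
    sxy, szy = dot(xc, yc), dot(zc, yc)

    # Simple linear regression of y on x only (z left out)
    slope_omitting_z = sxy / sxx
    # Least squares slope for x with z also in the model (2x2 normal equations)
    slope_adjusting_z = (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)
    return slope_omitting_z, slope_adjusting_z

rng = random.Random(512)  # fixed seed so the run is reproducible
draws = [simulate_once(rng) for _ in range(500)]
mean_omitting = statistics.fmean([d[0] for d in draws])
mean_adjusting = statistics.fmean([d[1] for d in draws])

print(f"True coefficient for x:          1.00")
print(f"Average estimate, z left out:    {mean_omitting:.2f}")
print(f"Average estimate, z in the model: {mean_adjusting:.2f}")
```

In this setup, the estimates from the model that includes `z` average out very close to the true value of 1, while the estimates from the model that leaves `z` out average close to 2. That gap between the average estimate and the true value is the bias.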

5. All the new approaches for model selection!

Just to be clear, the main intention for the overview was that you can identify and recognize some of the key characteristics of different model selection strategies. We can’t cover all of them in detail, but I just want you to know what’s out there, and what other people might use.