Lab 4 Instructions

BSTA 512/612

Author

Nicky Wakim

Modified

March 11, 2026

Caution

Ready to work on! (Nicky 2/19/26)

1 Directions

Please turn in your .html file on Sakai. Please let me know if you greatly prefer to submit a physical copy.

You can download the .qmd file for this lab here. Please use the linked qmd file and not this one! (This is specifically the instructions.)

The rest of this lab’s instructions are embedded into the lab activities.

1.1 Purpose

The main purpose of this lab is to perform model selection through purposeful selection, identify one or more potential final models, and start our interpretation of our main relationship. We will walk through the steps of purposeful model selection.

1.2 Grading

This lab is graded out of ___ points. Each lab will follow the specific rubric on the Project page.

2 Lab activities

Before starting this lab, make sure you load your dataset from Lab 3.

2.1 Restate research question

ImportantTask

Please restate your research question from Lab 1.

2.2 Step 1: Simple linear regressions / analysis

We have done most of this step through visualizations in Labs 2 and 3. Now, we will quickly run a simple linear regression model for each covariate against the IAT score (outcome). Remember, the goal is to see whether each covariate explains enough variation in the outcome, IAT score. You should have 15 simple linear regression models and their results: one SLR model per variable (the main variable, the demographic variables, the race/ethnicity indicators, and the additional variables). Results include the F-statistic and p-value from the test of whether each covariate explains enough variation in the outcome. Please revisit the slides from Lesson 5 (SLR: More inference + Evaluation) for more help with this test.

Here are two options for running these models and displaying their results:

  1. Option 1 (longer, but uses code we are more familiar with): We can run lm() for each covariate in separate lines of code, and use something like summary() or anova() to look at the results of each. (More time consuming to write, but less complicated coding)
  2. Option 2 (shorter, but using newer code): We can use add1() to add each variable at a time to a model with just the intercept. The output will give us a table with the F-statistic and p-value for each variable.
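As a sketch of Option 2 (assuming your data frame is named `iat` with outcome `iat_score`; the covariate names below are placeholders, so swap in your own):

```r
# Option 2 sketch: F-test for each covariate added alone to an
# intercept-only model. Covariate names here are placeholders.
int_only = lm(iat_score ~ 1, data = iat)           # intercept-only model
add1(int_only,
     scope = ~ age + gender_f + edu_f + pol_id_f,  # candidate covariates
     test  = "F")                                  # F-statistic and p-value per variable
```

The output is one table with a row per candidate variable, which is much easier to scan than 15 separate `summary()` printouts.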
ImportantTasks
  1. Run a simple linear regression model for each covariate against the IAT score (outcome).
  2. Display the results from the test of whether each covariate explains enough variation in the outcome. These may come from either option in the instructions: summary()/anova() for each model, or the output from add1().

Decisions based on this output will be in the next step.

2.3 Step 2: Preliminary variable selection

Using the previous p-values from the F-test on each covariate’s SLR, decide which covariates will be included in the initial model. Recall the decision rule: we keep covariates that explain enough variation using p-value < 0.25. Note that because our sample size is so large, the p-values might be really small. For now, that’s okay, but this means we may want to alter our Step 3 a little bit.

Once you have decided on the covariates, run the multiple linear regression model and display the regression table.

ImportantTasks
  1. Decide which covariates will be included in the initial model and list them.
  2. Run the initial model and display the regression table.

You do not need to write out the model, but you may do so in addition to the list.
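If you want a clean regression table for the initial model, one option is gtsummary's tbl_regression(). A minimal sketch (the covariates listed are placeholders; use the list you just decided on):

```r
library(gtsummary)

# Fit the initial multiple linear regression with the kept covariates
# (placeholder covariate list; substitute your own)
init_model = lm(iat_score ~ iam_f + age + gender_f + edu_f, data = iat)

# Formatted regression table with estimates and confidence intervals
tbl_regression(init_model)
```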

2.4 Step 3: Assess change in coefficient

Now that all the selected variables are in one initial model, we can start considering the effect of each variable (outside of our main research question).

Remember our general rule: We can remove a variable if (1) p-value > 0.05 for the F-test to include or exclude the variable and (2) change in coefficient (\(\Delta\%\)) of our explanatory variable is < 10%. Please remember that the p-values for the F-test for a multi-level categorical variable must be calculated by creating a reduced and full model.
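As a sketch of that reduced-vs-full F-test for a multi-level categorical variable (the formulas are placeholders; use your own initial model, dropping the variable you are testing):

```r
# Full model, and a reduced model that drops the multi-level covariate edu_f
full_mod = lm(iat_score ~ iam_f + edu_f + age, data = iat)
red_mod  = lm(iat_score ~ iam_f + age,         data = iat)

# Partial F-test: do the edu_f indicator terms jointly improve the fit?
anova(red_mod, full_mod)
```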

It might be helpful to copy your list of covariates here and make note of the ones that you are removing. It was hard for me to keep track of all the variables when our dataset contains sooo many categorical covariates, and the regression table is so long.

Since our sample size is quite large, most (if not all) of the F-tests will conclude that the variable should be kept in the model. At this point, I advise that you turn to some common sense and the change in coefficients.

  1. For common sense, you may notice that some of your covariates are essentially measuring the same thing. If there is clinical/social relevance to having both in the model, then keep them in; if not, you will have to decide which is more interpretable/relevant/aligned with your research question. For example, if you chose variables involving attitudes and beliefs that measure similar things, then you might exclude one. There are measurements like “I am …” with relative weight groups and “Compared to most …” with relative weight groups. These two might capture a lot of the same information, so we may choose just one. (Additionally, keeping both might create issues with multicollinearity, which we will discuss on the last day, so keep that in mind!)

  2. For change in coefficients, focus on the variable of your research question. Does the removal of variables change the coefficients for your main independent variable? Remember what we discussed about change in coefficients when our explanatory variable is a multi-level categorical variable (if any category indicators change by more than 10%, then keep the variable). You may find these changes small, which tracks with a lot of our plots in Lab 3. Nothing seemed to have a big effect on IAT score, so as a consequence it is hard to see big changes from a potential confounder.

Note that I put common sense first. The change in coefficients may not be very large, and may lead you to think we don’t need a lot of the variables in our model. However, I would let common sense override the change in coefficients if your reasoning is well justified.

psst… There might be some code in Step 4 that might help you get started in this step.

ImportantTasks

Remove variables from the initial model based on your common sense, change in coefficient, and/or p-values of the F-tests.

If more than 50% of the coefficients for your main variable change by more than 10% when you remove a variable, then you should keep that variable in the model.

You do NOT need to show all your work here. You just need to include:

  1. A brief explanation of what variables were dropped and why (a sentence per variable), and
  2. An example of your process with one variable (including the code you ran) is enough

2.5 Step 4: Assess scale for continuous variables

There is one continuous variable in our model: age. We need to assess the scale for age. This step has zero deliverables. To save you time, I will walk you through my thought process and why I determined that age is fine as is. If you still want to try something else out with age, you can!

First, we can start with a scatterplot of IAT score and age. Your plot may look a little different than mine.

ggplot(data = iat, aes(x = age, y = iat_score)) +
  geom_point(size = 0.8) + geom_smooth() + xlim(0, 111)

In the above scatterplot, it looks like the relationship is mostly linear (and increasing) until we get to approximately 80 years old; after that, IAT score decreases with age. Let’s say we 100% believe there are suddenly more respondents around 100-110 years old than 90-100 years old. I’m already skeptical of this, since we did quality control in Lab 3, but we’ll play it out because it’s not worth making judgment calls on what we consider “admissible” data.

We could do quantiles, splines, or polynomials, but those approaches will either make more categorical variables or make the relationship between age and IAT score harder to interpret. We have a pretty linear relationship up until the higher ages!
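For reference, here is a sketch of what a polynomial or spline version would look like (the df value is an arbitrary choice, and the covariate list is abbreviated):

```r
# Quadratic age term (orthogonal polynomial)
lm(iat_score ~ iam_f + poly(age, 2), data = iat)

# Natural cubic spline for age (splines ships with base R)
library(splines)
lm(iat_score ~ iam_f + ns(age, df = 3), data = iat)
```

Both make the age coefficient harder to interpret directly, which is part of why leaving age linear is attractive here.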

I wanted to investigate the linearity a little more, so I created an indicator for individuals who are over 100 years old:

iat1 = iat %>% mutate(ind_age_100 = ifelse(age > 100, "TRUE", "FALSE"))

Now I can see if the linearity differs between the two groups of ages:

ggplot(iat1, aes(x = age, y = iat_score, color = ind_age_100)) +
  geom_point(size = 0.8) + geom_smooth(method = "lm")

Looks like the slope is positive for the lower ages and negative for the higher ages… I can test to see if including the indicator makes a big enough change in the coefficients for my main variable.

Here’s the model without the indicator:

prelim_model = iat1 %>% 
  lm(
    formula = iat_score ~ 
      iam_f + 
      re_asian + re_black + re_hispanic + re_me + 
      re_native + re_other + re_pacific + re_white +
      gender_f +
      trans_f +
      pol_id_f +
      control_other_f + 
      edu_f + 
      age
  )

And we’ll take a look at the coefficients for the model:

prelim_model$coefficients[2:7] # print only the coefficients for the main variable (iam_f)
             iam_fModerately overweight                iam_fSlightly overweight 
                             0.10143967                              0.11541675 
iam_fNeither underweight nor overweight               iam_fSlightly underweight 
                             0.07212563                              0.12709618 
            iam_fModerately underweight                   iam_fVery underweight 
                            -0.06070197                              0.09556746 

Then we can run the model with the indicator, then look at the coefficients:

prelim_model2 = iat1 %>% 
  lm(
    formula = iat_score ~ 
      iam_f + 
      re_asian + re_black + re_hispanic + re_me + 
      re_native + re_other + re_pacific + re_white +
      gender_f +
      trans_f +
      pol_id_f +
      control_other_f + 
      edu_f + 
      age +
      ind_age_100
  )
prelim_model2$coefficients[2:7] # print only the coefficients for the main variable (iam_f)
             iam_fModerately overweight                iam_fSlightly overweight 
                             0.10499961                              0.11877881 
iam_fNeither underweight nor overweight               iam_fSlightly underweight 
                             0.07350735                              0.13089165 
            iam_fModerately underweight                   iam_fVery underweight 
                            -0.06120410                              0.10213060 

We can check the % change in the coefficients between the models.

Recall, \[ \Delta\% = 100\% \cdot \frac{\widehat\beta_{\text{full}} - \widehat\beta_{\text{red}}}{\widehat\beta_{\text{full}}} \] Here’s how I quickly compute it with the coefficients:

100 * ( prelim_model2$coefficients[2:7] - prelim_model$coefficients[2:7] ) /
  prelim_model2$coefficients[2:7]
             iam_fModerately overweight                iam_fSlightly overweight 
                              3.3904289                               2.8305179 
iam_fNeither underweight nor overweight               iam_fSlightly underweight 
                              1.8797068                               2.8997036 
            iam_fModerately underweight                   iam_fVery underweight 
                              0.8204111                               6.4262228 

Based on the percentages above, it doesn’t look like the indicator makes much of a difference in my model. This is likely because there are only 190 individuals over the age of 100 and 113,822 individuals under (in my dataset). Those 190 individuals will not have a big impact on the linear relationship between age and IAT score, even though the first smoothed scatterplot made it look like they do.

To bring this point home, I can plot age and IAT with and without the individuals that are 100 years or older. Let me know if you find a better way to overlay these plots! (I have been a little stressed on time, and couldn’t find a quick answer.)

ggplot() +
  geom_point(data = iat1, aes(x = age, y = iat_score)) + 
  geom_smooth(data = iat1, aes(x = age, y = iat_score), method = "lm", color = "red") + 
  geom_smooth(data = iat1 %>% filter(age < 100), aes(x = age, y = iat_score), method = "lm", color = "blue") 

The red line is the best-fit line for all the data, and the blue line is only for individuals under 100 years old. The difference is not much… Thus, I think it’s okay to leave age as is!

ImportantTasks

No tasks here! If you want to try out what I did above, you can!

2.6 Step 5: Check for interactions

Now we’re going to check if there are any interactions. We will use add1() as described in Lesson 14 (in Step 5). From the F-statistic and p-value, we can determine if the interaction should be included in the model.
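For example, a sketch of testing one interaction with add1() (the interaction term below is a placeholder; use the covariate from your subquestion):

```r
# F-test for adding an interaction between the main variable and a covariate
# to the preliminary main-effects model
add1(prelim_model,
     scope = ~ . + iam_f:gender_f,  # placeholder interaction term
     test  = "F")
```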

Note

I’ve made a change to the Task below!!! I originally said to check for interactions between the main variable and each covariate, but that is overwhelming. If the interaction from your subquestion is significant, then you should include that one, but you do not need to include every significant interaction.

ImportantTasks
  1. Using add1(), determine which interactions are significant. If the interaction from your subquestion is significant, add it to your preliminary model.
  2. Run the preliminary final model that includes the main effects and potential interaction.

2.7 Step 6: Assess model fit

At this point we may want to compare different models. While Steps 1-5 have been directing us towards a single model, you may have been interested in other models along the way. Maybe there were some interactions that you thought were interesting, but didn’t think of before. Maybe you would like to combine different groups for categorical variables. Maybe your model feels really complicated and you want to look into a simpler model.

If you are completely happy with your model, then you don’t have to do this step.

You might create a table like such:

sum = summary(prelim_model)
model_fit_stats = data.frame(Model = "Preliminary main effects model",
                             Adjusted_R_sq = sum$adj.r.squared, 
                             AIC = AIC(prelim_model), 
                             BIC = BIC(prelim_model))

model_fit_stats
                           Model Adjusted_R_sq      AIC      BIC
1 Preliminary main effects model    0.05398472 110793.6 111227.6
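If you compare more than one model, you can stack rows onto the same table. A sketch, assuming you also fit a second candidate model such as prelim_model2:

```r
# Append fit statistics for a second candidate model to the comparison table
sum2 = summary(prelim_model2)
model_fit_stats = rbind(
  model_fit_stats,
  data.frame(Model         = "Model with age indicator",
             Adjusted_R_sq = sum2$adj.r.squared,
             AIC           = AIC(prelim_model2),
             BIC           = BIC(prelim_model2)))
model_fit_stats
```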
ImportantTasks

Optional: Create a table that displays some of the model fit statistics to compare preliminary final models.

2.8 Create a forest plot or regression table of your coefficient estimates

It’s often helpful to have a visualization of coefficient estimates. Forest plots are a nice way to show all the values together. You can also make a clean regression table.

Below I have started a forest plot using my prelim_model. You can make the plot with your final model. Note that my categories are out of order (I should’ve fixed that in an earlier lab!!)

library(broom.helpers)
model_tidy = tidy_and_attach(prelim_model, conf.int=T) %>%
  tidy_remove_intercept() %>%
  tidy_add_reference_rows() %>% 
  tidy_add_estimate_to_reference_rows() %>%
  tidy_add_term_labels() %>%
  mutate(var_label = case_match(var_label, 
                               "age" ~ " ", 
                               "How much control do people have over their weight?" ~ "How much control do people \n have over their weight?",  
                               "What is your race or ethnicity? (Participant selected Asian)" ~ "What is your race or \n ethnicity? (Participant can \n select more than one)", 
                               "What is your race or ethnicity? (Participant selected Black or African American)"  ~ "What is your race or \n ethnicity? (Participant can \n select more than one)", 
                               "What is your race or ethnicity? (Participant selected Hispanic, Latino or Spanish)"  ~ "What is your race or \n ethnicity? (Participant can \n select more than one)",
                               "What is your race or ethnicity? (Participant selected Middle Eastern or North African)"  ~ "What is your race or \n ethnicity? (Participant can \n select more than one)",
                               "What is your race or ethnicity? (Participant selected American Indian or Alaska Native)" ~ "What is your race or \n ethnicity? (Participant can \n select more than one)",
                               "What is your race or ethnicity? (Participant selected Some other race or ethnicity)" ~ "What is your race or \n ethnicity? (Participant can \n select more than one)", 
                               "What is your race or ethnicity? (Participant selected Native Hawaiian or other Pacific Islander)"  ~ "What is your race or \n ethnicity? (Participant can \n select more than one)", 
                               "What is your race or ethnicity? (Participant selected White)" ~ "What is your race or \n ethnicity? (Participant can \n select more than one)", 
                               "Political Identity data with a 7-point scale" ~ "Political Identity \n (7-point scale)",
                               .default = var_label
                         )) %>%
  mutate(label = case_match(label, 
                            "What is your race or ethnicity? (Participant selected Asian)" ~ "Asian", 
                            "What is your race or ethnicity? (Participant selected Black or African American)" ~ "Black or African American",
                            "What is your race or ethnicity? (Participant selected Hispanic, Latino or Spanish)" ~ "Hispanic, Latino or Spanish",
                            "What is your race or ethnicity? (Participant selected Middle Eastern or North African)" ~ "Middle Eastern or North African",
                            "What is your race or ethnicity? (Participant selected American Indian or Alaska Native)" ~ "American Indian or Alaska Native",
                            "What is your race or ethnicity? (Participant selected Some other race or ethnicity)" ~ "Some other race or ethnicity", 
                            "What is your race or ethnicity? (Participant selected Native Hawaiian or other Pacific Islander)" ~ "Native Hawaiian or other Pacific Islander",
                            "What is your race or ethnicity? (Participant selected White)" ~ "White",
                            "age" ~ "Age (years)",
                            .default = label
  )) %>%
    mutate(label = fct_rev(fct_inorder(label))) # This line will order your labels in the order of your data that went into the lm function

forest_plot = ggplot(data=model_tidy, aes(y=label, x=estimate, xmin=conf.low, xmax=conf.high)) + 
  facet_grid(rows = vars(var_label), scales = "free",
             space='free_y', switch = "y") + 
  geom_point(size = 2) +  geom_errorbarh(height=.25) + 
  geom_vline(xintercept=0, color='#C2352F', linetype='dashed', alpha=1) +
  theme_classic() +
  labs(x = "Beta", y = "Variables") +
  theme(axis.title = element_text(size = 10), axis.text = element_text(size = 10), 
        title = element_text(size = 10), strip.placement = "outside", 
        strip.text.y.left = element_text(size = 10, angle = 0), 
        strip.background = element_blank())
forest_plot

There are also other R packages for making forest plots, if you want to explore alternatives.

ImportantTasks

Create a forest plot or regression table using tbl_regression() to visualize the coefficient estimates.