gapm2 = gapm2 %>%
  mutate(income_levels = factor(income_levels,
                                ordered = TRUE,
                                levels = c("High income",
                                           "Upper middle income",
                                           "Lower middle income",
                                           "Low income")))
Muddy Points
Lesson 5: Categorical Covariates
Muddy Points from 2026
1. I have a quick question about your “quick hypothesis test” using the outcome of the CIs. We went over the example where the CI for Partly Free spanned 0, so that’s an indicator that it’s not significant. Is that only the case in reference to Not Free? That doesn’t say anything about if it’s significantly different than Free, right?
Great question! Yes, the CI for Partly Free only tells us if it’s significantly different from Not Free (the reference group). To compare Partly Free to Free, we would need to either change the reference group to Free (or Partly Free) or conduct a post-hoc test that compares those two groups directly.
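Here is a minimal sketch of the first option (changing the reference group). It uses a small made-up data frame called toy with invented life expectancy values, not the class dataset, and the variable names are assumptions for illustration. relevel() is a base R function.

```r
# Hypothetical toy data: 4 made-up countries per freedom status
toy <- data.frame(
  freedom_status = factor(
    rep(c("Not Free", "Partly Free", "Free"), each = 4),
    levels = c("Not Free", "Partly Free", "Free")
  ),
  LE = c(60, 62, 64, 66, 63, 65, 67, 69, 72, 74, 76, 78)
)

# Reference group is "Not Free" (the first level)
fit_nf <- lm(LE ~ freedom_status, data = toy)
confint(fit_nf)  # each CI compares a group to Not Free

# Make "Free" the reference group, then refit
toy$freedom_status_f <- relevel(toy$freedom_status, ref = "Free")
fit_f <- lm(LE ~ freedom_status_f, data = toy)
confint(fit_f)   # the Partly Free row now compares Partly Free to Free
```

The fitted group means do not change between the two models. Only the comparisons that the coefficients (and their CIs) represent change.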
2. the purpose of indicator function, seems to be evaluating yes or no such that yes would be partly free or no would be either not free or free.
Yep, that’s exactly right! That’s why we need an indicator function (or dummy variable) to represent each category (aside from the reference group) in the regression model. That way we have a way to represent each category!
3. Which direction is the difference taken to find the coefficient for other levels other than the referent? Is it always referent mean minus other level or the other way around, or is it the absolute value?
The difference will always be the other group (indicator group) minus the reference group.
The only way we get a positive coefficient (like \(\widehat{\beta}_2\)) is by subtracting the reference group from the indicator group. Here’s the math that I wrote in class:
\[ \widehat{LE}_{PF} - \widehat{LE}_{NF} = (\widehat{\beta}_0 + \widehat{\beta}_1) - \widehat{\beta}_0 = \widehat{\beta}_1 \]
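The direction of the difference can be checked numerically. This is a rough sketch on made-up numbers (the toy data frame and its values are hypothetical, not the class data): the fitted coefficient for Partly Free should match the Partly Free mean minus the Not Free mean.

```r
# Hypothetical toy data with two groups and invented life expectancies
toy <- data.frame(
  freedom_status = factor(
    rep(c("Not Free", "Partly Free"), each = 3),
    levels = c("Not Free", "Partly Free")  # Not Free is the reference group
  ),
  LE = c(60, 63, 66, 64, 67, 70)
)

fit <- lm(LE ~ freedom_status, data = toy)

# Sample mean of LE in each group
group_means <- tapply(toy$LE, toy$freedom_status, mean)

# Indicator group minus reference group
diff_pf_nf <- as.numeric(group_means["Partly Free"] - group_means["Not Free"])

# Estimated coefficient for the Partly Free indicator
beta1_hat <- as.numeric(coef(fit)["freedom_statusPartly Free"])

all.equal(diff_pf_nf, beta1_hat)
```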
4. augment() function, when do you recommend using it? why do you not need linearity if you just want to do reference cell coding
The augment() function is super useful when you want to add model-related information (like fitted values, residuals, etc.) back to your original data frame. This can be helpful for diagnostics or visualizations.
We’ll cover this function next week!
5. I’m confused about residuals - in the class example, are there 3 different groups of residuals that get combined? or how is there one plot of residuals that represents the difference between LE and each freedom status group’s mean life expectancy.
Great question! In the case of categorical variables, the residuals are calculated based on the difference between the observed values (like LE) and the predicted values (each freedom status group’s mean life expectancy) from the model. Since each category has its own predicted value (mean life expectancy for each freedom status), the residuals will reflect how far each observation is from its respective category mean. But once we calculate that distance (the residual), it does not matter which category it came from when we plot the residuals. They are all just differences from their predicted values.
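To make this concrete, here is a small sketch with made-up data (toy and its values are hypothetical). It computes each observation's distance from its own group mean by hand and compares that to the residuals that lm() reports.

```r
# Hypothetical toy data: 4 made-up countries per freedom status
toy <- data.frame(
  freedom_status = factor(
    rep(c("Not Free", "Partly Free", "Free"), each = 4),
    levels = c("Not Free", "Partly Free", "Free")
  ),
  LE = c(60, 62, 64, 66, 63, 65, 67, 69, 72, 74, 76, 78)
)

fit <- lm(LE ~ freedom_status, data = toy)

# Each row's own group mean (ave() uses mean by default)
group_mean <- ave(toy$LE, toy$freedom_status)

# Residual by hand: observed value minus its group's mean
manual_resid <- toy$LE - group_mean

# Same numbers as the model's residuals, pooled into one vector
all.equal(unname(resid(fit)), manual_resid)
```

The residuals come back as one vector with one entry per country, which is why a single residual plot can show all groups together.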
6. On page 40 of the slides, “Is the variable part of the main relationship that you are investigating? (even if linearity holds)” I was unable to follow what you were talking about at this point.
Yeah, this may not make a lot of sense until we get more into multiple linear regression (MLR). The idea is that even if a variable meets the assumptions (like linearity), we still need to consider if it is relevant to the research question or main relationship we are investigating. In MLR, we’ll have several variables in the model, but we may not need to interpret every relationship with the outcome.
7. This part in particular “Is the variable part of the main relationship that you are investigating? (even if linearity holds) If yes, consider leaving as reference cell coding unless the interpretation makes sense.” what does the “unless the interpretation makes sense” part mean
8. I was a little confused on why we would choose to keep an ordinal variable using referent coding. Would this depend on whether the finding was linear or not (i.e. keeping it referent might make more sense if there was a clear upward relationship with few outliers)?
9. when we were going over ordinal coding/scoring does it matter which score is assigned what number for beta not like is it always going from 0 down to 3 or can it be opposite. How does that impact the income level.
Muddy Points from 2025
1. I was a little confused on the part where if each categorical variable, with their respective betas, have each their own residuals for their respective category?
Residuals are always tied to the expected outcome. Since each category has a different expected outcome (based on their respective estimated coefficients), the countries from the same category will have the same expected outcome (in SLR). Thus, for countries in the same category, their residuals (\(\widehat{\epsilon}_i = Y_i - \widehat{Y}_i\)) will all have the same \(\widehat{Y}_i\). The residuals themselves still differ from country to country because each \(Y_i\) differs.
2. Factors: What was meant by in order to change the reference level, we need to convert it to data type factor? / Can you quickly explain what a factor is and why we had to convert to a factor to change the reference value?
Good question! Factors are one of the coding options in R for categorical variables. Unlike characters or strings, factors allow you to attach specific attributes to the variable. This includes assigning an order to the categories and setting the reference level.
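A short sketch of the difference, using a made-up character vector (the object names here are hypothetical). factor(), levels(), and relevel() are all base R.

```r
# A plain character vector has no notion of levels or a reference group
status_chr <- c("Partly Free", "Free", "Not Free", "Free")

# Converting without specifying levels sorts them alphabetically
levels(factor(status_chr))

# Specifying levels sets the order; the first level is the reference group
status_fct <- factor(status_chr,
                     levels = c("Not Free", "Partly Free", "Free"))
levels(status_fct)

# relevel() moves one level to the front, making it the new reference
status_ref <- relevel(status_fct, ref = "Partly Free")
levels(status_ref)
```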
3. I’m not really sure how using linear models on categorical variables is useful since you can’t make predictions like you can with continuous data
You can make predictions! “Predictions,” for how we used it with continuous predictors/covariates, is just the expected outcome for a given X. For categorical covariates, the expected outcome given X is the mean of each categorical group.
With only one variable in the model, it might feel more appropriate to use something like the ANOVA table, but we do not typically have only one variable in the model. This is just to help us set up the foundation of linear regression and understand categorical covariates in our model before we move to multiple linear regression.
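As a rough illustration of predictions with a categorical covariate, the sketch below fits a model to made-up data (toy and new_country are hypothetical) and predicts for a new Partly Free country. The prediction should equal the Partly Free group mean.

```r
# Hypothetical toy data: 4 made-up countries per freedom status
toy <- data.frame(
  freedom_status = factor(
    rep(c("Not Free", "Partly Free", "Free"), each = 4),
    levels = c("Not Free", "Partly Free", "Free")
  ),
  LE = c(60, 62, 64, 66, 63, 65, 67, 69, 72, 74, 76, 78)
)

fit <- lm(LE ~ freedom_status, data = toy)

# A new (made-up) country whose freedom status is Partly Free
new_country <- data.frame(
  freedom_status = factor("Partly Free",
                          levels = levels(toy$freedom_status))
)

# Expected outcome given X = Partly Free
pred <- predict(fit, newdata = new_country)

# Sample mean of the Partly Free group, for comparison
pf_mean <- mean(toy$LE[toy$freedom_status == "Partly Free"])

all.equal(unname(pred), pf_mean)
```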
Muddy Points from 2024
1. Why do we need to create a new variable for ordinal / scoring?
Otherwise R will treat income as non-ordinal, and use the default reference cell coding. So if we want the variable to be scored (and numeric), then we must store the scores in a new numeric variable that R can recognize.
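One possible way to build such a score variable is sketched below on made-up data. The data frame toy, the name income_score, and the choice of scoring Low income as 0 up to High income as 3 are all assumptions for illustration, not necessarily the coding used in class.

```r
# Hypothetical toy data: 2 made-up countries per income level
income_order <- c("Low income", "Lower middle income",
                  "Upper middle income", "High income")
toy <- data.frame(
  income_levels = rep(income_order, each = 2),
  LE = c(58, 60, 65, 67, 71, 73, 78, 80)
)

# New numeric variable: position in income_order, shifted to start at 0
toy$income_score <- match(toy$income_levels, income_order) - 1

# Reference cell coding: intercept plus 3 indicator coefficients
fit_ref <- lm(LE ~ income_levels, data = toy)
length(coef(fit_ref))

# Scoring: intercept plus a single slope for a one-level increase
fit_score <- lm(LE ~ income_score, data = toy)
length(coef(fit_score))
```

Keeping the score in a separate variable leaves the original income_levels untouched, so both codings can be fit and compared.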
2. I’m a little confused on how the R code works for recoding/reordering our variables, specifically 1) why we use the mutate function but then use the same name for the variable/how that works and 2) why you need to include the list of each variable name in a vector. Basically, what each piece of that code does exactly and why it’s needed.
mutate() is just a function to create or change a variable. If we are not fundamentally changing any aspect of the variable, we can give the result the same name, which overwrites the old version. This helps keep our data frame neat by not tacking on additional variables.
When I include the list of levels, I am giving R the exact order in which to read each level. So if I want to go from high income to low income, I would set levels = c("High income", "Upper middle income", "Lower middle income", "Low income") inside factor(). Then R would read high income as the first level.
3. Is there a rationale or strategy in choosing the most appropriate reference group?
Often no, not if the groups are not ordered. Things that you may consider:
Is there a central group that you want to make comparisons to?
Are there any social consequences of continually centering comparisons on one group? We may end up centering the narrative around that group.
When we interpret the coefficients, is there one group as the reference that makes it a little easier to interpret? (this has more of an effect in 513)
4. How do we build the regression indicators?
In R, we don’t need to build the indicators. If we have a variable that is a factor with mutually exclusive groups, then R will automatically create the indicators within the lm() function.
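To peek at the indicators R builds behind the scenes, base R's model.matrix() can be called with the same formula. The sketch below uses a made-up data frame (toy is hypothetical).

```r
# Hypothetical toy data: 4 made-up countries
toy <- data.frame(
  freedom_status = factor(
    c("Not Free", "Partly Free", "Free", "Partly Free"),
    levels = c("Not Free", "Partly Free", "Free")
  ),
  LE = c(61, 66, 75, 68)
)

# Design matrix for this formula: an intercept column plus one
# 0/1 indicator column for each non-reference level
X <- model.matrix(~ freedom_status, data = toy)
X
colnames(X)
```

There is no column for Not Free because it is the reference group: a row with 0 in both indicator columns is a Not Free country.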