Muddy Points

Lesson 5: Categorical Covariates

Modified

January 29, 2025

Muddy Points from 2025

1. I was a little confused on the part where if each categorical variable, with their respective betas, have each their own residuals for their respective category?

Residuals are always tied to the expected outcome. Since each category has a different expected outcome (based on their respective estimated coefficients), the countries from same category will have the same expected outcome (in SLR). Thus, for countries in the same category, their residuals (\(\widehat{\epsilon} = Y_i - \widehat{Y_i}\)) will all have the same \(\widehat{Y_i}\)

2. Factors: What was meant by in order to change the reference level, we need to convert it to data type factor? / Can you quickly explain what a factor is and why we had to convert to a factor to change the reference value?

Good question! Factors is one of the coding options in R for categorical variables. Different from characters or strings, factors allow you to attach specific attributes to the variable. This includes assigning order to the categories and setting reference levels.

3. I’m not really sure how using linear models on categorical variables is useful since you can’t make predictions like you can with continuous data

You can make predictions! “Predictions,” for how we used it with continuous predictors/covariates, is just the expected outcome for a given X. For categorical covariates, the expected outcome given X is the mean of each categorical group.

With only one variable in the model, it might feel more appropriate to use something like the ANOVA table, but we do not typically have only one variable in the model. This is just to help us set up the foundation of linear regression and understand categorical covariates in our model before we move to multiple linear regression.

Muddy Points from 2024

1. Why do we need to create a new variable for ordinal / scoring?

Otherwise R will treat income as non-ordinal, and use the default reference cell coding. So if we want our variables to be scored (and numeric) then we must put it in a form R can recognize.

2. I’m a little confused on how the R code works for recoding/reordering our variables, specifically 1) why we use the mutate function but then use the same name for the variable/how that works and 2) why you need to include the list of each variable name in a vector. Basically, what each piece of that code does exactly and why it’s needed.

Mutate is just a function to create/change a variable. So if we are not fundamentally changing any aspect of the variable, we can call it by the same name. Helps keep our data frame neat by not tacking on additional variables.

When I am including the list of levels I am giving R the exact order to read each level. So if I want to go from high income to low income, I would reset the levels to the below code. Then R would read high income as the first level.

gapm2 = gapm2 %>%
 mutate(income_levels = factor(income_levels, 
            ordered = T, 
            levels = c("High income", 
                       "Upper middle income", 
                       "Lower middle income", 
                       "Low income")))

3. Is there a rationale or strategy in choosing the most appropriate reference group?

Often no, not if the groups are not ordered. Things that you may consider:

Is there a central group that you want to make comparisons to?
Is there any social consequences of continually centering comparisons to one group? We may be consequentially centering the narrative around that group.
When we interpret the coefficients, is there one group as the reference that makes it a little easier to interpret? (this has more of an effect in 513)

4. How do we build the regression indicators?

In R, we don’t need to build the indicators. If we have a variable that is a facotr with mutually exclusive groups, then R will automatically create the indicators within the lm() function.