Lesson 5: SLR-ish: Categorical Covariates

Nicky Wakim

2026-01-21

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables
Write the regression equation for a categorical variable using reference cell coding
Calculate and interpret coefficients for reference cell coding
Change the reference level in a categorical variable for reference cell coding
Create new variables and interpret coefficient for ordinal / scoring coding

Why “SLR-ish”?

The strict definition of simple linear regression: only two variables that are BOTH continuous

Common (but kinda wrong) use of simple linear regression: only two variables with outcome continuous and predictor not specified

I’m including multi-level categorical covariates in SLR mostly because it’s easier to learn now!

Let’s map that to our regression analysis process

Model Selection

Building a model
Selecting variables
Prediction vs interpretation
Comparing potential models

Model Fitting

Find best fit line
Using OLS in this class
Parameter estimation
Categorical covariates
Interactions

Model Evaluation

Evaluation of model fit
Testing model assumptions
Residuals
Transformations
Influential points
Multicollinearity

Model Use (Inference)

Inference for coefficients
Hypothesis testing for coefficients

Inference for expected $Y$ given $X$

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables

Write the regression equation for a categorical variable using reference cell coding
Calculate and interpret coefficients for reference cell coding
Change the reference level in a categorical variable for reference cell coding
Create new variables and interpret coefficient for ordinal / scoring coding

Still looking at Gapminder Life Expectancy data

We will look at life expectancy vs. these Freedom statuss
You can access my codebook here
Gapminder measures freedom status: This is a descriptive text of the real-world rights and freedoms enjoyed by individuals. It is determined by the freedom rating which is the average of political rights and civil liberties ratings.
- Not free (NF)
- Partly free (PF)
- Free (F)
Freedom status is a multi-level categorical covariate: it has three statuses

Note: I am calling the expected life expectancy $\widehat{LE}$
- Previously, I have written $\widehat{\text{life expectancy}}$

Visualizing the relationship

Bad option for visualization:

Code

ggplot(gapm, aes(x = fct_reorder(freedom_status, fs_order), y = life_exp)) +
  geom_point() +
  labs(x = "Freedom status", 
       y = "Country life expectancy (years)",
       title = "Life expectancy vs. Freedom status") +
  theme(axis.title = element_text(size = 20), 
        axis.text = element_text(size = 20), 
        title = element_text(size = 20))

Good option for visualization:

Code

ggplot(gapm, aes(x = fct_reorder(freedom_status, fs_order), y = life_exp)) +
  geom_jitter(size = 1, alpha = .6, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 8, shape = 18) +
  labs(x = "Freedom status", 
       y = "Country life expectancy (years)",
       title = "Life expectancy vs. Freedom status",
       caption = "Diamonds = FS averages") +
  theme(axis.title = element_text(size = 20), 
        axis.text = element_text(size = 20), 
        title = element_text(size = 20))

Used geom_jitter()

Good option for visualization:

Code

ggplot(gapm, aes(x = fct_reorder(freedom_status, fs_order), y = life_exp)) +
  geom_boxplot() +
  labs(x = "Freedom status", 
       y = "Country life expectancy (years)",
       title = "Life expectancy vs. Freedom status") +
  theme(axis.title = element_text(size = 20), 
        axis.text = element_text(size = 20), 
        title = element_text(size = 20))

Linear regression with a categorical covariate

When using a categorical covariate/predictor,
- We do NOT, technically, find a best-fit line
Instead we model the means of the outcome
- For the different levels of the categorical variable

In 511/525, we used Kruskal-Wallis test and our ANOVA table to test if groups means were statistically different from one another
We can do this using linear models AND we can include other variables in the model

There are different ways to code categorical variables

Reference Cell Coding

Sometimes called dummy coding
Compares each level of a variable to the omitted (reference) level

Effect coding

Sometimes called sum coding or deviation coding
Compares deviations from the grand mean
Not covered in our class

Ordinal encoding

Sometimes called scoring
Categories have a natural, even spaced ordering
Linear relationship between levels

If you want to learn more about these and other coding schemes:

Building the regression equation: problem with a single coefficient

Previously: simple linear regression

Outcome $Y$ = numerical variable
Predictor $X$ = numerical variable

The regression (best-fit) line is: \[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X \]

New: what if the explanatory variable is categorical?

Naively, we could Write: $\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X$

Or, with our variables: \[\widehat{\textrm{LE}} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot \textrm{FS} \]

But what does $\textrm{FS}$ (freedom status) mean in this equation?
- What values can it take? How do we represent each FS?

Note: the above is WRONG

Building the regression equation: how do we map categories to means?

If we only have freedom status in our model and want to map it to an average life expectancy…

We want to create a function that can map each freedom status to life expectancy
- If not free: $\widehat{LE} = 68.99$ years
- If partly free: $\widehat{LE} = 70.39$ years
- If free: $\widehat{LE} = 74.13$ years

Can we make one equation for $\widehat{LE}$ by putting the “if” statements within the equation?

Building the regression equation: Indicator functions

In order to represent each status in the equation, we need to introduce a new function:
- Indicator function:
\[I(X = x) \text{ or } I(x) = \left\{ \begin{array}{@{}ll@{}} 1, & \text{if}\ X = x \\ 0, & \text{else} \end{array}\right. \]
- This basically a binary yes/no if $X$ is a specific value $x$
For example, if we want to identify a country as partly free, we can make:

\[I(FS = \text{partly free}) \text{ or }I(\text{partly free}) = \left\{ \begin{array}{@{}ll@{}} 1, & \text{if}\ FS = \text{partly free} \\ 0, & \text{else} \end{array}\right. \]

Poll Everywhere Question 1

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables

Write the regression equation for a categorical variable using reference cell coding

Calculate and interpret coefficients for reference cell coding
Change the reference level in a categorical variable for reference cell coding
Create new variables and interpret coefficient for ordinal / scoring coding

Building the regression equation: Indicators in our equation

\[\begin{aligned} \widehat{\textrm{LE}} = & 68.99 \cdot I(\text{not free}) + 70.39 \cdot I(\text{partly free}) + \\ &74.13 \cdot I(\text{free}) \end{aligned}\]

However, a linear regression equation still requires an intercept!
- So one of our statuses need to become our “reference” group
- We’ll use “not free” as our reference
- That means we need to adjust all the numbers

\[\begin{aligned} \widehat{\textrm{LE}} = & 68.99 + 1.4 \cdot I(\text{partly free}) + \\ &5.14 \cdot I(\text{free}) \\ \widehat{\textrm{LE}} = & \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{partly free}) + \\ & \widehat\beta_2 \cdot I(\text{free}) \end{aligned}\]

Viewing the regression equation another way

\[\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1 \cdot I(FS = \text{partly free}) + \widehat\beta_2 \cdot I(FS = \text{free})\]

Freedom status	Regression equation for FS	Average Life Expectancy for FS
Not free	$\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 0 + \widehat\beta_2 \cdot 0 \end{aligned}$	$\widehat{\textrm{LE}} = \widehat\beta_0$
Partly free	$\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 1+ \widehat\beta_2 \cdot 0\end{aligned}$	$\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1$
Free	$\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 0 + \widehat\beta_2 \cdot 1\end{aligned}$	$\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_2$

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables
Write the regression equation for a categorical variable using reference cell coding

Calculate and interpret coefficients for reference cell coding

Change the reference level in a categorical variable for reference cell coding
Create new variables and interpret coefficient for ordinal / scoring coding

Interpretation of regression equation coefficients

Remember: expected, mean, and average are interchangeable

Coefficient	Interpretation
	Expected/mean/average life expectancy of countries that are not free
	Difference in mean life expectancy between partly free and not free countries -OR- Mean difference in life expectancy between partly free and not free countries
	Difference in mean life expectancy between free and not free countries -OR- Mean difference in life expectancy between free and not free countries

Coefficient

Interpretation

Expected/mean/average life expectancy of countries that are not free

Difference in mean life expectancy between partly free and not free countries -OR-

Mean difference in life expectancy between partly free and not free countries

Difference in mean life expectancy between free and not free countries -OR-

Mean difference in life expectancy between free and not free countries

Poll Everywhere Question 2

Regression table with `lm()` function

model1 = gapm %>%
  lm(formula = life_exp ~ freedom_status)

tidy(model1, conf.int=T) %>% gt() %>% tab_options(table.font.size = 38) %>% 
  fmt_number(decimals = 2)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	68.99	1.17	58.88	0.00	66.67	71.32
freedom_statusPF	1.40	1.52	0.92	0.36	−1.61	4.40
freedom_statusF	5.14	1.70	3.02	0.00	1.76	8.51

\[\widehat{\textrm{LE}} = 68.99 + 1.4 \cdot I(\text{PF}) + 5.14 \cdot I(\text{F})\]

Which Freedom status did R choose as the reference level?
How you would calculate the mean life expectancies of freedom statuss using only the results from the regression table?

Bringing in the numbers/units/95% CI

Coefficient	Interpretation
$\widehat{\beta}_0$	Average life expectancy of countries that are not free is 68.99 (95% CI: 66.67, 71.32).
$\widehat{\beta}_1$	The difference in mean life expectancy between countries that are partly free and not free is 1.4 (95% CI: -1.61, 4.4).
$\widehat{\beta}_2$	The difference in mean life expectancy between countries that are free and not free is 5.14 (95% CI: 1.76, 8.51).

Don’t forget that we can use the confidence intervals to assess whether the mean difference with “not free” is significant or not

We can also use R to report each status’s average life expectancy

Find the 95% CI’s for the mean life expectancy for the countries that are partly free and free

Use the base R predict() function (see Lesson 4 for more info)
Requires specification of a newdata “value”

newdata = data.frame(freedom_status = c("NF", "PF", "F"))

(pred = predict(model1, 
                newdata=newdata, 
                interval="confidence"))

       fit      lwr      upr
1 68.99387 66.66958 71.31816
2 70.39000 68.48194 72.29806
3 74.12893 71.68329 76.57457

Interpretations

The average life expectancy for countries that are partly free is 70.39 years (95% CI: 68.48, 72.3).
The average life expectancy for countries that are free is 74.13 years (95% CI: 71.68, 76.57).

Another way to look at coefficient values

\[\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{PF}) + \widehat\beta_2 \cdot I(\text{F})\]

Code

# means of each level of `freedom_status`
gapm2_ave <- gapm %>% 
  group_by(freedom_status) %>% 
  summarise(life_ave = mean(life_exp))

# mean of `africa`
mean_NF <- gapm2_ave %>% 
  filter(freedom_status == "NF") %>% 
  pull(life_ave)

# differences in means between levels of `freedom_status` and `africa`
gapm2_ave_diff <- gapm2_ave %>% 
  mutate(`Difference with NF` = life_ave - mean_NF) %>%
  rename(`Freedom status` = freedom_status, 
         `Average life expectancy` = life_ave)

# At the beginning of the Rmd we loaded knitr, which is where the kable command is from
# library(knitr)
gapm2_ave_diff %>% kable(
  digits = 1,
  format = "markdown"
  )

Freedom status	Average life expectancy	Difference with NF
NF	69.0	0.0
PF	70.4	1.4
F	74.1	5.1

\[\widehat{\textrm{LE}} = 68.99 + 1.4 \cdot I(\text{PF}) + 5.14 \cdot I(\text{F})\]

10 minute break here?

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables
Write the regression equation for a categorical variable using reference cell coding
Calculate and interpret coefficients for reference cell coding

Change the reference level in a categorical variable for reference cell coding

Create new variables and interpret coefficient for ordinal / scoring coding

Reference levels

Why is NF not one of the variables in the regression equation?

\[\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{PF}) + \widehat\beta_2 \cdot I(\text{F})\]

Categorical variables must have at least 2 levels. If they have 2 levels, we call them binary

We choose one level as our reference level to which all other levels of the categorical variable are compared
- The levels and $ are compared to the level $\text{not free}$

The intercept of the regression equation is the mean of the outcome restricted to the reference level
- Recall that the intercept is the mean life expectancy of countries that are not free, which was our reference level

If the categorical variable has $r$ levels, then we need $r-1$ variables/coefficients to model it!

We can change the reference level to `PF` (1/2)

Suppose we want to compare the mean life expectancies of freedom statuses to the $\text{partly free}$ level instead of $\text{not free}$
Below is the estimated regression equation for when $\text{not free}$ is the reference level

\[\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{PF}) + \widehat\beta_2 \cdot I(\text{F})\]

Update the variables to make $\text{partly free}$ the reference level:

\[\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{NF}) + \widehat\beta_2 \cdot I(\text{F})\]

We can change the reference level to `PF` (2/2)

Now update the coefficients of the regression equation using the output below.

Freedom status	Average life expectancy	Difference with PF
NF	68.99	-1.40
PF	70.39	0.00
F	74.13	3.74

\[\widehat{\textrm{LE}} = 70.39 -1.4 \cdot I(\text{NF}) + 3.74 \cdot I(\text{F})\]

R: Change reference level to `PF`

freedom_status data type was originally a character
check this with str() or class() or glimpse()

str(gapm$freedom_status)

 Factor w/ 3 levels "NF","PF","F": 1 2 1 3 2 1 2 2 3 2 ...

To change the reference level, we can use fct_relevel() from the forcats package (part of tidyverse)
Any levels not mentioned will be left in their existing order, after the explicitly mentioned levels.

gapm2 = gapm %>% 
 mutate(
   freedom_status = fct_relevel(freedom_status, "PF")
     )

Check the order:

levels(gapm2$freedom_status)

[1] "PF" "NF" "F"

R: Run model with `PF` as the reference level

levels(gapm2$freedom_status)

[1] "PF" "NF" "F"

model2 = gapm2 %>% lm(formula = life_exp ~ freedom_status)
tidy(model2) %>% gt() %>% tab_options(table.font.size = 35) %>% 
  fmt_number(decimals = 2)

term	estimate	std.error	statistic	p.value
(Intercept)	70.39	0.96	73.17	0.00
freedom_statusNF	−1.40	1.52	−0.92	0.36
freedom_statusF	3.74	1.56	2.39	0.02

\[\begin{aligned}\widehat{\textrm{LE}} &= \widehat\beta_0 + \widehat\beta_1 \cdot I(\text{NF}) + \widehat\beta_2 \cdot I(\text{F}) + \\ \widehat{\textrm{LE}} &= 70.39 -1.4 \cdot I(\text{NF}) + 3.74 \cdot I(\text{F}) \end{aligned}\]

Fitted values & residuals

Similar to as before:

Observed values $Y$ are the values in the dataset
Fitted values $\widehat{Y}$ are the values that ~~fall on the best-fit line for a specific value of x~~ are the means of the outcome stratified by the categorical predictor’s levels
Residuals ($\widehat{\epsilon} = Y - \widehat{Y}$) are the differences between the two

Fitted values are the same as the means

m1_aug <- augment(model1)

ggplot(m1_aug, aes(x = freedom_status, y = .fitted)) + geom_point() +
  theme(axis.text = element_text(size = 22), axis.title = element_text(size = 22))

Residual plots (now the spread within each status)

ggplot(m1_aug, aes(x=.resid)) + geom_histogram() + 
  theme(axis.text = element_text(size = 22), title = element_text(size = 22))

Poll Everywhere Question 3

Learning Objectives

Understand why we need a new way to code categorical variables compared to continuous variables
Write the regression equation for a categorical variable using reference cell coding
Calculate and interpret coefficients for reference cell coding
Change the reference level in a categorical variable for reference cell coding

Create new variables and interpret coefficient for ordinal / scoring coding

Let’s look at life expectancy vs. four income levels

Gapminder discusses individual income levels

Income levels for a country is based on average GDP per capita, and grouped into:
- Low income
- Lower middle income
- Upper middle income
- High income

Visualizing the ordinal variable, income levels

A few changes needed:

Put the income levels in order

gapm2 = gapm2 %>%
 mutate(income_level_4 = fct_relevel(
          income_level_4, 
            "Low income", 
            "Lower middle income", 
            "Upper middle income", 
            "High income"
 ))

Make the income levels readable
- How to Rotate Axis Labels in ggplot2?

Much better: Visualizing the ordinal variable, income levels

ggplot(gapm2, aes(x = income_level_4, y = life_exp)) +
  geom_jitter(size = 1, alpha = .6, width = 0.2) +
  stat_summary(fun = mean, geom = "point", size = 8, shape = 18) +
  labs(x = "Income levels", 
       y = "Country life expectancy (years)",
       title = "Life expectancy vs. income levels",
       caption = "Diamonds = Income level averages") +
  theme(axis.title = element_text(size = 20), 
        axis.text = element_text(size = 20), 
        title = element_text(size = 20), 
        axis.text.x=element_text(angle = 20, vjust = 1, hjust=1))

How can we code this variable?

We have two options:

Treat the levels as nominal, and use reference cell coding

Like we did with freedom status
This option will not break the linearity assumption
For $g$ categories of the variable, we will have $g-1$ coefficients to estimate

Use the ordinal values to score the levels and treat as a numerical variable

Even if a variable is inherently ordered, we need to check that linearity holds if categories are represented as numbers
This way of coding preserves more power in the model (less coefficients to estimate means more power)
Only one coefficient to estimate

Some important considerations when scoring ordinal variables

Even if a variable is inherently ordered, we need to check that linearity holds if categories are represented as numbers (more in next lessons)
- Linearity is an assumption of linear regression: that the relationship between X and Y is linear

Assumes differences between adjacent groups are equal
- Income levels are pre-set groups by Gapminder
- Might be hard to interpret “every 1-level increase in income level”

Is the variable part of the main relationship that you are investigating? (even if linearity holds)
- If yes, consider leaving as reference cell coding unless the interpretation makes sense
- If no, and just needed as an adjustment in your model, then power benefit of scoring might be worth it!

Check that linearity holds for income levels

Using visual assessment, linearity holds for our income levels (more in next lessons)
We can use the ordinal encoding for income levels

Poll Everywhere Question 4

Ordinal coding / Scoring

Map each income level to a number
Usually start at 1

Income Level	Score
Low income	1
Lower middle income	2
Upper middle income	3
High income	4

gapm2 = gapm2 %>%
  mutate(income_num = case_when(
    income_level_4 == "Low income" ~ 1,
    income_level_4 == "Lower middle income" ~ 2,
    income_level_4 == "Upper middle income" ~ 3,
    income_level_4 == "High income" ~ 4
  ))
gapm2 %>% select(income_level_4, income_num) %>% 
  head(6)

# A tibble: 6 × 2
  income_level_4      income_num
  <fct>                    <dbl>
1 Low income                   1
2 Upper middle income          3
3 High income                  4
4 Upper middle income          3
5 Upper middle income          3
6 Upper middle income          3

Run the model with the scored income

mod_inc2 = gapm2 %>% lm(formula = life_exp ~ income_num)
tidy(mod_inc2) %>% gt() %>% tab_options(table.font.size = 37) %>%
  fmt_number(decimals = 2)

term	estimate	std.error	statistic	p.value
(Intercept)	58.51	1.38	42.47	0.00
income_num	4.90	0.51	9.66	0.00

\[\begin{aligned} \widehat{\textrm{LE}} &= \widehat\beta_0 + \widehat\beta_1 \cdot \text{Income level} \\ \widehat{\textrm{LE}} &= 58.51 + 4.9 \cdot \text{Income level} \end{aligned}\]

Keep in mind: We cannot calculate the expected outcome outside of the scoring values
- For example, we cannot find the mean life expectancy for an income level of 1.5

Interpreting the model

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	58.51	1.38	42.47	0.00	55.78	61.24
income_num	4.90	0.51	9.66	0.00	3.90	5.91

\[\widehat{\textrm{LE}} = 54.01 + 6.25 \cdot \text{Income level}\]

Interpreting the intercept: At an income level of 0, mean life expectancy is 54.01 (95% CI: 51.92, 56.10).
- Note: this does not make sense because there is no income level of 0!
Interpreting the coefficient for income: For every 1-level increase in income level, mean life expectancy increases 4.9 years (95% CI: 3.9, 5.91).

What if life expectancy vs. income level looked like this?

No longer maintaining the linearity assumption
Need to use reference cell coding

We would fit the following model: \[\begin{aligned} \textrm{LE} = & \beta_0 + \beta_1 \cdot I(\text{Lower middle income}) + \\ & \beta_2 \cdot I(\text{Upper middle income}) + \\ & \beta_3 \cdot I(\text{High income}) + \epsilon \end{aligned}\]

If time…

Let’s walk through categorical variables that have multiple selections

So each group is not mutually exclusive
We could make an indicator for each category, but individuals could be a part of multiple categories

Also, thinking about income levels - can we combine two groups to make one??

Freedom status	Regression equation for FS	Average Life Expectancy for FS
Not free	\(\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 0 + \widehat\beta_2 \cdot 0 \end{aligned}\)	\(\widehat{\textrm{LE}} = \widehat\beta_0\)
Partly free	\(\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 1+ \widehat\beta_2 \cdot 0\end{aligned}\)	\(\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_1\)
Free	\(\begin{aligned} \widehat{\textrm{LE}} = &\widehat\beta_0 + \widehat\beta_1 \cdot 0 + \widehat\beta_2 \cdot 1\end{aligned}\)	\(\widehat{\textrm{LE}} = \widehat\beta_0 + \widehat\beta_2\)

Coefficient	Interpretation
\(\widehat{\beta}_0\)	Average life expectancy of countries that are not free is 68.99 (95% CI: 66.67, 71.32).
\(\widehat{\beta}_1\)	The difference in mean life expectancy between countries that are partly free and not free is 1.4 (95% CI: -1.61, 4.4).
\(\widehat{\beta}_2\)	The difference in mean life expectancy between countries that are free and not free is 5.14 (95% CI: 1.76, 8.51).

Lesson 5: SLR-ish: Categorical Covariates

Learning Objectives

Why “SLR-ish”?

Let’s map that to our regression analysis process

Learning Objectives

Still looking at Gapminder Life Expectancy data

Visualizing the relationship

Linear regression with a categorical covariate

There are different ways to code categorical variables

Building the regression equation: problem with a single coefficient

Building the regression equation: how do we map categories to means?

Building the regression equation: Indicator functions

Poll Everywhere Question 1

Learning Objectives

Building the regression equation: Indicators in our equation

Viewing the regression equation another way

Learning Objectives

Interpretation of regression equation coefficients

Poll Everywhere Question 2

Regression table with lm() function

Bringing in the numbers/units/95% CI

We can also use R to report each status’s average life expectancy

Another way to look at coefficient values

10 minute break here?

Learning Objectives

Reference levels

We can change the reference level to PF (1/2)

We can change the reference level to PF (2/2)

R: Change reference level to PF

R: Run model with PF as the reference level

Fitted values & residuals

Fitted values are the same as the means

Residual plots (now the spread within each status)

Poll Everywhere Question 3

Learning Objectives

Let’s look at life expectancy vs. four income levels

Visualizing the ordinal variable, income levels

Much better: Visualizing the ordinal variable, income levels

How can we code this variable?

Some important considerations when scoring ordinal variables

Check that linearity holds for income levels

Poll Everywhere Question 4

Ordinal coding / Scoring

Run the model with the scored income

Interpreting the model

What if life expectancy vs. income level looked like this?

If time…

Regression table with `lm()` function

We can change the reference level to `PF` (1/2)

We can change the reference level to `PF` (2/2)

R: Change reference level to `PF`

R: Run model with `PF` as the reference level