2025-01-15


- Model Selection
  - Building a model
  - Selecting variables
  - Prediction vs interpretation
  - Comparing potential models
- Model Fitting
  - Find best-fit line
  - Using OLS in this class
  - Parameter estimation
  - Categorical covariates
  - Interactions
- Model Evaluation
- Model Use (Inference)
We fit Gapminder data with female literacy rate as our independent variable and life expectancy as our dependent variable
We used OLS to find the coefficient estimates of our best-fit line
```r
model1 <- gapm %>% lm(formula = LifeExpectancyYrs ~ FemaleLiteracyRate)

# Get regression table:
tidy(model1) %>% gt() %>%
  tab_options(table.font.size = 40) %>%
  fmt_number(decimals = 2)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 50.93 | 2.66 | 19.14 | 0.00 |
| FemaleLiteracyRate | 0.23 | 0.03 | 7.38 | 0.00 |

The (population) regression model is denoted by:
\[Y = \beta_0 + \beta_1X + \epsilon\]
- \(\beta_0\) and \(\beta_1\) are unknown population parameters
- \(\epsilon\) (epsilon) is the error about the line
  - It is assumed to be a random variable with a Normal distribution with mean 0 and constant variance \(\sigma^2\), i.e. \(\epsilon \sim N(0, \sigma^2)\)
\[\widehat{\sigma}^2 = S_{y|x}^2= \frac{1}{n-2}\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 =\frac{1}{n-2}SSE = MSE\]
The standard deviation \(\widehat{\sigma}\) is given in the R output as the Residual standard error
summary() output of the model:

```
Call:
lm(formula = LifeExpectancyYrs ~ FemaleLiteracyRate, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
 -22.299   -2.670    1.145    4.114    9.498 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        50.92790    2.66041  19.143  < 2e-16 ***
FemaleLiteracyRate  0.23220    0.03148   7.377  1.5e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.142 on 78 degrees of freedom
Multiple R-squared:  0.4109,    Adjusted R-squared:  0.4034
F-statistic: 54.41 on 1 and 78 DF,  p-value: 1.501e-10
```
\[\begin{aligned} \widehat{\sigma}^2 & = \frac{1}{n-2}SSE\\ 6.142^2 & = \frac{1}{80-2}SSE\\ SSE & = 78 \cdot 6.142^2 = 2942.48 \end{aligned}\]
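This arithmetic can be checked directly in R (a sketch, assuming the model1 fit from above):

```r
# sum of squared residuals (SSE); should be about 2942.5
sse <- sum(residuals(model1)^2)

# MSE = SSE / (n - 2); its square root is the residual standard error (about 6.142)
sqrt(sse / df.residual(model1))
```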
1. Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
2. Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
Population model
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma^2)\)
\(\sigma^2\) is the variance of the residuals
Sample best-fit (least-squares) line
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X \]
Note: Some sources use \(b\) instead of \(\widehat{\beta}\)
We have two options for inference: a hypothesis test or a confidence interval
Note: R reports p-values for 2-sided tests
1. Check the assumptions
2. Set the level of significance \(\alpha\)
3. Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
4. Specify the test statistic and its distribution under the null
5. Calculate the test statistic
6. Calculate the p-value based on the observed test statistic and its sampling distribution
7. Write a conclusion to the hypothesis test

\[\text{SE}_{\widehat\beta_1} = \frac{s_{\textrm{residuals}}}{s_x\sqrt{n-1}}\]
\(\text{SE}_{\widehat\beta_1}\) is the variability of the statistic \(\widehat\beta_1\)

```r
glance(model1)
```

```
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.411         0.403  6.14      54.4 1.50e-10     1  -258.  521.  529.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
```

```r
# standard deviation of the residuals (Residual standard error in summary() output)
(s_resid <- glance(model1)$sigma)
#> [1] 6.142157

# standard deviation of the x values
(s_x <- sd(gapm$FemaleLiteracyRate, na.rm = TRUE))
#> [1] 21.95371

# sample size
(n <- nobs(model1))
#> [1] 80

# SE of the slope estimate
(se_b1 <- s_resid / (s_x * sqrt(n - 1)))
#> [1] 0.03147744
```

```r
# recall model1_b1 is the regression table restricted to the b1 row
model1_b1 <- tidy(model1) %>% filter(term == "FemaleLiteracyRate")
model1_b1 %>% gt() %>%
  tab_options(table.font.size = 45) %>% fmt_number(decimals = 4)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| FemaleLiteracyRate | 0.2322 | 0.0315 | 7.3766 | 0.0000 |
The test statistic for a single coefficient follows a Student’s t-distribution
Single coefficient testing can be done on any coefficient, but it is most useful for continuous covariates or binary covariates
We are testing if the slope is 0 or not:
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
Often we use \(\alpha = 0.05\)
The test statistic is \(t\), and follows a Student’s t-distribution.
```r
# recall model1_b1 is the regression table restricted to the b1 row
model1_b1 <- tidy(model1) %>% filter(term == "FemaleLiteracyRate")
model1_b1 %>% gt() %>%
  tab_options(table.font.size = 40) %>% fmt_number(decimals = 2)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| FemaleLiteracyRate | 0.23 | 0.03 | 7.38 | 0.00 |

```r
# observed test statistic
model1_b1$statistic
#> [1] 7.376557
```
The \(p\)-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic, assuming the null hypothesis \(H_0\) is true
We know the probability distribution of the test statistic (the null distribution) assuming \(H_0\) is true
Statistical theory tells us that the test statistic \(t\) can be modeled by a \(t\)-distribution with \(df = n-2\).
Option 1: Use pt() and our calculated test statistic
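As a sketch, the two-sided p-value can be computed from the observed test statistic with pt(), where the statistic and \(df = n - 2 = 78\) come from the output above:

```r
t_stat <- 7.376557              # observed test statistic for the slope
2 * pt(-abs(t_stat), df = 78)   # two-sided p-value, about 1.5e-10
```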
We reject the null hypothesis that the slope is 0 at the \(5\%\) significance level. There is sufficient evidence of a significant association between female life expectancy and female literacy rates (p-value < 0.0001).
In our assignments: if you use Option 2, Step 5 is optional
We are testing if the intercept is 0 or not:
\[\begin{align} H_0 &: \beta_0 = 0\\ \text{vs. } H_A&: \beta_0 \neq 0 \end{align}\]
Often we use \(\alpha = 0.05\)
This is the same as the slope. The test statistic is \(t\), and follows a Student’s t-distribution.
```r
# recall model1_b0 is the regression table restricted to the b0 (intercept) row
model1_b0 <- tidy(model1) %>% filter(term == "(Intercept)")
model1_b0 %>% gt() %>%
  tab_options(table.font.size = 40) %>% fmt_number(decimals = 2)
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 50.93 | 2.66 | 19.14 | 0.00 |

```r
# observed test statistic
model1_b0$statistic
#> [1] 19.1429
```
Option 1: Use pt() and our calculated test statistic
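As a sketch with the values above:

```r
# two-sided p-value for the intercept's test statistic (df = n - 2 = 78);
# effectively 0, matching the < 2e-16 shown in the summary() output
2 * pt(-abs(19.1429), df = 78)
```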
We reject the null hypothesis that the intercept is 0 at the \(5\%\) significance level. There is sufficient evidence that the intercept for the association between average female life expectancy and female literacy rates is different from 0 (p-value < 0.0001).
Recall the general CI formula:
\[\widehat{\beta}_1 \pm t^*_{n-2} \cdot SE_{\widehat{\beta}_1}\]
To construct the confidence interval, we need to:
1. Set our \(\alpha\)-level
2. Find \(\widehat\beta_1\)
3. Calculate the critical value \(t^*_{n-2}\)
4. Calculate \(SE_{\widehat{\beta}_1}\)
\[\widehat{\beta}_1 \pm t^*\cdot SE_{\widehat{\beta}_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n - 2\).
Save the values needed for the CI, then use the formula to calculate each bound.
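The calculation can be sketched in R, using the slope estimate and standard error from the regression table:

```r
alpha  <- 0.05
t_star <- qt(1 - alpha/2, df = 78)   # critical value, about 1.99
b1     <- 0.2322                     # slope estimate
se_b1  <- 0.03148                    # SE of the slope estimate
c(lower = b1 - t_star * se_b1,
  upper = b1 + t_star * se_b1)       # about (0.170, 0.295)
```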
When we report our results to someone else, we don’t usually show them our full hypothesis test
Typically, we report the estimate with the confidence interval
Once we found our CI, we often just write the interpretation of the coefficient estimate:
General statement for population slope inference
For every increase of 1 unit in the \(X\)-variable, there is an expected/average (pick one) increase of \(\widehat\beta_1\) units in the \(Y\)-variable (95% CI: LB, UB).
```r
tidy(model1, conf.int = TRUE) %>% gt() %>%
  tab_options(table.font.size = 40) %>% fmt_number(decimals = 3)
```

| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 50.928 | 2.660 | 19.143 | 0.000 | 45.631 | 56.224 |
| FemaleLiteracyRate | 0.232 | 0.031 | 7.377 | 0.000 | 0.170 | 0.295 |
General statement for population intercept inference
The expected outcome for the \(Y\)-variable is (\(\widehat\beta_0\)) when the \(X\)-variable is 0 (95% CI: LB, UB).
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 50.928 | 2.660 | 19.143 | 0.000 |
| FemaleLiteracyRate | 0.232 | 0.031 | 7.377 | 0.000 |
\[\widehat{\textrm{life expectancy}} = 50.9 + 0.232 \cdot \textrm{female literacy rate} \]
\[\widehat{\textrm{life expectancy}} = 50.9 + 0.232 \cdot 60 = 64.82\]
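The same fitted value can be obtained with predict() (a sketch, assuming the model1 fit from earlier; the full-precision coefficients give about 64.86 rather than the hand-rounded 64.82):

```r
# fitted mean life expectancy when the female literacy rate is 60
predict(model1, newdata = data.frame(FemaleLiteracyRate = 60))
```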
How do we interpret the expected value?
How variable is it?
Recall the population model:
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma^2)\)
\[\widehat{E}[Y|X^*] = \widehat\beta_0 + \widehat\beta_1 X^*\]

\[\widehat{E}[Y|X^*] \pm t_{n-2}^* \cdot SE_{\widehat{E}[Y|X^*]}\]
\[SE_{\widehat{E}[Y|X^*]} = s_{\text{residuals}} \sqrt{\frac{1}{n} + \frac{(X^* - \overline{X})^2}{(n-1)s_X^2}}\]
\(t^*\) is found with qt() and depends on the confidence level \(1-\alpha\).
Find the 95% CI for the mean life expectancy when the female literacy rate is 60.
Find the 95% CI’s for the mean life expectancy when the female literacy rate is 60 and 80.
Use the predict() function, setting the newdata argument to the \(X^*\) value of interest.
se = TRUE within geom_smooth() plots the CI band for the mean.
Lesson 4: SLR 2
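These CIs can be obtained with predict() (a sketch, assuming the model1 fit from earlier):

```r
# 95% CIs for the mean life expectancy at literacy rates 60 and 80
newdata <- data.frame(FemaleLiteracyRate = c(60, 80))
predict(model1, newdata = newdata, interval = "confidence", level = 0.95)
```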