2026-01-14


Model Selection
Building a model
Selecting variables
Prediction vs interpretation
Comparing potential models
Model Fitting
Find best fit line
Using OLS in this class
Parameter estimation
Categorical covariates
Interactions
Model Evaluation
Model Use (Inference)
We fit a model to Gapminder data with cell phones (per 100 people) as our independent variable and life expectancy as our dependent variable
We used OLS to find the coefficient estimates of our best-fit line
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 60.04 | 2.06 | 29.21 | 0.00 |
| cell_phones_100 | 0.09 | 0.02 | 5.55 | 0.00 |

The (population) regression model is denoted by:
\[Y = \beta_0 + \beta_1X + \epsilon\]
\(\beta_0\) and \(\beta_1\) are unknown population parameters
\(\epsilon\) (epsilon) is the error about the line
It is assumed to be a random variable with a Normal distribution with mean 0 and constant variance \(\sigma^2\), i.e. \(\epsilon \sim N(0, \sigma^2)\)



We need an estimate of the error variance \(\sigma^2\) to perform inference on our coefficients
The variance of the errors is estimated from the residuals by \(\widehat{\sigma}^2\)
\[\widehat{\sigma}^2 = S_{y|x}^2= \frac{1}{n-2}\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 =\frac{1}{n-2}SSE = MSE\]
The standard deviation \(\widehat{\sigma}\) is given in the R output as the Residual standard error
summary() output of the model:
Call:
lm(formula = life_exp ~ cell_phones_100, data = .)
Residuals:
Min 1Q Median 3Q Max
-17.211 -3.268 0.615 3.818 12.449
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.04051 2.05567 29.207 < 2e-16 ***
cell_phones_100 0.09384 0.01692 5.546 2.27e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.964 on 103 degrees of freedom
Multiple R-squared: 0.23, Adjusted R-squared: 0.2225
F-statistic: 30.76 on 1 and 103 DF, p-value: 2.271e-07
\[\begin{aligned} \widehat{\sigma}^2 & = \frac{1}{n-2}SSE\\ 5.964^2 & = \frac{1}{105-2}SSE\\ SSE & = 103 \cdot 5.964^2 = 3663.75 \end{aligned}\]
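This back-calculation can be sketched in R using the values reported above (with the fitted model object itself, `sum(residuals(model)^2)` would give SSE directly):

```r
sigma_hat <- 5.964089  # residual standard error from summary()
n <- 105               # number of observations

# SSE = (n - 2) * sigma-hat^2, since MSE = SSE / (n - 2)
sse <- (n - 2) * sigma_hat^2
sse  # about 3663.75
```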
Using a hypothesis test, determine if there is enough evidence that population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
Population model
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma^2)\)
\(\sigma^2\) is the variance of the residuals
Sample best-fit (least-squares) line
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X \]
Note: Some sources use \(b\) instead of \(\widehat{\beta}\)
We have two options for inference:
Note: R reports p-values for 2-sided tests
Check the assumptions
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
Calculate the test statistic.
Calculate the p-value based on the observed test statistic and its sampling distribution
Write a conclusion to the hypothesis test
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
Calculate the test statistic.
Calculate the p-value
Write a conclusion

\[\text{SE}_{\widehat\beta_1} = \frac{\widehat{\sigma}}{s_X\sqrt{n-1}}\]
\(\text{SE}_{\widehat\beta_1}\) is a measure of variability of the estimate \(\widehat\beta_1\)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.230         0.222  5.96      30.8 0.000000227     1  -335.  677.  685.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
\(\widehat\sigma\) (residual standard error): [1] 5.964089
\(s_X\) (SD of cell_phones_100): [1] 34.56469
\(n\) (sample size): [1] 105
\(\text{SE}_{\widehat\beta_1}\): [1] 0.01691978
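These quantities fit together via the standard-error formula; a sketch using the reported values (with the fitted object, `sigma(model)`, `sd()` on the covariate, and `nobs(model)` would give the same numbers):

```r
sigma_hat <- 5.964089  # residual standard error
s_x       <- 34.56469  # standard deviation of cell_phones_100
n         <- 105       # sample size

# SE of the slope: sigma-hat / (s_X * sqrt(n - 1))
sigma_hat / (s_x * sqrt(n - 1))  # about 0.01692, matching the table
```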

The test statistic for a single coefficient follows a Student’s t-distribution
Single coefficient testing can be done on any coefficient, but it is most useful for continuous covariates or binary covariates
Check the assumptions: We have met the underlying assumptions (checked in our Model Evaluation step)
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
5. Calculate the test statistic
Option 1: Calculate the test statistic using the values in the regression table
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| cell_phones_100 | 0.094 | 0.017 | 5.546 | 0.000 |
Test statistic (estimate / std.error): [1] 5.546063
6. Calculate the p-value
The \(p\)-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true
We know the probability distribution of the test statistic (the null distribution) assuming \(H_0\) is true
Statistical theory tells us that the test statistic \(t\) can be modeled by a \(t\)-distribution with \(df = n-2\).
Option 1: Use pt() and our calculated test statistic
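For example, with our slope test statistic and \(df = n - 2 = 103\), the two-sided p-value is twice the tail area beyond \(|t|\):

```r
t_stat <- 5.546  # slope test statistic from the regression table
n <- 105

# Two-sided p-value: twice the area beyond |t| in one tail
2 * pt(-abs(t_stat), df = n - 2)  # about 2.3e-07
```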
We reject the null hypothesis that the slope is 0 at the \(5\%\) significance level. There is sufficient evidence of an association between life expectancy and number of cell phones per 100 people (p-value < 0.0001).
In our assignments: if you use Option 2, Steps 5 and 6 become one step
Check the assumptions: We have met the underlying assumptions (checked in our Model Evaluation step)
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
5. Calculate the test statistic
Option 1: Calculate the test statistic using the values in the regression table
6. Calculate the p-value
Option 1: Use pt() and our calculated test statistic
We reject the null hypothesis that the intercept is 0 at the \(5\%\) significance level. There is sufficient evidence that the intercept for the association between life expectancy and number of cell phones per 100 people is different from 0 (p-value < 0.0001).
Population model
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma^2)\)
\(\sigma^2\) is the variance of the residuals
Sample best-fit (least-squares) line
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X \]
Note: Some sources use \(b\) instead of \(\widehat{\beta}\)
We have two options for inference:
Note: R reports p-values for 2-sided tests
Recall the general CI formula:
\[\widehat{\beta}_1 \pm t_{\alpha, n-2}^* \cdot SE_{\widehat{\beta}_1}\]
To construct the confidence interval, we need to:
Set our \(\alpha\)-level
Find \(\widehat\beta_1\)
Calculate the \(t_{n-2}^*\)
Calculate \(SE_{\widehat{\beta}_1}\)
\[\widehat{\beta}_1 \pm t^*\cdot SE_{\widehat{\beta}_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n -2\).
Save values needed for CI:
Use formula to calculate each bound
\[\widehat{\beta}_1 \pm t^*\cdot SE_{\widehat{\beta}_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n -2\).
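A sketch of this computation, using the slope estimate and standard error from the regression table:

```r
beta1_hat <- 0.09384  # slope estimate
se_beta1  <- 0.01692  # standard error of the slope
n <- 105

t_star <- qt(0.975, df = n - 2)  # critical value for a 95% CI, about 1.983
beta1_hat + c(-1, 1) * t_star * se_beta1  # about (0.060, 0.127)
```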
When we report our results to someone else, we don’t usually show them our full hypothesis test
Typically, we report the estimate with the confidence interval
Once we have found our CI, we often just write the interpretation of the coefficient estimate:
General statement for population slope inference
For every increase of 1 unit in the \(X\)-variable, there is an expected/average (pick one) increase of \(\widehat\beta_1\) units in the \(Y\)-variable (95% CI: LB, UB).
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 60.041 | 2.056 | 29.207 | 0.000 | 55.964 | 64.117 |
| cell_phones_100 | 0.094 | 0.017 | 5.546 | 0.000 | 0.060 | 0.127 |
General statement for population intercept inference
The expected outcome for the \(Y\)-variable is (\(\widehat\beta_0\)) when the \(X\)-variable is 0 (95% CI: LB, UB).
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 60.041 | 2.056 | 29.207 | 0.000 |
| cell_phones_100 | 0.094 | 0.017 | 5.546 | 0.000 |
\[\widehat{\textrm{life expectancy}} = 60.041 + 0.094 \cdot \textrm{cell phones} \]
\[\widehat{\textrm{life expectancy}} = 60.04051 + 0.09384 \cdot 60 = 65.671\]
(using the unrounded coefficient estimates)
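The same expected value can be computed from the coefficient estimates (a sketch; with the fitted object, `predict()` with a `newdata` data frame returns it directly):

```r
beta0_hat <- 60.04051  # intercept estimate
beta1_hat <- 0.09384   # slope estimate

# Expected life expectancy at 60 cell phones per 100 people
beta0_hat + beta1_hat * 60  # about 65.671
```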
How do we interpret the expected value?
How variable is it?
Recall the population model:
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \varepsilon\] with \(\varepsilon \sim N(0,\sigma^2)\)
\[\widehat{E}[Y|X^*] = \widehat\beta_0 + \widehat\beta_1 X^*\]

\[\widehat{E}[Y|X^*] \pm t_{n-2}^* \cdot SE_{\widehat{E}[Y|X^*]}\]
\[SE_{\widehat{E}[Y|X^*]} = \widehat{\sigma} \sqrt{\frac{1}{n} + \frac{(X^* - \overline{X})^2}{(n-1)s_X^2}}\]
\(t_{n-2}^*\) is found with qt() and depends on the confidence level (\(1-\alpha\))
Find the 95% CI for mean life expectancy at 60 cell phones per 100 people:
\[\begin{align} \widehat{E}[Y|X^*] &\pm t_{n-2}^* \cdot SE_{\widehat{E}[Y|X^*]}\\ 65.671 &\pm 1.983 \cdot \widehat{\sigma} \sqrt{\frac{1}{n} + \frac{(X^* - \overline{X})^2}{(n-1)s_X^2}}\\ 65.671 &\pm 1.983 \cdot 5.964 \sqrt{\frac{1}{105} + \frac{(60 - 116.523)^2}{(105-1)34.565^2}}\\ 65.671 &\pm 1.983 \cdot 1.12\\ 65.671 &\pm 2.22\\ (63.45 &, 67.891) \end{align}\]
Find the 95% CIs for mean life expectancy for 60 and 80 cell phones per 100 people, respectively.
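The hand calculation above can be sketched in R (values taken from the slides; with the fitted object, `predict(model, newdata = data.frame(cell_phones_100 = 60), interval = "confidence")` gives the same interval):

```r
sigma_hat <- 5.964089  # residual standard error
n <- 105               # sample size
x_bar <- 116.523       # mean of cell_phones_100
s_x <- 34.56469        # SD of cell_phones_100
x_star <- 60           # value of X at which we estimate the mean

y_hat <- 60.04051 + 0.09384 * x_star  # estimated mean, about 65.671

# SE of the estimated mean response at X = x_star
se_mean <- sigma_hat * sqrt(1/n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))
y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_mean  # about (63.45, 67.89)
```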
Use the predict() function with a newdata "value"
The newdata value is \(X^*\)
se = TRUE within geom_smooth() plots this CI
Lesson 4: SLR 2