2026-02-18


Model Selection
Building a model
Selecting variables
Prediction vs interpretation
Comparing potential models
Model Fitting
Find best fit line
Using OLS in this class
Parameter estimation
Categorical covariates
Interactions
Model Evaluation
Model Use (Inference)
\[ \begin{aligned} SSE & = \displaystyle\sum^n_{i=1} \widehat\epsilon_i^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - \widehat{Y}_i)^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - (\widehat{\beta}_0 +\widehat{\beta}_1 X_{i1}+ \ldots+\widehat{\beta}_k X_{ik}))^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - \widehat{\beta}_0 -\widehat{\beta}_1 X_{i1}- \ldots-\widehat{\beta}_k X_{ik})^2 \end{aligned}\]
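The SSE definitions above can be checked numerically. This is a minimal R sketch on simulated data: the variables `x1`, `x2`, `y` and their values are made up for illustration and are not the lesson's dataset. It computes the SSE three ways and the results should agree.

```r
# Simulated data (an assumption for illustration only)
set.seed(10)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# (1) Sum of squared residuals
sse_resid <- sum(resid(fit)^2)

# (2) Sum of squared (observed - fitted) values
sse_fitted <- sum((y - fitted(fit))^2)

# (3) Plug the estimated coefficients in directly
b <- coef(fit)
sse_coef <- sum((y - b[1] - b[2] * x1 - b[3] * x2)^2)

c(sse_resid, sse_fitted, sse_coef)
```

For an `lm` fit, `deviance(fit)` returns the same residual sum of squares.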
[L] Linearity of relationship between variables
The mean value of \(Y\) given any combination of \(X_1, X_2, \ldots, X_k\) values, is a linear function of \(\beta_0, \beta_1, \beta_2, \ldots, \beta_k\):
\[\mu_{Y|X_1, \ldots, X_k} = \beta_0 + \beta_1 X_1 + \beta_2 X_2+ \ldots + \beta_k X_k\]
[I] Independence of the \(Y\) values
Observations (\(X_1, X_2, \ldots, X_k, Y\)) are independent from one another
[N] Normality of the \(Y\)’s given \(X\) (residuals)
\(Y\) has a normal distribution for any combination of \(X_1, X_2, \ldots, X_k\) values
[E] Equality of variance of the residuals (homoscedasticity)
The variance of \(Y\) is the same for any combination of \(X_1, X_2, \ldots, X_k\) values
\[\sigma^2_{Y|X_1, X_2, \ldots, X_k} = Var(Y|X_1, X_2, \ldots, X_k) = \sigma^2\]
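Some of the LINE assumptions are commonly examined through the fitted model's residuals. The following is a rough R sketch on simulated data (the variables and values are assumptions for illustration). It draws a residuals-vs-fitted plot and a normal QQ plot with base R graphics. Independence usually has to be judged from the study design rather than from a plot.

```r
# Simulated data (an assumption for illustration only)
set.seed(11)
n  <- 100
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 1 + 0.8 * x1 + 0.3 * x2 + rnorm(n, sd = 2)

fit <- lm(y ~ x1 + x2)
res <- resid(fit)
fv  <- fitted(fit)

# [L] and [E]: look for no trend and a roughly even spread around zero
plot(fv, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# [N]: points should fall close to the reference line
qqnorm(res)
qqline(res)
```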
The square of a \(t\)-distribution with \(df = \nu\) is an \(F\)-distribution with \(df = 1, \nu\)
\[T_{\nu}^2 \sim F_{1,\nu}\]
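The relationship between the \(t\)- and \(F\)-distributions can be verified with R's built-in quantile and probability functions. In this small sketch, \(\nu = 102\) and the observed statistic 2.318 are borrowed from the `vax_rate` row of this lesson's regression table purely as example inputs.

```r
nu <- 102

# Squared two-sided 5% t critical value vs. the upper 5% F critical value
t_crit <- qt(0.975, df = nu)
f_crit <- qf(0.95, df1 = 1, df2 = nu)
c(t_crit^2, f_crit)

# Two-sided t p-value vs. upper-tail F p-value for F = t^2
t_obs <- 2.318
p_t <- 2 * pt(abs(t_obs), df = nu, lower.tail = FALSE)
p_f <- pf(t_obs^2, df1 = 1, df2 = nu, lower.tail = FALSE)
c(p_t, p_f)
```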
We can think about the hypothesis test for the slope…
Null \(H_0\)
\(\beta_1=0\)
Alternative \(H_1\)
\(\beta_1\neq0\)
in a slightly different way…
Null model (\(\beta_1=0\))
Alternative model (\(\beta_1\neq0\))
In multiple linear regression, we can start using this framework to test multiple coefficient parameters at once
Decide whether or not to reject the smaller reduced model in favor of the larger full model
Cannot do this with the t-test!
We can create a hypothesis test for more than one coefficient at a time…
Null \(H_0\)
\(\beta_1=\beta_2=0\)
Alternative \(H_1\)
\(\beta_1\neq0\) and/or \(\beta_2\neq0\)
in a slightly different way…
Null model
Alternative* model
*This is not quite the alternative, but if we reject the null, then this is the model we move forward with
\[\begin{aligned} \sum_{i=1}^n (Y_i - \overline{Y})^2 &= \sum_{i=1}^n (\widehat{Y}_i- \overline{Y})^2 + \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 \\ SSY &= SSR + SSE \end{aligned}\]
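The decomposition above can be checked numerically. This is a minimal R sketch on simulated data: the data and the name `SS_dev` are assumptions, while the column names mirror those used in the plotting code of this lesson. It builds a data frame of the three deviations and confirms that their sums of squares satisfy \(SSY = SSR + SSE\).

```r
# Simulated data (an assumption for illustration only)
set.seed(12)
n <- 80
x <- runif(n, 0, 100)
y <- 60 + 0.1 * x + rnorm(n, sd = 5)

fit <- lm(y ~ x)

# One row per observation, one column per deviation
SS_dev <- data.frame(
  SSY_dev = y - mean(y),            # total deviation
  SSR_dev = fitted(fit) - mean(y),  # deviation explained by the regression
  SSE_dev = y - fitted(fit)         # residual
)

SSY <- sum(SS_dev$SSY_dev^2)
SSR <- sum(SS_dev$SSR_dev^2)
SSE <- sum(SS_dev$SSE_dev^2)

c(SSY = SSY, SSR_plus_SSE = SSR + SSE)
```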
Let’s create a data frame of each component within the SS’s
Using our simple linear regression model as an example:
library(ggplot2)
library(gridExtra)
SSY_plot = ggplot(SS_dev_slr, aes(SSY_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(Y[i] - bar(Y)))
SSR_plot = ggplot(SS_dev_slr, aes(SSR_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(hat(Y)[i] - bar(Y)))
SSE_plot = ggplot(SS_dev_slr, aes(SSE_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(Y[i] - hat(Y)[i]))
grid.arrange(SSY_plot, SSR_plot, SSE_plot, nrow = 3)
\[SSY = \sum_{i=1}^n (Y_i - \overline{Y})^2 = 45.75\]
\[SSR = \sum_{i=1}^n (\widehat{Y}_i- \overline{Y})^2 = 10.52\]
\[SSE =\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = 35.23\]
Let’s create a data frame of each component within the SS’s
Using our multiple linear regression model as an example:
SSY_plot = ggplot(SS_df, aes(SSY_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(Y[i] - bar(Y)))
SSR_plot = ggplot(SS_df, aes(SSR_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(hat(Y)[i] - bar(Y)))
SSE_plot = ggplot(SS_df, aes(SSE_dev)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35) + xlab(expression(Y[i] - hat(Y)[i]))
grid.arrange(SSY_plot, SSR_plot, SSE_plot, nrow = 3)
\[SSY = \sum_{i=1}^n (Y_i - \overline{Y})^2 = 45.75\]
\[SSR = \sum_{i=1}^n (\widehat{Y}_i- \overline{Y})^2 = 12.28\]
\[SSE =\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = 33.47\]
| | Simple Linear Regression | Multiple Linear Regression |
|---|---|---|
| \(SSY\) | 45.75 | 45.75 |
| \(SSR\) | 10.52 | 12.28 |
| \(SSE\) | 35.23 | 33.47 |
\[ F = \dfrac{\frac{SSE_{red} - SSE_{full}}{df_{red} - df_{full}}}{\frac{SSE_{full}}{df_{full}}} \]
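The F-statistic formula above translates directly into a small helper function. In this R sketch, `f_stat()` is a hypothetical name introduced here and the data are simulated. The hand-computed value is compared with the one reported by `anova()` for two nested `lm` fits.

```r
# Hypothetical helper: F statistic from the SSE and residual df of two nested models
f_stat <- function(sse_red, df_red, sse_full, df_full) {
  ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
}

# Simulated data (an assumption for illustration only)
set.seed(13)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.4 * x2 + rnorm(n)

red  <- lm(y ~ x1)        # reduced model
full <- lm(y ~ x1 + x2)   # full model

f_manual <- f_stat(deviance(red), df.residual(red),
                   deviance(full), df.residual(full))
f_anova  <- anova(red, full)$F[2]

c(f_manual, f_anova)
```

Here `deviance()` gives each model's SSE and `df.residual()` its residual degrees of freedom.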
New population model for example:
\[\text{LE} = \beta_0 + \beta_1 \text{CP} + \beta_2 \text{VR} + \epsilon\]
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 46.833 | 6.042 | 7.751 | 0.000 | 34.848 | 58.818 |
| cell_phones_100 | 0.075 | 0.018 | 4.074 | 0.000 | 0.039 | 0.112 |
| vax_rate | 0.168 | 0.073 | 2.318 | 0.022 | 0.024 | 0.312 |
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{LE}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{CP} + \widehat{\beta}_2 \text{VR} \\ \widehat{\text{LE}} &= 46.833 + 0.075 \ \text{CP} + 0.168\ \text{VR} \end{aligned}\]
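A coefficient table like the one above can be assembled in base R. This is a sketch only: the data are simulated (so the numbers will not match the lesson's table), and the column names `life_exp`, `cell_phones_100`, and `vax_rate` are reused from the lesson's output for readability. The example covariate values for `new_country` are hypothetical.

```r
# Simulated data (an assumption for illustration only)
set.seed(14)
n <- 105
dat <- data.frame(
  cell_phones_100 = runif(n, 20, 180),
  vax_rate        = runif(n, 40, 99)
)
dat$life_exp <- 47 + 0.08 * dat$cell_phones_100 + 0.17 * dat$vax_rate +
  rnorm(n, sd = 6)

fit <- lm(life_exp ~ cell_phones_100 + vax_rate, data = dat)

# Estimates, standard errors, t statistics, p-values, and 95% confidence limits
coef_table <- cbind(coef(summary(fit)), confint(fit, level = 0.95))
round(coef_table, 3)

# Fitted value for a hypothetical country
new_country <- data.frame(cell_phones_100 = 100, vax_rate = 90)
pred <- predict(fit, newdata = new_country)
pred
```

The prediction is the fitted equation evaluated at the new covariate values: \(\widehat{\beta}_0 + \widehat{\beta}_1 \cdot 100 + \widehat{\beta}_2 \cdot 90\).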
Overall test
Does at least one of the covariates/predictors contribute significantly to the prediction of Y?
Test for addition of a single variable’s coefficient (covariate subset test)
Does the addition of one particular covariate (with a single coefficient) add significantly to the prediction of Y achieved by other covariates already present in the model?
Test for addition of a group of variables’ coefficients (covariate subset test)
Does the addition of some group of covariates (or one covariate with multiple coefficients) add significantly to the prediction of Y achieved by other covariates already present in the model?
Does at least one of the covariates/predictors contribute significantly to the prediction of Y?
We can create a hypothesis test for all the covariate coefficients…
Null \(H_0\)
\(\beta_1=\beta_2= \ldots=\beta_k=0\)
Alternative \(H_1\)
At least one \(\beta_j\neq0\) (for \(j=1, 2, \ldots, k\))
Null / Smaller / Reduced model
\(Y = \beta_0 + \epsilon\)
Alternative / Larger / Full model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \epsilon\)
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}} = \frac{MSR_{full}}{MSE_{full}}\]
Calculate the p-value
Write a conclusion
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/ insufficient) evidence that at least one of the coefficients is not 0 (p-value = \(P(F_{k, n-k-1} > F)\)).
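The overall F-test steps above can be sketched in R by comparing an intercept-only model with the full model. The data below are simulated (an assumption; results will differ from the lesson's), with variable names borrowed from the lesson's example. The sketch also shows that the same statistic is stored in the full model's `summary()`.

```r
# Simulated data (an assumption for illustration only)
set.seed(15)
n <- 105
dat <- data.frame(
  cell_phones_100 = runif(n, 20, 180),
  vax_rate        = runif(n, 40, 99)
)
dat$life_exp <- 47 + 0.08 * dat$cell_phones_100 + 0.17 * dat$vax_rate +
  rnorm(n, sd = 6)

null_fit <- lm(life_exp ~ 1, data = dat)                           # reduced model
full_fit <- lm(life_exp ~ cell_phones_100 + vax_rate, data = dat)  # full model

# Nested-model comparison: reports the F statistic and p-value
overall <- anova(null_fit, full_fit)
overall

# Same test from the model summary: named vector (value, numdf, dendf)
fs <- summary(full_fit)$fstatistic
p_value <- pf(fs["value"], df1 = fs["numdf"], df2 = fs["dendf"],
              lower.tail = FALSE)
p_value
```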
Our proposed population model
\[\text{LE} = \beta_0 + \beta_1 \text{CP} + \beta_2 \text{VR} + \epsilon\]
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{LE}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{CP} + \widehat{\beta}_2 \text{VR} \\ \widehat{\text{LE}} &= 46.833 + 0.075 \ \text{CP} + 0.168\ \text{VR} \end{aligned}\]
Our main question for the Overall F-test: Is the regression model containing cell phones and vaccination rate useful in estimating countries’ life expectancy?
Null / Smaller / Reduced model
\(LE = \beta_0 + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 CP + \beta_2 VR + \epsilon\)
Reduced / null model
\[LE = \beta_0 + \epsilon\]

\[SSY = 45.75\]
\[SSR = 0\]
\[SSE = 45.75\]
Full / Alternative model
\[LE = \beta_0 + \beta_1 CP + \beta_2 VR + \epsilon\]

\[SSY = 45.75\]
\[SSR = 12.28\]
\[SSE = 33.47\]
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}=44.443\]
OR use ANOVA table:
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that either countries’ number of cell phones or vaccination rate (or both) contributes significantly to the prediction of life expectancy (p-value < 0.001).
Does the addition of one particular covariate of interest (a numeric covariate with only one coefficient) add significantly to the prediction of Y achieved by other covariates already present in the model?
We can create a hypothesis test for a single \(j\) covariate coefficient (where \(j\) can be any value \(1, 2, \ldots, k\))…
Null \(H_0\)
\(\beta_j=0\)
Alternative \(H_1\)
\(\beta_j\neq0\)
Null / Smaller / Reduced model
\(Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \ldots + \beta_k X_k + \epsilon\) (all covariates except \(X_j\))
Alternative / Larger / Full model
\(\begin{aligned}Y = &\beta_0 + \beta_1 X_1 + \ldots + \beta_j X_j +\\ &\ldots + \beta_k X_k + \epsilon \end{aligned}\)
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
We are generally calculating: \(P(F_{1, n-k-1} > F)\)
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence that predictor/covariate \(j\) significantly improves the prediction of Y, given all the other covariates are in the model (p-value = \(P(F_{1, n-k-1} > F)\)).
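The single-covariate subset test can be sketched in R as a comparison of two nested models. The data are simulated (an assumption), with variable names borrowed from the lesson's example. Because only one coefficient is tested, the F statistic should equal the square of that coefficient's t statistic in the full model.

```r
# Simulated data (an assumption for illustration only)
set.seed(16)
n <- 105
dat <- data.frame(
  cell_phones_100 = runif(n, 20, 180),
  vax_rate        = runif(n, 40, 99)
)
dat$life_exp <- 47 + 0.08 * dat$cell_phones_100 + 0.17 * dat$vax_rate +
  rnorm(n, sd = 6)

red_fit  <- lm(life_exp ~ cell_phones_100, data = dat)             # without vax_rate
full_fit <- lm(life_exp ~ cell_phones_100 + vax_rate, data = dat)  # with vax_rate

# F-test for adding vax_rate, given cell_phones_100 is already in the model
single <- anova(red_fit, full_fit)
single

# t statistic for vax_rate in the full model
t_vr <- coef(summary(full_fit))["vax_rate", "t value"]
c(t_squared = t_vr^2, F = single$F[2])
```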
Our proposed population model
\[\text{LE} = \beta_0 + \beta_1 \text{CP} + \beta_2 \text{VR} + \epsilon\]
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{LE}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{CP} + \widehat{\beta}_2 \text{VR} \\ \widehat{\text{LE}} &= 46.833 + 0.075 \ \text{CP} + 0.168\ \text{VR} \end{aligned}\]
Our main question for the single covariate subset F-test: Does the regression model containing vaccination rate improve the estimation of countries’ life expectancy, given cell phones per 100 people is already in the model?
Null / Smaller / Reduced model
\(LE = \beta_0 + \beta_1 CP + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 CP + \beta_2 VR + \epsilon\)
Reduced / null model
\[LE = \beta_0 + \beta_1 CP + \epsilon\]

\[SSY = 45.75\]
\[SSR = 10.52\]
\[SSE = 35.23\]
Full / Alternative model
\[LE = \beta_0 + \beta_1 CP + \beta_2 VR + \epsilon\]

\[SSY = 45.75\]
\[SSR = 12.28\]
\[SSE = 33.47\]
Check the assumptions
Set the level of significance
Often we use \(\alpha = 0.05\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
ANOVA table:
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| life_exp ~ cell_phones_100 | 103.000 | 3,663.747 | NA | NA | NA | NA |
| life_exp ~ cell_phones_100 + vax_rate | 102.000 | 3,480.371 | 1.000 | 183.376 | 5.374 | 0.022 |
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that countries’ vaccination rate contributes significantly to the prediction of life expectancy, given that cell phones per 100 people is already in the model (p-value = 0.022).
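As a worked check, the F statistic in the ANOVA table can be reproduced by plugging the table's `rss` and `df.residual` values into the formula:

\[F = \dfrac{\frac{3663.747 - 3480.371}{103 - 102}}{\frac{3480.371}{102}} = \dfrac{183.376}{34.121} \approx 5.374\]

This is compared against an \(F_{1, 102}\) distribution, which gives the p-value of 0.022 shown in the table. Because a single coefficient is tested, the result agrees (up to rounding) with the square of the `vax_rate` t statistic from the regression table: \(2.318^2 \approx 5.373\).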
Does the addition of some group of covariates of interest (or a multi-level categorical variable) add significantly to the prediction of Y obtained through other independent variables already present in the model?
We can create a hypothesis test for a group of covariate coefficients (subset of many)… For example…
Null \(H_0\)
\(\beta_1=\beta_3 =0\) (this can be any coefficients)
Alternative \(H_1\)
At least one \(\beta_j\neq0\) (for \(j=1,3\))
Null / Smaller / Reduced model
\(Y = \beta_0 + \beta_2 X_2 + \epsilon\)
Alternative / Larger / Full model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3+\epsilon\)
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
For example:
\[\begin{aligned} H_0 &: \beta_1 = \beta_3 = 0\\ \text{vs. } H_A&: \text{At least one } \beta_j\neq0, \text{for }j=1,3 \end{aligned}\]
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
We are generally calculating: \(P(F_{df_R - df_F,\ df_F} > F)\), which is \(P(F_{2, n-k-1} > F)\) when two coefficients are tested
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence that predictors/covariates \(1,3\) significantly improve the prediction of Y, given all the other covariates are in the model (p-value = \(P(F_{2, n-k-1} > F)\)).
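A group subset test follows the same nested-model pattern, with a numerator degrees of freedom equal to the number of coefficients tested. This is a minimal R sketch on simulated data: the data are an assumption, the variable names are borrowed from this lesson's example, and it tests two coefficients at once.

```r
# Simulated data (an assumption for illustration only)
set.seed(17)
n <- 105
dat <- data.frame(
  cell_phones_100 = runif(n, 20, 180),
  vax_rate        = runif(n, 40, 99),
  basic_sani      = runif(n, 10, 100)
)
dat$life_exp <- 40 + 0.05 * dat$cell_phones_100 + 0.10 * dat$vax_rate +
  0.15 * dat$basic_sani + rnorm(n, sd = 5)

red_fit  <- lm(life_exp ~ cell_phones_100, data = dat)
full_fit <- lm(life_exp ~ cell_phones_100 + vax_rate + basic_sani, data = dat)

# F-test for adding vax_rate and basic_sani together
grp <- anova(red_fit, full_fit)
grp

# p-value recomputed from the F distribution with (df_R - df_F, df_F) df
p_manual <- pf(grp$F[2], df1 = grp$Df[2], df2 = grp$Res.Df[2],
               lower.tail = FALSE)
p_manual
```

Unlike the single-coefficient case, this test has no equivalent t-test, because two coefficients are set to zero at once.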
Our proposed population model to include percent access to basic sanitation (BS):
\[\text{LE} = \beta_0 + \beta_1 \text{CP} + \beta_2 \text{VR} + \beta_3 BS + \epsilon\]
Our main question for the group covariate subset F-test: Does the regression model containing vaccination rate and basic sanitation percent improve the estimation of countries’ life expectancy, given cell phones per 100 people is already in the model?
Null / Smaller / Reduced model
\(LE = \beta_0 + \beta_1 CP + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 CP + \beta_2 VR + \beta_3 BS + \epsilon\)
Reduced / null model
\[LE = \beta_0 + \beta_1 CP + \epsilon\]

\[SSY = 45.75\]
\[SSR = 10.52\]
\[SSE = 35.23\]
Full / Alternative model
\[LE = \beta_0 + \beta_1 CP + \beta_2 VR + \beta_3 BS + \epsilon\]

\[SSY = 45.75\]
\[SSR = 24.47\]
\[SSE = 21.28\]
Check the assumptions
Set the level of significance
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Specify the test statistic and its distribution under the null
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
ANOVA table:
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| life_exp ~ cell_phones_100 | 103.000 | 3,663.747 | NA | NA | NA | NA |
| life_exp ~ cell_phones_100 + vax_rate + basic_sani | 101.000 | 2,213.194 | 2.000 | 1,450.552 | 33.098 | 0.000 |
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that countries’ vaccination rate or basic sanitation (or both) contribute significantly to the prediction of life expectancy, given that cell phones per 100 people is already in the model (p-value < 0.001).
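As a worked check, the F statistic in the ANOVA table follows from the table's `sumsq`, `rss`, and degrees-of-freedom values:

\[F = \dfrac{\frac{1450.552}{103 - 101}}{\frac{2213.194}{101}} = \dfrac{725.276}{21.913} \approx 33.098\]

Two coefficients are tested, so the statistic is compared against an \(F_{2, 101}\) distribution.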
Single covariate subset F-test
Group covariate subset F-test
Lesson 10: MLR 2