2024-02-07
Model Selection
Building a model
Selecting variables
Prediction vs interpretation
Comparing potential models
Model Fitting
Find best fit line
Using OLS in this class
Parameter estimation
Categorical covariates
Interactions
Model Evaluation
Model Use (Inference)
General interpretation for \(\widehat{\beta}_0\)
The expected \(Y\)-variable is \(\widehat\beta_0\) \(Y\)-units when the \(X_1\)-variable is 0 \(X_1\)-units and the \(X_2\)-variable is 0 \(X_2\)-units (95% CI: LB, UB).
General interpretation for \(\widehat{\beta}_1\)
For every increase of 1 \(X_1\)-unit in the \(X_1\)-variable, adjusting/controlling for the \(X_2\)-variable, there is an expected increase/decrease of \(|\widehat\beta_1|\) units in the \(Y\)-variable (95% CI: LB, UB).
General interpretation for \(\widehat{\beta}_2\)
For every increase of 1 \(X_2\)-unit in the \(X_2\)-variable, adjusting/controlling for the \(X_1\)-variable, there is an expected increase/decrease of \(|\widehat\beta_2|\) units in the \(Y\)-variable (95% CI: LB, UB).
We fit the regression model in R and printed the regression table:
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 33.595 | 4.472 | 7.512 | 0.000 | 24.674 | 42.517 |
FemaleLiteracyRate | 0.157 | 0.032 | 4.873 | 0.000 | 0.093 | 0.221 |
FoodSupplykcPPD | 0.008 | 0.002 | 4.726 | 0.000 | 0.005 | 0.012 |
Fitted multiple regression model: \(\widehat{\text{LE}} = 33.595 + 0.157 \text{ FLR} + 0.008 \text{ FS}\)
Interpretation for \(\widehat{\beta}_0\)
The expected life expectancy is 33.595 years when the female literacy rate is 0% and food supply is 0 kcal PPD (95% CI: 24.674, 42.517).
Interpretation for \(\widehat{\beta}_1\)
For every 1% increase in the female literacy rate, adjusting for food supply, there is an expected increase of 0.157 years in life expectancy (95% CI: 0.093, 0.221).
Interpretation for \(\widehat{\beta}_2\)
For every 1 kcal PPD increase in the food supply, adjusting for female literacy rate, there is an expected increase of 0.008 years in life expectancy (95% CI: 0.005, 0.012).
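As a quick check on these interpretations, the sketch below plugs a hypothetical country into the fitted equation and rebuilds the 95% CI for \(\widehat{\beta}_1\) from its estimate and standard error. The covariate values (80% literacy, 2500 kcal PPD) are made up for illustration, and the 69 residual degrees of freedom assume \(n = 72\) countries and 2 covariates. Because the table entries are rounded, results agree with the table only approximately.

```r
# Estimates and standard error copied from the regression table (rounded)
b0 <- 33.595
b1 <- 0.157
b2 <- 0.008
se_b1 <- 0.032

# Hypothetical country, values chosen only for illustration
flr <- 80    # female literacy rate (%)
fs  <- 2500  # food supply (kcal PPD)

# Expected life expectancy from the fitted model
le_hat <- b0 + b1 * flr + b2 * fs
le_hat  # 33.595 + 12.56 + 20 = 66.155 years

# 95% CI for beta_1: estimate +/- t quantile * SE, assuming n - k - 1 = 69 df
t_crit <- qt(0.975, df = 69)
ci_b1 <- b1 + c(-1, 1) * t_crit * se_b1
round(ci_b1, 3)  # approximately 0.093 and 0.221, matching the table
```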
Units of Y
Units of X
Discussing intercept: Mean or average or expected before Y
Discussing coefficient for continuous covariate: Mean or average or expected before difference, increase, or decrease
Confidence interval
If other covariates in the model
Discussing intercept: Must state that variables are equal to 0
Discussing coefficient for covariate: Must state “adjusting for all other variables”, “Controlling for all other variables”, or “Holding all other variables constant”
The square of a \(t\)-distribution with \(df = \nu\) is an \(F\)-distribution with \(df = 1, \nu\)
\[T_{\nu}^2 \sim F_{1,\nu}\]
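The relationship can be checked numerically with base R's distribution functions. This is a minimal sketch: \(\nu = 69\) and the observed statistic 2.5 are arbitrary illustrative values.

```r
nu <- 69  # illustrative degrees of freedom

# Squared two-sided t critical value equals the upper-tail F critical value
t_crit <- qt(0.975, df = nu)
f_crit <- qf(0.95, df1 = 1, df2 = nu)
all.equal(t_crit^2, f_crit)

# Two-sided t p-value equals the upper-tail F p-value of the squared statistic
t_obs <- 2.5  # arbitrary test statistic for illustration
p_t <- 2 * pt(abs(t_obs), df = nu, lower.tail = FALSE)
p_f <- pf(t_obs^2, df1 = 1, df2 = nu, lower.tail = FALSE)
all.equal(p_t, p_f)
```

This is why a two-sided t-test of one coefficient and the F-test comparing models with and without that coefficient give the same p-value.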
We can think about the hypothesis test for the slope…
Null \(H_0\)
\(\beta_1=0\)
Alternative \(H_1\)
\(\beta_1\neq0\)
in a slightly different way…
Null model (\(\beta_1=0\))
Alternative model (\(\beta_1\neq0\))
In multiple linear regression, we can start using this framework to test multiple coefficient parameters at once
Decide whether or not to reject the smaller reduced model in favor of the larger full model
Cannot do this with the t-test!
We can create a hypothesis test for more than one coefficient at a time…
Null \(H_0\)
\(\beta_1=\beta_2=0\)
Alternative \(H_1\)
\(\beta_1\neq0\) and/or \(\beta_2\neq0\)
in a slightly different way…
Null model
Alternative* model
*This is not quite the alternative, but if we reject the null, then this is the model we move forward with
Overall test
Does at least one of the covariates/predictors contribute significantly to the prediction of Y?
Test for addition of a single variable (covariate subset test)
Does the addition of one particular covariate add significantly to the prediction of Y achieved by other covariates already present in the model?
Test for addition of group of variables (covariate subset test)
Does the addition of some group of covariates add significantly to the prediction of Y achieved by other covariates already present in the model?
\[\begin{aligned} \sum_{i=1}^n (Y_i - \overline{Y})^2 &= \sum_{i=1}^n (\widehat{Y}_i- \overline{Y})^2 + \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 \\ SSY &= SSR + SSE \end{aligned}\]
Let’s create a data frame of each component within the SS’s
Using our simple linear regression model as an example:
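library(tidyverse)  # %>%, select(), mutate(), ggplot()
library(broom)      # augment()
library(gridExtra)  # grid.arrange()

# Simple linear regression of life expectancy on female literacy rate
slr1 = lm(LifeExpectancyYrs ~ FemaleLiteracyRate, data = gapm_sub)
aug_slr1 = augment(slr1)  # adds .fitted and .resid for each observation

# Per-observation pieces of each sum of squares (before squaring)
SS_df = gapm_sub %>% select(LifeExpectancyYrs) %>%
  mutate(SSY_diff = LifeExpectancyYrs - mean(LifeExpectancyYrs),  # Y_i - Ybar
         y_fit = aug_slr1$.fitted,
         SSR_diff = y_fit - mean(LifeExpectancyYrs),              # Yhat_i - Ybar
         SSE_diff = aug_slr1$.resid)                              # Y_i - Yhat_i

# Histograms of each component on common axes
SSY_plot = ggplot(SS_df, aes(SSY_diff)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35)
SSR_plot = ggplot(SS_df, aes(SSR_diff)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35)
SSE_plot = ggplot(SS_df, aes(SSE_diff)) + geom_histogram() + xlim(-30, 30) + ylim(0, 35)
grid.arrange(SSY_plot, SSR_plot, SSE_plot, nrow = 3)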
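\[SSY = \sum_{i=1}^n (Y_i - \overline{Y})^2 = 4589.119\]
\[SSR = \sum_{i=1}^n (\widehat{Y}_i- \overline{Y})^2 = 1934.244\]
\[SSE =\sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = 2654.875\]
Dividing each sum by \(n - 1 = 71\) gives the variance-scale values 64.64, 27.24, and 37.39.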
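The decomposition \(SSY = SSR + SSE\) holds for any least-squares fit that includes an intercept. The sketch below checks it on simulated data, since the course data set is not reproduced here. The variable names and the data-generating values are illustrative assumptions.

```r
set.seed(1)

# Simulated data standing in for a response and one covariate
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)

# Three sums of squares, computed directly from their definitions
ssy <- sum((y - mean(y))^2)            # total variation
ssr <- sum((fitted(fit) - mean(y))^2)  # variation explained by the regression
sse <- sum(resid(fit)^2)               # residual variation

all.equal(ssy, ssr + sse)
```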
\[ F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}} \]
New population model for example:
\[\text{Life expectancy} = \beta_0 + \beta_1 \text{Female literacy rate} + \beta_2 \text{Food supply} + \epsilon\]
# Fit regression model:
mr1 <- lm(LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD,
          data = gapm_sub)
# Print the coefficient table with 95% confidence intervals
tidy(mr1, conf.int = TRUE) %>% gt() %>%
  tab_options(table.font.size = 35) %>% fmt_number(decimals = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 33.595 | 4.472 | 7.512 | 0.000 | 24.674 | 42.517 |
FemaleLiteracyRate | 0.157 | 0.032 | 4.873 | 0.000 | 0.093 | 0.221 |
FoodSupplykcPPD | 0.008 | 0.002 | 4.726 | 0.000 | 0.005 | 0.012 |
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{Life expectancy}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{Female literacy rate} + \widehat{\beta}_2 \text{Food supply} \\ \widehat{\text{Life expectancy}} &= 33.595 + 0.157\ \text{Female literacy rate} + 0.008\ \text{Food supply} \end{aligned}\]
Does at least one of the covariates/predictors contribute significantly to the prediction of Y?
We can create a hypothesis test for all the covariate coefficients…
Null \(H_0\)
\(\beta_1=\beta_2= \ldots=\beta_k=0\)
Alternative \(H_1\)
At least one \(\beta_j\neq0\) (for \(j=1, 2, \ldots, k\))
Null / Smaller / Reduced model
\(Y = \beta_0 + \epsilon\)
Alternative / Larger / Full model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \epsilon\)
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df=k\) and denominator \(df=n-k-1\). (\(n\) = # observations, \(k\) = # covariates)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}} = \frac{MSR_{full}}{MSE_{full}}\]
We are generally calculating: \(P(F_{k, n-k-1} > F)\)
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence that at least one predictor’s coefficient is not 0 (p-value = \(P(F_{k, n-k-1} > F)\)).
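The steps above can be sketched in base R on simulated data. The data, coefficients, and the names `mod_red` and `mod_full` are illustrative assumptions, not the course models. The sketch shows that comparing the intercept-only model to the full model with `anova()` reproduces the overall F statistic reported by `summary()`, and that the p-value is the upper tail of \(F_{k, n-k-1}\).

```r
set.seed(2)

# Simulated data with k = 2 covariates
n <- 72
k <- 2
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)

mod_red  <- lm(y ~ 1)        # reduced model: intercept only
mod_full <- lm(y ~ x1 + x2)  # full model: all covariates

# Nested-model comparison: row 2 holds the F statistic and p-value
cmp <- anova(mod_red, mod_full)
f_anova <- cmp$F[2]

# Overall F statistic stored in the model summary
f_summary <- unname(summary(mod_full)$fstatistic["value"])
all.equal(f_anova, f_summary)

# p-value as the upper tail of F with (k, n - k - 1) degrees of freedom
p_val <- pf(f_anova, df1 = k, df2 = n - k - 1, lower.tail = FALSE)
all.equal(p_val, cmp$`Pr(>F)`[2])
```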
Our proposed population model
\[\text{LE} = \beta_0 + \beta_1 \text{FLR} + \beta_2 \text{FS} + \epsilon\]
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{LE}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{FLR} + \widehat{\beta}_2 \text{FS} \\ \widehat{\text{LE}} &= 33.595 + 0.157\ \text{FLR} + 0.008\ \text{FS} \end{aligned}\]
Our main question for the Overall F-test: Is the regression model containing female literacy rate and food supply useful in estimating countries’ life expectancy?
Null / Smaller / Reduced model
\(LE = \beta_0 + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \epsilon\)
# Reduced (intercept-only) model
mod_red1 = lm(LifeExpectancyYrs ~ 1, data = gapm_sub)
aug_red1 = augment(mod_red1)
# Full model with both covariates
mod_full1 = lm(LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD,
               data = gapm_sub)
aug_full1 = augment(mod_full1)
# Per-observation components of SSY, SSR, SSE (_r1 = reduced, _f1 = full)
SS_df2 = gapm_sub %>% select(LifeExpectancyYrs) %>%
  mutate(SSY_diff_r1 = LifeExpectancyYrs - mean(LifeExpectancyYrs),
         SSR_diff_r1 = aug_red1$.fitted - mean(LifeExpectancyYrs),
         SSE_diff_r1 = aug_red1$.resid,
         SSY_diff_f1 = LifeExpectancyYrs - mean(LifeExpectancyYrs),
         SSR_diff_f1 = aug_full1$.fitted - mean(LifeExpectancyYrs),
         SSE_diff_f1 = aug_full1$.resid)
Reduced / null model \[LE = \beta_0 + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 0\]
\[SSE = 4589.119\]
Full / Alternative model \[LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 2583.563\]
\[SSE = 2005.556\]
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df=k =2\) and denominator \(df=n-k-1 = 72 - 2-1=69\). (\(n\) = # observations, \(k\) = # covariates)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}=44.443\]
OR use ANOVA table:
anova(mod_red1, mod_full1) %>% tidy() %>% gt() %>% tab_options(table.font.size = 35) %>% fmt_number(decimals = 3)
term | df.residual | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|---|
LifeExpectancyYrs ~ 1 | 71.000 | 4,589.119 | NA | NA | NA | NA |
LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD | 69.000 | 2,005.556 | 2.000 | 2,583.563 | 44.443 | 0.000 |
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that either countries’ female literacy rate or the food supply (or both) contributes significantly to the prediction of life expectancy (p-value < 0.001).
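The F statistic in the ANOVA table can be reproduced by hand from its `rss` and `df.residual` columns. The helper name `f_nested` is hypothetical, introduced here only for illustration.

```r
# F statistic for comparing a reduced model (R) to a full model (F)
# (hypothetical helper, not part of any package)
f_nested <- function(sse_r, sse_f, df_r, df_f) {
  ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
}

# Values copied from the ANOVA table for the overall test
f_overall <- f_nested(sse_r = 4589.119, sse_f = 2005.556,
                      df_r = 71, df_f = 69)
round(f_overall, 3)  # 44.443

# p-value: upper tail of the F distribution with (2, 69) degrees of freedom
p_overall <- pf(f_overall, df1 = 2, df2 = 69, lower.tail = FALSE)
p_overall < 0.001
```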
Does the addition of one particular covariate of interest add significantly to the prediction of Y achieved by other covariates already present in the model?
We can create a hypothesis test for the coefficient of a single covariate \(j\) (where \(j\) can be any value \(1, 2, \ldots, k\))…
Null \(H_0\)
\(\beta_j=0\)
Alternative \(H_1\)
\(\beta_j\neq0\)
Null / Smaller / Reduced model
\(Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \ldots + \beta_k X_k + \epsilon\) (all covariates except \(X_j\))
Alternative / Larger / Full model
\(\begin{aligned}Y = &\beta_0 + \beta_1 X_1 + \ldots + \beta_j X_j +\\ &\ldots + \beta_k X_k + \epsilon \end{aligned}\)
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df=1\) (one coefficient is tested) and denominator \(df=n-k-1\). (\(n\) = # observations, \(k\) = # covariates in the full model)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
We are generally calculating: \(P(F_{1, n-k-1} > F)\)
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence that predictor/covariate \(j\) significantly improves the prediction of Y, given all the other covariates are in the model (p-value = \(P(F_{1, n-k-1} > F)\)).
Our proposed population model
\[\text{LE} = \beta_0 + \beta_1 \text{FLR} + \beta_2 \text{FS} + \epsilon\]
Fitted multiple regression model:
\[\begin{aligned} \widehat{\text{LE}} &= \widehat{\beta}_0 + \widehat{\beta}_1 \text{FLR} + \widehat{\beta}_2 \text{FS} \\ \widehat{\text{LE}} &= 33.595 + 0.157\ \text{FLR} + 0.008\ \text{FS} \end{aligned}\]
Our main question for the single covariate subset F-test: Does adding food supply to the regression model improve the estimation of countries’ life expectancy, given that female literacy rate is already in the model?
Null / Smaller / Reduced model
\(LE = \beta_0 + \beta_1 FLR + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \epsilon\)
Reduced / null model \[LE = \beta_0 + \beta_1 FLR + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 1934.244\]
\[SSE = 2654.875\]
Full / Alternative model \[LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 2583.563\]
\[SSE = 2005.556\]
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df=1\) (one coefficient is tested) and denominator \(df=n-k-1 = 72 - 2-1=69\). (\(n\) = # observations, \(k\) = # covariates in the full model)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}=22.339\]
ANOVA table:
anova(mod_red2, mod_full2) %>% tidy() %>% gt() %>% tab_options(table.font.size = 35) %>% fmt_number(decimals = 3)
term | df.residual | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|---|
LifeExpectancyYrs ~ FemaleLiteracyRate | 70.000 | 2,654.875 | NA | NA | NA | NA |
LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD | 69.000 | 2,005.556 | 1.000 | 649.319 | 22.339 | 0.000 |
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that countries’ food supply contributes significantly to the prediction of life expectancy, given that female literacy rate is already in the model (p-value < 0.001).
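Because only one coefficient is tested, this F-test is equivalent to the two-sided t-test for food supply in the regression table. The sketch below recomputes F from the ANOVA table and compares it with the squared t statistic (4.726) from the regression table. The two agree only up to the rounding of the printed values.

```r
# Values copied from the ANOVA table for the single covariate test
sse_red  <- 2654.875  # LE ~ FLR
sse_full <- 2005.556  # LE ~ FLR + FS
df_red   <- 70
df_full  <- 69

f_fs <- ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
round(f_fs, 3)  # 22.339

# t statistic for FoodSupplykcPPD from the regression table (rounded)
t_fs <- 4.726
t_fs^2  # about 22.335, equal to F up to rounding of the printed t

# p-value from F with (1, 69) degrees of freedom
p_fs <- pf(f_fs, df1 = 1, df2 = df_full, lower.tail = FALSE)
p_fs < 0.001
```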
Does the addition of some group of covariates of interest add significantly to the prediction of Y obtained through other independent variables already present in the model?
We can create a hypothesis test for a group of covariate coefficients (subset of many)… For example…
Null \(H_0\)
\(\beta_1=\beta_3 =0\) (this can be any coefficients)
Alternative \(H_1\)
At least one \(\beta_j\neq0\) (for \(j=1,3\))
Null / Smaller / Reduced model
\(Y = \beta_0 + \beta_2 X_2 + \epsilon\)
Alternative / Larger / Full model
\(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3+\epsilon\)
For example:
\[\begin{aligned} H_0 &: \beta_1 = \beta_3 = 0\\ \text{vs. } H_A&: \text{At least one } \beta_j\neq0, \text{for }j=1,3 \end{aligned}\]
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df\) equal to the number of coefficients tested (\(df_R - df_F\), here 2) and denominator \(df=n-k-1\). (\(n\) = # observations, \(k\) = # covariates in the full model)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}\]
We are generally calculating: \(P(F_{df_R - df_F,\, n-k-1} > F)\)
We (reject/fail to reject) the null hypothesis at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence that predictors/covariates \(1,3\) significantly improve the prediction of Y, given all the other covariates are in the model (p-value = \(P(F_{df_R - df_F,\, n-k-1} > F)\)).
Our proposed population model to include water source percent (WS):
\[\text{LE} = \beta_0 + \beta_1 \text{FLR} + \beta_2 \text{FS} + \beta_3 WS + \epsilon\]
Our main question for the group covariate subset F-test: Does adding food supply and water source percent to the regression model improve the estimation of countries’ life expectancy, given that female literacy rate is already in the model?
Null / Smaller / Reduced model
\(LE = \beta_0 + \beta_1 FLR + \epsilon\)
Alternative / Larger / Full model
\(LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \beta_3 WS + \epsilon\)
Reduced / null model \[LE = \beta_0 + \beta_1 FLR + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 1934.244\]
\[SSE = 2654.875\]
Full / Alternative model \[LE = \beta_0 + \beta_1 FLR + \beta_2 FS + \beta_3 WS + \epsilon\]
\[SSY = 4589.119\]
\[SSR = 3071.203\]
\[SSE = 1517.916\]
Often we use \(\alpha = 0.05\)
The test statistic is \(F\), and follows an F-distribution with numerator \(df=2\) (two coefficients are tested) and denominator \(df=n-k-1 = 72 - 3-1=68\). (\(n\) = # observations, \(k\) = # covariates in the full model)
The calculated test statistic is
\[F = \dfrac{\frac{SSE(R) - SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}=25.467\]
ANOVA table:
anova(mod_red3, mod_full3) %>% tidy() %>% gt() %>% tab_options(table.font.size = 35) %>% fmt_number(decimals = 3)
term | df.residual | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|---|
LifeExpectancyYrs ~ FemaleLiteracyRate | 70.000 | 2,654.875 | NA | NA | NA | NA |
LifeExpectancyYrs ~ FemaleLiteracyRate + FoodSupplykcPPD + WaterSourcePrct | 68.000 | 1,517.916 | 2.000 | 1,136.959 | 25.467 | 0.000 |
We reject the null hypothesis at the 5% significance level. There is sufficient evidence that countries’ food supply or water source (or both) contribute significantly to the prediction of life expectancy, given that female literacy rate is already in the model (p-value < 0.001).
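The same decision can be reached by comparing the F statistic with a critical value instead of a p-value. A minimal sketch using the `rss` and `df.residual` columns of the ANOVA table above, with \(\alpha = 0.05\):

```r
# Values copied from the ANOVA table for the group covariate test
sse_red  <- 2654.875  # LE ~ FLR
sse_full <- 1517.916  # LE ~ FLR + FS + WS
df_red   <- 70
df_full  <- 68

# Two coefficients are tested, so the numerator df is 70 - 68 = 2
df_num <- df_red - df_full
f_group <- ((sse_red - sse_full) / df_num) / (sse_full / df_full)
round(f_group, 3)  # 25.467

# Reject H0 when F exceeds the upper 5% point of F with (2, 68) df
f_crit <- qf(0.95, df1 = df_num, df2 = df_full)
f_group > f_crit
```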
Single covariate subset F-test
Group covariate subset F-test
MLR 2