2026-02-02
Describe the model assumptions made in linear regression using ordinary least squares
Determine if the relationship between our sampled X and Y is linear
Use QQ plots to determine if our fitted model holds the normality assumption
Use residual plots to determine if our fitted model holds the equality of variance assumption


Model Selection: building a model, selecting variables, prediction vs interpretation, comparing potential models
Model Fitting: finding the best-fit line, using OLS in this class, parameter estimation, categorical covariates, interactions
Model Evaluation
Model Use (Inference)
We have been looking at the association between life expectancy and cell phones
We used OLS to find the coefficient estimates of our best-fit line
Population model:
\[\begin{aligned} Y &= \beta_0 + \beta_1 \cdot X + \epsilon \\ \text{LE} &= \beta_0 + \beta_1 \cdot \text{CP} + \epsilon \end{aligned}\]
Estimated model:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 60.041 | 2.056 | 29.207 | 0.000 |
| cell_phones_100 | 0.094 | 0.017 | 5.546 | 0.000 |

The residuals \(\widehat\epsilon_i\) are the vertical distances between the observed values \(Y_i\) and the fitted values \(\widehat{Y}_i\) on the regression line
\[ \widehat\epsilon_i =Y_i - \widehat{Y}_i \text{, for } i=1, 2, ..., n \]
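To make the residual formula concrete, here is a minimal sketch in R. It uses simulated stand-in data (the variable names `x` and `y` and all numeric values are assumptions, not the lecture's life expectancy data) and checks that subtracting the fitted values by hand gives the same numbers as `resid()`.

```r
# Simulated stand-in data (NOT the lecture's dataset; all values are hypothetical)
set.seed(512)
n <- 100
sim <- data.frame(x = runif(n, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(n, mean = 0, sd = 6)

# Fit the simple linear regression with OLS
fit <- lm(y ~ x, data = sim)

# Fitted values: Y-hat_i = b0-hat + b1-hat * X_i
y_hat <- coef(fit)[1] + coef(fit)[2] * sim$x

# Residuals: observed value minus fitted value
e_hat <- sim$y - y_hat

# The hand-computed residuals should agree with the ones stored in the model
all.equal(unname(e_hat), unname(resid(fit)))
```

Because the model includes an intercept, the OLS residuals also sum to zero up to rounding error, which is a quick sanity check on any fit.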

These are the model assumptions made in ordinary least squares:
[L] Linearity of relationship between variables
[I] Independence of the \(Y\) values
[N] Normality of the \(Y\)’s given \(X\) (or residuals)
[E] Equality of variance of the residuals (homoscedasticity)
Note: These assumptions are baked into the population model. We state them in terms of the population parameters, but we use the estimated model fitted to our data to check whether they hold.
The mean of \(Y\) given \(X\) is a straight-line function of \(X\)
\[E[Y|X] = \beta_0 + \beta_1 \cdot X\]

The \(Y\)-values are statistically independent of one another
Examples of when they are not independent include
repeated measures (such as baseline, 3 months, 6 months)
data from clusters, such as different hospitals or families
This condition is checked by reviewing the study design and not by inspecting the data

The variance of \(Y\) given \(X\), \(\sigma_{Y|X}^2\), is the same for any \(X\)
This is also called homoscedasticity
\[\epsilon \sim N(0, \sigma^2)\]
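The distribution of \(Y\) given \(X\) is normal
This means that the residuals are normally distributed
In mathematical form:
\(Y_i|X_i \overset{\text{ind.}}{\sim} N(\beta_0 + \beta_1X_i, \sigma^2)\)
\(\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)\)
where “i.i.d.” means independent and identically distributed, and “ind.” means independent (the \(Y_i\) have different means, so only the errors are identically distributed)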
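One way to see what the population model assumes is to simulate from it. The sketch below draws errors that are i.i.d. \(N(0, \sigma^2)\), builds \(Y\) from a straight line plus those errors, and then fits OLS. The parameter values `beta0`, `beta1`, and `sigma` are hypothetical choices for illustration only.

```r
# Hypothetical population parameters (chosen for illustration only)
set.seed(7)
n     <- 500
beta0 <- 60
beta1 <- 0.1
sigma <- 6

# X values, then errors that are independent, normal, with constant variance
x   <- runif(n, min = 0, max = 200)
eps <- rnorm(n, mean = 0, sd = sigma)

# Linearity: the mean of Y given X is beta0 + beta1 * X
y <- beta0 + beta1 * x + eps

# OLS estimates should land near the population values
fit <- lm(y ~ x)
coef(fit)    # estimates of beta0 and beta1
sigma(fit)   # estimate of sigma (residual standard error)
```

Since all four LINE assumptions hold by construction here, this kind of simulated data is a useful reference for what well-behaved diagnostic plots look like.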
[L] Linearity of relationship between variables
Check if there is a linear relationship between the mean response (Y) and the explanatory variable (X)
[I] Independence of the \(Y\) values
Check that the observations are independent
[N] Normality of the \(Y\)’s given \(X\) (residuals)
Check that the responses (at each level X) are normally distributed
[E] Equality of variance of the residuals (homoscedasticity)
Check that the variance (or standard deviation) of the responses is equal for all levels of X
Is the association between the variables linear?
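A common first look at linearity is a scatterplot of \(Y\) against \(X\) with the fitted line on top. The sketch below assumes ggplot2 and uses simulated stand-in data (not the lecture's dataset); it overlays the straight OLS line and a flexible loess smoother so that the two can be compared by eye.

```r
library(ggplot2)

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)

# Scatterplot with the OLS line (blue) and a loess smoother (red)
p_scatter <- ggplot(sim, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, colour = "red") +
  labs(x = "X (explanatory variable)", y = "Y (response)")

p_scatter
```

If the smoother bends clearly away from the straight line, that suggests the linearity assumption may not hold.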
Diagnostic tools:
Distribution plots of residuals
QQ plots of residuals
Residuals can be extracted with the augment() function from the broom package:
Rows: 105
Columns: 8
$ life_exp <dbl> 62.64, 76.07, 73.41, 75.37, 73.66, 71.37, 63.96, 75.47…
$ cell_phones_100 <dbl> 56.2655, 98.3950, 195.6250, 131.4840, 130.5400, 107.50…
$ .fitted <dbl> 65.32037, 69.27372, 78.39761, 72.37873, 72.29015, 70.1…
$ .resid <dbl> -2.6803652, 6.7962791, -4.9876074, 2.9912674, 1.369850…
$ .hat <dbl> 0.038747119, 0.012168777, 0.059882210, 0.011325165, 0.…
$ .sigma <dbl> 5.987137, 5.954886, 5.971571, 5.985846, 5.991701, 5.99…
$ .cooksd <dbl> 4.234809e-03, 8.096656e-03, 2.369189e-02, 1.457236e-03…
$ .std.resid <dbl> -0.45838569, 1.14653081, -0.86249588, 0.50441083, 0.23…
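Output like the above can be produced with a call along these lines. This is a sketch on simulated stand-in data with hypothetical column names `x` and `y`, so the numbers will not match the lecture's output; `glimpse()` is assumed to come from dplyr.

```r
library(broom)
library(dplyr)

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)

fit <- lm(y ~ x, data = sim)

# augment() returns the model data plus one row-level diagnostic per column,
# including the fitted values (.fitted) and residuals (.resid)
aug <- augment(fit)
glimpse(aug)
```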
Note that below I save each figure as an object, and then combine them together in one row of output using grid.arrange() from the gridExtra package
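A minimal sketch of that pattern, assuming ggplot2, broom, and gridExtra, and using simulated stand-in data rather than the lecture's dataset. The object names `hist1`, `density1`, and `box1` are hypothetical.

```r
library(ggplot2)
library(broom)
library(gridExtra)

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)
aug <- augment(lm(y ~ x, data = sim))

# Save each distribution plot of the residuals as its own object
hist1    <- ggplot(aug, aes(x = .resid)) + geom_histogram(bins = 20)
density1 <- ggplot(aug, aes(x = .resid)) + geom_density()
box1     <- ggplot(aug, aes(x = .resid)) + geom_boxplot()

# Combine the three plots in one row of output
combined <- grid.arrange(hist1, density1, box1, nrow = 1)
```

For roughly normal residuals, the histogram and density should look symmetric and bell-shaped, and the boxplot should be centered near zero.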

[Figures: three sets of example plots, each with panels for Normal, Uniform, T, and Skewed data]
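A QQ plot of the residuals can be drawn with ggplot2's `stat_qq()` and `stat_qq_line()`. The sketch below uses simulated stand-in data (not the lecture's dataset), and the column name `resid` is a hypothetical choice.

```r
library(ggplot2)

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)
fit <- lm(y ~ x, data = sim)

# Put the residuals in a data frame so ggplot can use them
res_df <- data.frame(resid = resid(fit))

# Sample quantiles of the residuals against theoretical normal quantiles
qq_plot <- ggplot(res_df, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(x = "Theoretical normal quantiles", y = "Residual quantiles")

qq_plot
```

Points that follow the reference line closely are consistent with normality; systematic curvature or heavy departures in the tails suggest otherwise.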


Goodness-of-fit test for the normal distribution: Is there evidence that our residuals are from a normal distribution?
Honestly: I don’t use this test very often in practice
Hypothesis test:
\[\begin{aligned} H_0 & : \text{data are from a normally distributed population} \\ H_1 & : \text{data are NOT from a normally distributed population} \end{aligned}\]
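The text does not name a specific test, so the sketch below assumes the Shapiro-Wilk test, which is one common goodness-of-fit test for normality and is available in base R as `shapiro.test()`. It is run on residuals from simulated stand-in data, not the lecture's dataset.

```r
# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)
fit <- lm(y ~ x, data = sim)

# Shapiro-Wilk test (an assumed choice of test) applied to the residuals
sw <- shapiro.test(resid(fit))
sw

# A small p-value is evidence against H0 (normality);
# a large p-value means we fail to reject H0, not that normality is proven
sw$p.value
```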
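For the equality of variance assumption, a residual plot puts the residuals on the vertical axis against the fitted values (or \(X\)) on the horizontal axis. Here is a minimal sketch assuming ggplot2 and broom, on simulated stand-in data rather than the lecture's dataset.

```r
library(ggplot2)
library(broom)

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)
aug <- augment(lm(y ~ x, data = sim))

# Residuals against fitted values, with a reference line at zero
resid_plot <- ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

resid_plot
```

Under homoscedasticity the vertical spread of the points should look about the same across the whole horizontal range; a fan or funnel shape suggests the variance changes with \(X\).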


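autoplot() can be a helpful tool

| Assumption | What needs to hold? | Diagnostic tool |
|---|---|---|
| Linearity | The relationship between the mean response \(Y\) and \(X\) is linear | Scatterplot of \(Y\) vs. \(X\) |
| Independence | The observations are independent | Review of the study design |
| Normality | The residuals (and thus \(Y\) given \(X\)) are normally distributed | Distribution plots and QQ plots of residuals |
| Equality of variance | The variance of the residuals is the same for all levels of \(X\) (homoscedasticity) | Residual plot |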
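As a sketch of how autoplot() might be used here: the text does not say which package supplies the method for `lm` objects, so this example assumes the ggfortify package. It is run on simulated stand-in data, not the lecture's dataset.

```r
library(ggplot2)
library(ggfortify)  # assumed source of the autoplot() method for lm objects

# Simulated stand-in data (NOT the lecture's dataset; values are hypothetical)
set.seed(512)
sim <- data.frame(x = runif(100, min = 0, max = 200))
sim$y <- 60 + 0.1 * sim$x + rnorm(100, mean = 0, sd = 6)
fit <- lm(y ~ x, data = sim)

# Draws the standard regression diagnostic plots in one call,
# including residuals vs fitted values and a normal QQ plot
diag_plots <- autoplot(fit)
diag_plots
```

This gives a quick overview of several assumptions at once, while the individual plots built by hand allow more control over what is shown.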
Lesson 7: SLR 4