2024-12-09
Identify the simple linear regression model and define statistics language for key notation
Illustrate how ordinary least squares (OLS) finds the best model parameter estimates
Apply OLS in R for simple linear regression of real data
Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
Average life expectancy vs. female literacy rate
\[\widehat{\text{life expectancy}} = 50.9 + 0.232\cdot\text{female literacy rate}\]
ggplot(gapm, aes(x = female_literacy_rate_2011,
                 y = life_expectancy_years_2011)) +
  geom_point(size = 4) +
  geom_smooth(method = "lm", se = FALSE, size = 3, colour = "#F14124") +
  labs(x = "Female literacy rate (%)",
       y = "Life expectancy (years)",
       title = "Relationship between life expectancy and\nthe female literacy rate in 2011") +
  theme(axis.title = element_text(size = 30),
        axis.text = element_text(size = 25),
        title = element_text(size = 30))
Data files
Cleaned: lifeexp_femlit_2011.csv
Needs cleaning: lifeexp_femlit_water_2011.csv
Data were downloaded from Gapminder
2011 is the most recent year with the most complete data
Life expectancy = the average number of years a newborn child would live if current mortality patterns were to stay the same.
Adult literacy rate is the percentage of people ages 15 and above who can, with understanding, read and write a short, simple statement on their everyday life.
Rows: 188
Columns: 3
$ country <chr> "Afghanistan", "Albania", "Algeria", "Andor…
$ life_expectancy_years_2011 <dbl> 56.7, 76.7, 76.7, 82.6, 60.9, 76.9, 76.0, 7…
$ female_literacy_rate_2011 <dbl> 13.0, 95.7, NA, NA, 58.6, 99.4, 97.9, 99.5,…
life_expectancy_years_2011 female_literacy_rate_2011
Min. :47.50 Min. :13.00
1st Qu.:64.30 1st Qu.:70.97
Median :72.70 Median :91.60
Mean :70.66 Mean :81.65
3rd Qu.:76.90 3rd Qu.:98.03
Max. :82.90 Max. :99.80
NA's :1 NA's :108
\[\widehat{\text{life expectancy}} = 50.9 + 0.232\cdot\text{female literacy rate}\]
Illustrate how ordinary least squares (OLS) finds the best model parameter estimates
Apply OLS in R for simple linear regression of real data
Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
The (population) regression model is denoted by:
\[Y = \beta_0 + \beta_1X + \epsilon\]
Unobservable population parameters
\(\beta_0\) and \(\beta_1\) are unknown population parameters
\(\epsilon\) (epsilon) is the error about the line
It is assumed to be a random variable with a Normal distribution with mean 0 and constant variance \(\sigma^2\), i.e. \(\epsilon \sim N(0, \sigma^2)\)
Observable sample data
\(Y\) is our dependent variable
\(X\) is our independent variable
The (population) regression model is denoted by:
\[Y = \beta_0 + \beta_1X + \epsilon\]
Component | Type | Name |
---|---|---|
\(Y\) | Observed | response, outcome, dependent variable |
\(\beta_0\) | Pop. parameter | intercept |
\(\beta_1\) | Pop. parameter | slope |
\(X\) | Observed | predictor, covariate, independent variable |
\(\epsilon\) | Random variable | residuals, error term |
Note: the population model is the true, underlying model that we are trying to estimate using our sample data
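To make the population-vs-sample distinction concrete, here is a small simulation sketch (all parameter values below are made up for illustration): data are generated from a population model with known \(\beta_0\), \(\beta_1\), and \(\sigma\), and `lm()` is used to estimate them from the sample.

```r
# Simulate from a population model Y = beta0 + beta1*X + epsilon,
# with epsilon ~ N(0, sigma^2); all parameter values are hypothetical
set.seed(42)
n     <- 500
beta0 <- 50    # true intercept (made up)
beta1 <- 0.25  # true slope (made up)
sigma <- 6     # SD of the error term

x <- runif(n, min = 10, max = 100)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

# The estimated coefficients should land close to the true parameters
fit <- lm(y ~ x)
coef(fit)
```

With a larger sample size \(n\), the estimates \(\widehat\beta_0, \widehat\beta_1\) concentrate more tightly around the true values.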
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X \]
\[Y = \beta_0 + \beta_1X + \epsilon\]
Component | Name |
---|---|
\(Y\) | response, outcome, dependent variable |
\(\beta_0\) | intercept |
\(\beta_1\) | slope |
\(X\) | predictor, covariate, independent variable |
\(\epsilon\) | residuals, error term |
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1X\]
Component | Name |
---|---|
\(\widehat{Y}\) | estimated expected response given predictor \(X\) |
\(\widehat{\beta}_0\) | estimated intercept |
\(\widehat{\beta}_1\) | estimated slope |
\(X\) | predictor, covariate, independent variable |
Apply OLS in R for simple linear regression of real data
Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
Recall, one characteristic of our population model was that the residuals, \(\epsilon\), were Normally distributed: \(\epsilon \sim N(0, \sigma^2)\)
In our population regression model, we had: \[Y = \beta_0 + \beta_1X + \epsilon\]
We can also take the average (expected) value of the population model
We take the expected value of both sides, conditional on \(X\), and get:
\[\begin{aligned} E[Y|X] & = E[\beta_0 + \beta_1X + \epsilon \mid X] \\ E[Y|X] & = E[\beta_0 \mid X] + E[\beta_1X \mid X] + E[\epsilon \mid X] \\ E[Y|X] & = \beta_0 + \beta_1X + E[\epsilon] \\ E[Y|X] & = \beta_0 + \beta_1X \\ \end{aligned}\]
With observed \(Y\) values and residuals:
\[Y = \beta_0 + \beta_1X + \epsilon\]
With the population expected value of \(Y\) given \(X\):
\[E[Y|X] = \beta_0 + \beta_1X\]
Using the two forms of the model, we can figure out a formula for our residuals:
\[\begin{aligned} Y & = (\beta_0 + \beta_1X) + \epsilon \\ Y & = E[Y|X] + \epsilon \\ Y - E[Y|X] & = \epsilon \\ \epsilon & = Y - E[Y|X] \end{aligned}\]
And so we have our true, population model, residuals!
This is an important fact! For the population model, the residuals: \(\epsilon = Y - E[Y|X]\)
We have the same two representations of our estimated/fitted model:
With observed values:
\[Y = \widehat{\beta}_0 + \widehat{\beta}_1X + \widehat{\epsilon}\]
With the estimated expected value of \(Y\) given \(X\):
\[\begin{aligned} \widehat{E}[Y|X] & = \widehat{\beta}_0 + \widehat{\beta}_1X \\ \widehat{E[Y|X]} & = \widehat{\beta}_0 + \widehat{\beta}_1X \\ \widehat{Y} & = \widehat{\beta}_0 + \widehat{\beta}_1X \\ \end{aligned}\]
Using the two forms of the model, we can figure out a formula for our estimated residuals:
\[\begin{aligned} Y & = (\widehat{\beta}_0 + \widehat{\beta}_1X) + \widehat\epsilon \\ Y & = \widehat{Y} + \widehat\epsilon \\ \widehat\epsilon & = Y - \widehat{Y} \end{aligned}\]
This is an important fact! For the estimated/fitted model, the residuals: \(\widehat\epsilon = Y - \widehat{Y}\)
Observed values for each individual \(i\): \(Y_i\)
Fitted value for each individual \(i\): \(\widehat{Y}_i\)
Residual for each individual: \(\widehat\epsilon_i = Y_i - \widehat{Y}_i\)
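This identity can be checked directly in R on any fitted model; `residuals()` and `fitted()` return exactly these quantities (illustrated on simulated data, which are hypothetical stand-ins for the course data).

```r
# Check that residuals(fit) equals Y - fitted(fit), observation by observation
set.seed(1)
x <- rnorm(20)
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)

all.equal(residuals(fit), y - fitted(fit))  # TRUE: eps-hat = Y - Y-hat
sum(residuals(fit))  # numerically 0: raw OLS residuals always cancel out
```

The fact that the raw residuals sum to zero is why OLS works with squared residuals rather than their plain sum.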
We want to minimize the sum of the squared residuals
We can use ordinary least squares (OLS) to do this in linear regression!
Idea behind this: reduce the total error between the fitted line and the observed points (this error is what we call the residuals)
Note: there are other ways to estimate the best-fit line!!
Identify the simple linear regression model and define statistics language for key notation
Illustrate how ordinary least squares (OLS) finds the best model parameter estimates
Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Calculate and report the estimate and confidence interval for the population slope \(\beta_1\) (applies to \(\beta_0\) as well)
\[ \begin{aligned} SSE & = \displaystyle\sum^n_{i=1} \widehat\epsilon_i^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - \widehat{Y}_i)^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - (\widehat{\beta}_0+\widehat{\beta}_1X_i))^2 \\ SSE & = \displaystyle\sum^n_{i=1} (Y_i - \widehat{\beta}_0-\widehat{\beta}_1X_i)^2 \end{aligned}\]
Things to use
\(\widehat\epsilon_i = Y_i - \widehat{Y}_i\)
\(\widehat{Y}_i = \widehat\beta_0 + \widehat\beta_1X_i\)
Then we want to find the estimated coefficient values that minimize the SSE!
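A quick numerical sketch (on made-up data) of what "minimize the SSE" means: moving either coefficient away from the OLS solution can only increase the sum of squared errors.

```r
# SSE as a function of candidate coefficients (b0, b1); data are hypothetical
set.seed(7)
x <- runif(50, min = 0, max = 100)
y <- 50 + 0.25 * x + rnorm(50, sd = 5)
fit <- lm(y ~ x)

sse <- function(b0, b1) sum((y - b0 - b1 * x)^2)

b <- coef(fit)
sse(b[1], b[2])          # SSE at the OLS estimates: the minimum
sse(b[1] + 1, b[2])      # any other intercept gives a larger SSE
sse(b[1], b[2] + 0.05)   # any other slope gives a larger SSE
```

Because the SSE is a quadratic function of \((\widehat\beta_0, \widehat\beta_1)\), the OLS solution is its unique minimizer.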
lm()
lm()
+ summary()
Call:
lm(formula = life_expectancy_years_2011 ~ female_literacy_rate_2011,
data = gapm)
Residuals:
Min 1Q Median 3Q Max
-22.299 -2.670 1.145 4.114 9.498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.92790 2.66041 19.143 < 2e-16 ***
female_literacy_rate_2011 0.23220 0.03148 7.377 1.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.142 on 78 degrees of freedom
(108 observations deleted due to missingness)
Multiple R-squared: 0.4109, Adjusted R-squared: 0.4034
F-statistic: 54.41 on 1 and 78 DF, p-value: 1.501e-10
lm()
+ tidy()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 50.9278981 | 2.66040695 | 19.142898 | 3.325312e-31 |
female_literacy_rate_2011 | 0.2321951 | 0.03147744 | 7.376557 | 1.501286e-10 |
\[\widehat{\text{life expectancy}} = 50.9 + 0.232\cdot\text{female literacy rate}\]
For every increase of 1 unit in the \(X\)-variable, there is an expected increase of \(\widehat\beta_1\) units, on average, in the \(Y\)-variable.
We only say that there is an expected increase, not necessarily a causal increase.
Example: For every 1 percent increase in the female literacy rate, life expectancy increases, on average, 0.232 years.
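Plugging values into the fitted equation shows this interpretation numerically (using the rounded coefficients from the equation above):

```r
# Predicted life expectancy at several female literacy rates,
# using the rounded fitted equation y-hat = 50.9 + 0.232 * x
b0 <- 50.9
b1 <- 0.232
literacy <- c(25, 50, 75, 100)
b0 + b1 * literacy
# -> 56.7 62.5 68.3 74.1; each +1% literacy adds 0.232 years on average
```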
Identify the simple linear regression model and define statistics language for key notation
Illustrate how ordinary least squares (OLS) finds the best model parameter estimates
Apply OLS in R for simple linear regression of real data
Often, we are curious if the coefficient is 0 or not:
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
Often we use \(\alpha = 0.05\)
The test statistic is \(t\), and follows a Student’s t-distribution.
The calculated test statistic for \(\widehat\beta_1\) is
\[t = \frac{ \widehat\beta_1 - \beta_1}{ \text{SE}_{\widehat\beta_1}} = \frac{ \widehat\beta_1}{ \text{SE}_{\widehat\beta_1}}\]
when we assume \(H_0: \beta_1 = 0\) is true.
We are generally calculating: \(2\cdot P(T > |t|)\)
We (reject/fail to reject) the null hypothesis that the slope is 0 at the \(100\alpha\%\) significance level. There is (sufficient/insufficient) evidence of a significant association between (\(Y\)) and (\(X\)) (p-value = \(2\cdot P(T > |t|)\)).
The test statistic for a single coefficient follows a Student’s t-distribution
Single coefficient testing can be done on any coefficient, but it is most useful for continuous covariates or binary covariates
We are testing if the slope is 0 or not:
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
Often we use \(\alpha = 0.05\)
The test statistic is \(t\), and follows a Student’s t-distribution.
# restrict the regression table to the b1 (slope) row
model1_b1 <- tidy(model1) %>% filter(term == "female_literacy_rate_2011")
model1_b1 %>% gt() %>%
tab_options(table.font.size = 40) %>% fmt_number(decimals = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
female_literacy_rate_2011 | 0.23 | 0.03 | 7.38 | 0.00 |
[1] 7.376557
The \(p\)-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true
We know the probability distribution of the test statistic (the null distribution) assuming \(H_0\) is true
Statistical theory tells us that the test statistic \(t\) can be modeled by a \(t\)-distribution with \(df = n-2\).
Option 1: Use pt() and our calculated test statistic
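A sketch of this calculation with the slope's test statistic from the regression table (\(t = 7.376557\), \(df = n - 2 = 78\)):

```r
# Two-sided p-value: 2 * P(T > |t|) on a t-distribution with df = 78
t_stat <- 7.376557
2 * pt(abs(t_stat), df = 78, lower.tail = FALSE)
# matches the p-value of 1.5e-10 reported in the lm() output
```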
We reject the null hypothesis that the slope is 0 at the \(5\%\) significance level. There is sufficient evidence of a significant association between life expectancy and the female literacy rate (p-value < 0.0001).
We are testing if the intercept is 0 or not:
\[\begin{align} H_0 &: \beta_0 = 0\\ \text{vs. } H_A&: \beta_0 \neq 0 \end{align}\]
Often we use \(\alpha = 0.05\)
This is the same as the slope. The test statistic is \(t\), and follows a Student’s t-distribution.
# restrict the regression table to the b0 (intercept) row
model1_b0 <- tidy(model1) %>% filter(term == "(Intercept)")
model1_b0 %>% gt() %>%
tab_options(table.font.size = 40) %>% fmt_number(decimals = 2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 50.93 | 2.66 | 19.14 | 0.00 |
[1] 19.1429
Use pt() and our calculated test statistic
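The same calculation for the intercept's test statistic (\(t = 19.1429\), \(df = 78\)):

```r
# Two-sided p-value for the intercept, df = n - 2 = 78
t_stat <- 19.1429
2 * pt(abs(t_stat), df = 78, lower.tail = FALSE)
# far below any usual alpha level (about 3e-31)
```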
We reject the null hypothesis that the intercept is 0 at the \(5\%\) significance level. There is sufficient evidence that the intercept for the association between average life expectancy and the female literacy rate is different from 0 (p-value < 0.0001).
Identify the simple linear regression model and define statistics language for key notation
Illustrate how ordinary least squares (OLS) finds the best model parameter estimates
Apply OLS in R for simple linear regression of real data
Using a hypothesis test, determine if there is enough evidence that the population slope \(\beta_1\) is not 0 (applies to \(\beta_0\) as well)
Population model
line + random “noise”
\[Y = \beta_0 + \beta_1 \cdot X + \epsilon\] with \(\epsilon \sim N(0,\sigma^2)\)
\(\sigma^2\) is the variance of the residuals
Sample best-fit (least-squares) line
\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X \]
Note: Some sources use \(b\) instead of \(\widehat{\beta}\)
We have two options for inference:
\[\begin{align} H_0 &: \beta_1 = 0\\ \text{vs. } H_A&: \beta_1 \neq 0 \end{align}\]
Note: R reports p-values for 2-sided tests
Recall the general CI formula:
\[\widehat{\beta}_1 \pm t_{\alpha, n-2}^* \cdot SE_{\widehat{\beta}_1}\]
To construct the confidence interval, we need to:
Set our \(\alpha\)-level
Find \(\widehat\beta_1\)
Calculate the critical value \(t_{n-2}^*\)
Calculate \(SE_{\widehat{\beta}_1}\)
\[\widehat{\beta}_1 \pm t^*\cdot SE_{\widehat{\beta}_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n -2\).
Save values needed for CI:
Use formula to calculate each bound
\[\widehat{\beta}_1 \pm t^*\cdot SE_{\widehat{\beta}_1}\]
where \(t^*\) is the \(t\)-distribution critical value with \(df = n -2\).
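A sketch of the by-hand bound calculation, using the slope estimate and standard error from the regression table (\(df = n - 2 = 78\)):

```r
# 95% CI for the slope: estimate +/- t* x SE
b1    <- 0.2321951     # slope estimate from the regression table
se_b1 <- 0.03147744    # its standard error
tstar <- qt(0.975, df = 78)   # two-sided 95% critical value

b1 + c(-1, 1) * tstar * se_b1
# -> approximately (0.170, 0.295)
```

Note that `qt(0.975, ...)` is used for a 95% interval because the \(\alpha = 0.05\) is split between the two tails.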
When we report our results to someone else, we don’t usually show them our full hypothesis test
Typically, we report the estimate with the confidence interval
Once we found our CI, we often just write the interpretation of the coefficient estimate:
General statement for population slope inference
For every increase of 1 unit in the \(X\)-variable, there is an expected increase of \(\widehat\beta_1\) units, on average, in the \(Y\)-variable (95% CI: LB, UB).
tidy(model1, conf.int = TRUE) %>% gt() %>%
tab_options(table.font.size = 40) %>% fmt_number(decimals = 3)
term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|
(Intercept) | 50.928 | 2.660 | 19.143 | 0.000 | 45.631 | 56.224 |
female_literacy_rate_2011 | 0.232 | 0.031 | 7.377 | 0.000 | 0.170 | 0.295 |
General statement for population intercept inference
The expected outcome for the \(Y\)-variable is (\(\widehat\beta_0\)) when the \(X\)-variable is 0 (95% CI: LB, UB).
Lesson 18 Slides