Week 2

Simple Linear Regression

Published

January 15, 2024

Modified

January 11, 2023

Resources

Lesson	Topic	Slides	Annotated Slides	Recording
3	Simple Linear Regression

Simple Linear Regression

For the slides, once they are opened, if you would like to print or save them as a PDF, the best way to do this is:

Click on the icon with three horizontal bars on the bottom left of the browser.
Click on “Tools” with the gear icon at the top of the sidebar.
Click on “PDF Export Mode.”
From there, you can print or save the PDF as you would normally from your internet browser.

On the Horizon

Announcements

Wednesday 1/17

Our physical classroom space will be changing…
- It’s a little confusing - our time will be split between three classrooms in the RLSB
- 2 are right next to each other
- To start, our classes for next week are in:
  - on Monday, 1/22: RLSB 3A003B
  - on Wednesday, 1/24: RLSB 3A003A
HW 1 IS NOT DUE THIS WEEK!!! This is my mistake!!
- Homework 1 is due 1/25!!
- The finalized HW1 is finally up! Thank you for your patience!
Muddiest points for Week 1 are added
Office hours starting this week!
- First one is today at 4:30 with Antara
If you are in 612, the reading assignments are posted
Wanted to clear something up about attendance
- If you miss the exit tickets for less than or equal to 5 classes, your grade will not be impacted
- If you miss more than 5 exit tickets, then your attendance grade will be affected
Any questions on the lab? (10 minutes)

Class Exit Tickets

Wednesday (1/17)

Muddiest Points

1. What does the epsilon mean and how does it relate to the line in the linear model?

\(\epsilon\) is our error term. It is the difference between our observed value \(Y\) and the expected value of \(Y\) given \(X\), that is, the vertical difference between our observed \(Y\) and our line. It’s a mathematical way to represent the fact that not every observed \(Y\) value falls directly on our line. (When we estimate the line from a sample, the corresponding quantity is the residual, \(\widehat\epsilon\).)
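If it helps to see this with numbers, here is a minimal Python sketch. The intercept, slope, and simulated data are made up purely for illustration, and our course materials may use different software. It builds each observed \(Y\) as "line plus error" and then recovers \(\epsilon\) as \(Y - E(Y|X)\).

```python
import numpy as np

# Hypothetical population coefficients, chosen only for illustration
beta0, beta1 = 2.0, 0.5

rng = np.random.default_rng(seed=1)
x = rng.uniform(0, 10, size=5)
epsilon = rng.normal(loc=0.0, scale=1.0, size=5)  # simulated error term

expected_y = beta0 + beta1 * x   # E(Y|X): the point on the population line
y = expected_y + epsilon         # observed Y = line + error

# Recover epsilon as the difference between each observed Y and the line
recovered = y - expected_y

for xi, yi, mi, ei in zip(x, y, expected_y, recovered):
    print(f"x={xi:5.2f}  y={yi:5.2f}  E(Y|X)={mi:5.2f}  epsilon={ei:6.2f}")
```

In real data we never get to see `beta0`, `beta1`, or `epsilon`. That is why we work with the estimated versions instead.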

2. Different betas and stuff: make the table for the class!! and epsilon

Below is a table that I started to construct with a student after class. I think it can help a lot with this confusion. We often use the model or the line to represent linear regression. When we refer to the model, most people think of the row named Model. The line is just another way to represent the model. Remember that \(\epsilon = Y - E(Y|X)\) and \(\widehat\epsilon = Y - \widehat{E}(Y|X)\). Try substituting \(\epsilon = Y - E(Y|X)\) into the population model \(Y = \beta_0 + \beta_1X + \epsilon\). Does it simplify to the population line?

	Population	Estimated
Model	\[Y = \beta_0 + \beta_1X + \epsilon\]	\[Y = \widehat{\beta}_0 + \widehat{\beta}_1 X + \widehat\epsilon \]
Line	\[E(Y|X) = \beta_0 + \beta_1 X \] OR \[\mu_Y = \beta_0 + \beta_1 X \]	\[\widehat{Y} = \widehat{\beta}_0 + \widehat{\beta}_1 X \] OR \[ \widehat{E}[Y|X] = \widehat{\beta}_0 + \widehat{\beta}_1X \] OR \[ \widehat{E[Y|X]} = \widehat{\beta}_0 + \widehat{\beta}_1X \]
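
Here is a quick sketch of the substitution suggested above, using only the definitions in this section. Start from the population model, replace \(\epsilon\) with \(Y - E(Y|X)\), and then subtract \(Y\) from both sides and rearrange:

\[
\begin{aligned}
Y &= \beta_0 + \beta_1 X + \epsilon \\
Y &= \beta_0 + \beta_1 X + \big(Y - E(Y|X)\big) \\
E(Y|X) &= \beta_0 + \beta_1 X
\end{aligned}
\]

The same steps with \(\widehat\epsilon = Y - \widehat{E}(Y|X)\) turn the estimated model into the estimated line, \(\widehat{E}(Y|X) = \widehat{\beta}_0 + \widehat{\beta}_1 X\).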


2.1 Someone else asked: Why does the population model have an error term epsilon in the equation but the estimated line does not?

I think this is referring to this slide. The mismatch was because I wanted to put the population model next to the estimated line. I realize this is very confusing. Both the population and the estimated versions can be written either as a model (with an error term) or as a line (without one), as in the table above.

2.2 Someone else asked: Why does the population equation even matter?

Huh, I’m scratching my head with this one. Why does it matter? We basically mirror all the mathematical manipulations with the estimated model anyway…

But then I thought: What would our world or our class lectures look like without the population model? The answer might be more philosophical than mathematical. The representation of the true, underlying model that we are aspiring to with our sample data reminds us that our estimated model is not perfect. We are just trying our best to uncover some fraction of the truth. And at the end of the day, when we perform hypothesis tests, we’re working to provide evidence for the value of the coefficient parameters from the population model. We know what the estimated values are, but can they help us get an idea of what the parameter values are?

3. Math for minimizing SSE (aka OLS process)

I am very sorry that this math was intimidating! Most of us don’t need to see the math, but there are a handful of students who should see it and get a sense of what is happening underneath. Just wanted to make sure they saw it!

The important thing for us to know is the information on the slides for Steps 1 and 2, where we talk about the process itself. If I asked you why we minimize the SSE with respect to our coefficients, would you be able to answer?
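
If you would rather see the idea than the calculus, here is a minimal Python sketch. The data are made up, and I am assuming the standard closed-form OLS solution for simple linear regression, which may be laid out differently on the slides. It computes \(\widehat\beta_0\) and \(\widehat\beta_1\), then nudges each coefficient a little to show that the SSE only goes up.

```python
import numpy as np

def sse(b0, b1, x, y):
    """Sum of squared errors for the candidate line b0 + b1 * x."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))

def ols_fit(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b0_hat, b1_hat = ols_fit(x, y)
best = sse(b0_hat, b1_hat, x, y)

# Nudge either coefficient away from the OLS estimate and compare the SSE
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    nudged = sse(b0_hat + db0, b1_hat + db1, x, y)
    print(f"shift=({db0:+.1f}, {db1:+.1f})  SSE={nudged:.4f}  (OLS SSE={best:.4f})")
```

The point is not the formulas themselves. Among all possible lines, the OLS coefficients are the pair that makes the total squared distance between the observed \(Y\) values and the line as small as possible.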