Week 3

SLR: Hypothesis Testing and Evaluation
Published

January 22, 2024

Modified

January 22, 2024

Room Locations for the week
  • On Monday, 1/22, we will be in RLSB 3A003 B
  • On Wednesday, 1/24, we will be in RLSB 3A003 A

Resources

Below is a table with links to resources. Icons in orange mean there is an available file link.

Lesson Topic Slides Annotated Slides Recording
3 Simple Linear Regression continued

4 SLR: Inference and Prediction

5 SLR: More Inference

Poll Everywhere Questions

Quiz 1 Information

  • We will be in RLSB 3A003 B!!!

  • General structure

    • It will be a maximum of 15 questions

      • ~ 10 multiple choice questions (including T/F)
      • ~ 3 free response questions
  • What will it cover?

    • Lesson 2 (Data Management) to Lesson 4 (SLR: Inference, except the mean response)
      • So up to what we covered on Monday 1/22
    • HW 0 - 1
  • What can you expect?

    • Mostly concept questions
    • You may need to recognize what certain important functions do
    • You may need to recognize a number from R output (like the regression table on slide 4 in Lesson 4 slides)
  • Instructions that will be on the quiz:

    1. I have written a “30 minute” quiz. However, you have 50 minutes from 2:00 - 2:50pm.

    2. The quiz is open book and open notes. You may use books other than the class textbook, you may use anything on our course webpage, and you may use reference websites (like Wikipedia, Googling the expected value of a specific distribution, etc.).

    3. No cheating will be tolerated. Cheating includes:

      • Using ChatGPT

      • Using question and answer threads typically seen on sites like StackExchange, WikiHow, Quora, Reddit, StackOverflow, Chegg, etc.

      • Asking other students in the room or looking at other students’ quiz work.

On the Horizon

Class Exit Tickets

Monday (1/22)

Wednesday (1/24)

Announcements

Monday 1/22

  • Clarification on the homework

    • You can always turn in homework late for credit

    • The only way to secure feedback from the TAs is by turning it in on time (unless you and I have discussed something else)

    • Not the same for the lab!

  • HW 1: I changed some wording in Question 6, Parts f and g (less work to do! Yay!)

Wednesday 1/24

  • HW 1: You do not need to do Question 6f!!
  • While grading Lab 1
    • I noticed some really good answers that might have sources attached to them
    • It was not required that you cite resources in your lab, but it will be required in the project report
    • Just wanted to let you know before some of the sources you found leave your mind
      • They might by the end of the quarter
  • Student Leadership Council has set this year’s OHSU-PSU SPH National Public Health Week Conference for Thursday, April 4th, 2024.
    • Present your research in a low pressure situation!

    • You can even present on your project from this class!

      • I can help you polish it up!

Muddiest Points

1. The lecture materials don’t always feel like they apply to the homework. If asked to “state the linear regression models,” are we just running lm()?

Stating the linear regression model is asking us to show the population model that we are fitting. This is just to make sure we are aware of the model that we plan to fit. So the generic form of this is: \(Y = \beta_0 + \beta_1 X + \epsilon\).

Running lm() is the equivalent of fitting the model.
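
As a minimal sketch of the difference, here is the same idea with R's built-in `cars` data. That dataset is an assumption for illustration and is not one of our class datasets. Stating the model is something we write down; `lm()` is what estimates it.

```r
# Stated population model (written down, not run):
#   dist = beta_0 + beta_1 * speed + epsilon

# Fitting the model: lm() estimates beta_0 and beta_1 from the data
fit <- lm(dist ~ speed, data = cars)

# Estimated coefficients: "(Intercept)" is beta_0-hat, "speed" is beta_1-hat
b <- coef(fit)
b
```

So a homework answer to “state the model” is the equation in the comment, and the `lm()` call is the separate fitting step.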

Keep letting me know what feels disconnected in the class! Sometimes I purposefully say things in different ways to build our understanding, but sometimes that fails!

2. “all of the different manifestations of t”

I love the way this person said it!

So I’ve sorted this out:

  • We say \(T\) follows a t-distribution

    • \(T\) is the general name for the variable (like \(X\) or \(Y\))
  • We calculate a given \(t\)-value and call that \(t\)

    • We also call this the test statistic
  • The critical value that corresponds to a specific confidence level (and its \(\alpha\)) is labelled \(t^*\)
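
Here is a rough sketch of where each of these shows up in R, again assuming the built-in `cars` data purely for illustration. The object names (`t_stat`, `t_star`) are my own labels, not anything R requires.

```r
fit <- lm(dist ~ speed, data = cars)
coefs <- summary(fit)$coefficients

# t: the test statistic we calculate (estimate divided by its standard error)
t_stat <- coefs["speed", "Estimate"] / coefs["speed", "Std. Error"]

# T: the random variable; in SLR it follows a t-distribution with n - 2 df
df <- nrow(cars) - 2

# t*: the critical value for a 95% confidence level (alpha = 0.05)
alpha <- 0.05
t_star <- qt(1 - alpha / 2, df = df)

c(t = t_stat, t_star = t_star)
```

The value in `t_stat` matches the `t value` column of the regression table, while `t_star` is what we multiply the standard error by when building a confidence interval.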

3. What’s the difference between SD and variance?

SD (standard deviation) is the square root of the variance. That’s why I sometimes write \(\sigma\) (standard deviation) or \(\sigma^2\) (variance) when I’m talking about the distribution of residuals.

\[ \sigma = \sqrt{\sigma^2} \]

Variance is usually easier to work with mathematically, but standard deviation is in the same units as the variable. For example, the variance of 10 height measurements is in square inches, but the standard deviation is in inches.
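
A quick check in R, using 10 made-up heights (hypothetical values, only for illustration):

```r
# Hypothetical heights of 10 people, in inches
heights <- c(62, 64, 65, 66, 67, 68, 69, 70, 72, 74)

v <- var(heights)  # variance, in square inches
s <- sd(heights)   # standard deviation, in inches

# The standard deviation is the square root of the variance
all.equal(s, sqrt(v))
```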

4. Why is it important to test if \(\beta_1\) is equal to zero? Is \(\beta_1=0\) the same as the x and y variables having no correlation?

Let’s answer the second question: Yes! In simple linear regression, \(\beta_1 = 0\) exactly when the correlation between x and y is 0, so the two tests are equivalent. When we get to multiple linear regression, and have several variables/coefficients in our model, testing \(\beta_1=0\) won’t be the same as testing the correlation.

In simple linear regression, it is important to test \(\beta_1\) mostly for pedagogical reasons. It’s just helpful to establish the process in a simpler setting.
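
To see the equivalence in simple linear regression, here is a small sketch that compares the slope test from `lm()` with the correlation test from `cor.test()`. It assumes the built-in `cars` data for illustration; any pair of numeric variables would work the same way.

```r
fit <- lm(dist ~ speed, data = cars)

# Test of H0: beta_1 = 0, taken from the regression table
slope_row <- summary(fit)$coefficients["speed", ]
t_slope <- unname(slope_row["t value"])
p_slope <- unname(slope_row["Pr(>|t|)"])

# Test of H0: correlation = 0 (Pearson is the default method)
cor_test <- cor.test(cars$speed, cars$dist)
t_cor <- unname(cor_test$statistic)
p_cor <- cor_test$p.value

# Same test statistic and same p-value
c(t_slope = t_slope, t_cor = t_cor)
c(p_slope = p_slope, p_cor = p_cor)
```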

5. SSE and sigma

We were looking at the relationship between SSE and \(\widehat\sigma^2\):

\[ \widehat{\sigma}^2 = \frac{1}{n-2}SSE \]

The sum of squared errors is \(SSE = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n \widehat{\epsilon}_i^2\), where \(\widehat{\epsilon}_i = Y_i - \widehat{Y}_i\) is the residual for observation \(i\).

An aside on variance

The sample variance is built from the sum of the squared differences between values and their mean, divided by one less than the number of observations.

So if I had a variable \(S\), with 100 observations, the mean of \(S\), which we call \(\overline{S}\), would be \(\frac{\sum_{i=1}^{100} S_i}{100}\). The sample variance of \(S\) would be \(\frac{1}{99}\sum_{i=1}^{100} (S_i - \overline{S})^2\).

Now, let’s get back to the sum of squared errors: \(SSE = \sum_{i=1}^n \widehat{\epsilon}_i^2\)

The variance of the residuals starts from the same kind of sum: \(\sum_{i=1}^n (\widehat{\epsilon}_i - \overline{\widehat{\epsilon}})^2\). The mean of the residuals, \(\overline{\widehat{\epsilon}}\), is 0 when the model is fit by least squares with an intercept. So that sum is \(\sum_{i=1}^n \widehat{\epsilon}_i^2\), which is our SSE!

There is some more complicated math that goes into why we divide SSE by \(n-2\) (instead of \(n-1\)) to get \(\widehat{\sigma}^2\), but that’s basically it!
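
The relationship between SSE and \(\widehat{\sigma}^2\) can also be checked numerically. This is a minimal sketch, assuming the built-in `cars` data for illustration; `sigma()` is base R's extractor for the residual standard error.

```r
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)

res <- residuals(fit)        # Y_i minus Y_i-hat for each observation
mean_res <- mean(res)        # essentially 0, up to rounding error

SSE <- sum(res^2)            # sum of squared errors
sigma2_hat <- SSE / (n - 2)  # estimated variance, dividing by n - 2

# Matches the square of the residual standard error that R reports
all.equal(sigma2_hat, sigma(fit)^2)
```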

6. It would be helpful to get some clarification on the notations that we need to use in this class. Perhaps, a chart or the types of notations for this class could be helpful to organize this?

I can certainly work on this! In all honesty, I am a little overwhelmed with work this week, so I don’t think this is something I can produce by the quiz. If you want to get one started, I can share it! I think it would be a really good thing to help you study, too!

7. What is the relationship between the ANOVA for linear regression and ANOVA for group differences?

8. The limitations of the different tests, like an F statistic vs. t-test

9. unexplained vs the explained

10. Changing the confidence level in tidy()

Here is a good site about the input! Looks like we would set conf.int = TRUE and use conf.level to change the 95% confidence interval to some other level.
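
A small sketch of what that could look like, assuming the built-in `cars` data and a 90% level chosen only as an example. The `tidy()` call is guarded so the code still runs if the broom package is not installed; `confint()` is the base R equivalent.

```r
fit <- lm(dist ~ speed, data = cars)

# Base R: confint() takes the confidence level directly
ci_95 <- confint(fit)                # default level = 0.95
ci_90 <- confint(fit, level = 0.90)

# broom: conf.int = TRUE adds conf.low / conf.high columns,
# and conf.level changes the level from the default of 0.95
if (requireNamespace("broom", quietly = TRUE)) {
  tidy_90 <- broom::tidy(fit, conf.int = TRUE, conf.level = 0.90)
  print(tidy_90)
}
```

The 90% interval should come out narrower than the 95% interval for the same coefficient.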