Muddy Points

Lesson 6: SLR: More inference

Modified

January 28, 2026

Muddy Points from 2026

1. 3 types of tests (i.e., Slide 29)

I’ll go over this more as we get into multiple regression.

2. Understanding degrees of freedom, which I brought up in the chat… I don’t understand where they come from. I know it isn’t imperative to understand them to conduct these analyses, but I’d really like to understand it.

Degrees of freedom are a way to account for how much information we have in our data. When we estimate parameters (like means, variances, regression coefficients), we are using up some of that information. The degrees of freedom essentially tell us how many independent pieces of information are left after accounting for the parameters we’ve estimated.

For the F-test, the numerator degrees of freedom correspond to the number of coefficients being tested (e.g., the number of predictors in the model). That is the information SSR uses up. The denominator degrees of freedom are the number of observations minus the number of estimated coefficients (including the intercept). That is the information left over for SSE. In simple linear regression, this gives 1 numerator degree of freedom and \(n - 2\) denominator degrees of freedom.
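To see where these numbers show up in practice, here is a minimal sketch using R’s built-in `mtcars` data. The dataset and the choice of `mpg ~ wt` are assumptions for illustration only, not the example from class.

```r
# Simple linear regression on the built-in mtcars data (illustrative only)
model <- lm(mpg ~ wt, data = mtcars)

n <- nrow(mtcars)          # number of observations
p <- length(coef(model))   # estimated coefficients: intercept + slope

df_numerator   <- p - 1    # coefficients being tested (just the slope)
df_denominator <- n - p    # observations minus estimated coefficients

# The ANOVA table reports the same degrees of freedom in its Df column
anova_table <- anova(model)
anova_table$Df
```

The hand-computed `df_numerator` and `df_denominator` should match the `Df` column that `anova()` reports for the regression and residual rows.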

3. You said if variation explained by regression is 0, it does not give us info about how Y changes when X changes. I’m confused about that.

If the variation explained by regression is 0, it means that the independent variable (X) does not provide any information about the dependent variable (Y) through the fitted line. In other words, changes in X are not associated with any predictable changes in Y.

When we say “variation explained by regression,” we’re referring to how much of the total variability in Y can be accounted for by changes in X. If this value is 0, the fitted line is flat (its slope is 0), so knowing X does not help us understand or predict Y at all. Therefore, we cannot say that Y changes when X changes because there is no linear relationship between them.
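A short derivation may help here. It is a sketch that assumes simple linear regression, writes SSR for the variation explained by regression, and uses the fact that the fitted line passes through \((\overline{X}, \overline{Y})\):

\[ \begin{aligned} SSR = & \sum_{i=1}^n \big( \widehat{Y_i} - \overline{Y} \big)^2 \\ = & \sum_{i=1}^n \big( \widehat{\beta_0} + \widehat{\beta_1} X_i - \widehat{\beta_0} - \widehat{\beta_1} \overline{X} \big)^2 \\ = & \widehat{\beta_1}^2 \sum_{i=1}^n \big( X_i - \overline{X} \big)^2 \end{aligned} \]

As long as the X values are not all identical, SSR can only be 0 when \(\widehat{\beta_1} = 0\), which is exactly the case where the fitted line is flat.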

4. I guess I’m still not fully clear on why a t-test can’t be used for multiple covariates.

A t-test can be used on individual coefficients to test if each coefficient is significantly different from zero, but it does not account for the overall model fit when multiple coefficients are involved. So when we have multiple covariates (or coefficients), a single t-test cannot answer the question of whether the model as a whole explains more variation in the outcome than a model with no predictors.
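As a rough illustration of the two kinds of tests, the sketch below fits a model with two covariates on R’s built-in `mtcars` data. The dataset and the predictors `wt` and `hp` are assumptions chosen only for illustration.

```r
# A model with two covariates (illustrative choice of predictors)
model <- lm(mpg ~ wt + hp, data = mtcars)
model_summary <- summary(model)

# One t-test per coefficient: each row tests a single coefficient against 0
model_summary$coefficients

# One overall F-test: tests all slopes at once against the intercept-only model
# (named vector with the F value, numerator df, and denominator df)
model_summary$fstatistic
```

The coefficient table has one row (and one t-test) per coefficient, while there is only a single F statistic for the model as a whole.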

Muddy Points from 2025

1. I was wondering when you should use variance versus deviation…

This is a very good question! I was using them a little loosely in today’s class. Deviation refers to the difference between two values for a single observation \(i\). So the deviation between the best fit line and an observation is \(Y_i - \widehat{Y_i}\). We often call this specific deviation the residual or error.

Variation refers to the entire sample. We calculate the variation by summing the squared deviations. It will give us a sense of how the sample, as a whole, is spread.
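Written out, the two ideas look like this. This is a sketch for the deviation from the best fit line, using SSE for the resulting variation:

\[ \begin{aligned} \text{deviation for observation } i: & \quad Y_i - \widehat{Y_i} \\ \text{variation for the sample}: & \quad SSE = \sum_{i=1}^n \big( Y_i - \widehat{Y_i} \big)^2 \end{aligned} \]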

2. “F statistic > 1 is more evidence”. . . what does that mean?

More evidence… that the model explains the variation more than it does not. The larger the F statistic gets (once already > 1), the more and more evidence that the model explains a big portion of the total variation! Always remember that the F statistic measures the ratio of explained to unexplained variation, with each divided by its degrees of freedom!
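For reference, here is the ratio written out for simple linear regression, as a sketch that assumes \(n\) observations and a single predictor:

\[ F = \frac{SSR / 1}{SSE / (n-2)} = \frac{MSR}{MSE} \]

When the regression explains no more than we would expect by chance, the numerator and denominator are about the same size and F is near 1.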

3. Why do we need to include the confidence interval when asked to interpret coefficients in a model? To see that our model is statistically significant (doesn’t contain the value 0)?

We do not always need to come to a conclusion on the statistical significance. The confidence interval provides a lot of good information about how precisely we have estimated the coefficient. A coefficient estimate like 10 might seem really powerful, but once you see that the confidence interval spans from 1 to 19 (even though significant!), you might be less inclined to really broadcast the 10 alone. Remember, statistically significant just means there’s evidence that it is not 0, not that we have evidence that it is 10.

My tangent: that’s why I get so frustrated when I read news articles or hear people reporting things like “you are 10x more likely to get blah.” Yeah, 10 is a scary big number, but I want to know if that 10 has a confidence interval of 9 to 11 or 1 to 19. It changes things and lets me know the quality of the data they were working with!

4. Is there a correlation for categorical variables?

Unfortunately no. The correlation we have been using only works for two continuous variables.

5. What does it mean, in the null hypothesis, that our Beta1 I(america) + Beta2 I(asia) + Beta3 I(europe) equal zero? Is that saying they lie exactly on the line of best fit?

Be careful! It did NOT say “+”. The null hypothesis is that all the coefficients are 0. That \(\beta_1 = \beta_2 = \beta_3 = 0\). This is just quicker to write than \(\beta_1 = 0, \beta_2 = 0, \beta_3 = 0\).

The null hypothesis is saying that there is no difference between the reference group and all the other groups. Basically that no matter what group, the mean outcome is the same.
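It may help to see where the “+” actually lives: in the model, not in the null hypothesis. The sketch below assumes the outcome is life expectancy (LE) and uses the indicator names from the question; the conditioning label “group” is my own shorthand.

\[ \begin{aligned} \text{Model}: & \quad E(LE \mid \text{group}) = \beta_0 + \beta_1 I(\text{america}) + \beta_2 I(\text{asia}) + \beta_3 I(\text{europe}) \\ \text{Under } H_0: & \quad E(LE \mid \text{group}) = \beta_0 \text{ for every group} \end{aligned} \]

Each \(\beta_j\) is the difference in mean outcome between that group and the reference group, so setting all of them to 0 leaves every group with the same mean, \(\beta_0\).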

6. Also: how do we set up the model to determine the difference in means between successive levels of an ordinal variable (1-2, 2-3, etc.?)

For ordinal variables, you can find the mean outcome for each level, then subtract one from the other. For example, mean life expectancy for low income would be \(E(LE|X=1) = \widehat{\beta_0} + \widehat{\beta_1}(1)\) and mean life expectancy for upper middle income would be \(E(LE|X=3) = \widehat{\beta_0} + \widehat{\beta_1}(3)\). And then we can just take the difference: \(E(LE|X=3) - E(LE|X=1) = 2\widehat{\beta_1}\).

The trick is in the confidence intervals! We need to find the standard error for the difference in expected values! We can start with the variance:

\[ \begin{aligned} Var \big( E(LE|X=3) - & E(LE|X=1) \big)\\ = & Var \big( \widehat{\beta_0} + \widehat{\beta_1}(3) - \widehat{\beta_0} - \widehat{\beta_1}(1) \big)\\ = & Var \big( \widehat{\beta_1}(3) - \widehat{\beta_1}(1) \big)\\ = & Var \big( \widehat{\beta_1}(3-1) \big) \\ = & Var \big( 2 \widehat{\beta_1} \big) \\ = & 4 Var \big( \widehat{\beta_1} \big) \end{aligned} \]

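Taking the square root, the standard error of the difference is \(2 \, SE(\widehat{\beta_1})\). We already know the standard error of \(\widehat{\beta_1}\) from the regression output, so from there we can calculate the confidence interval.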
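Putting the pieces together, a 95% confidence interval for the difference could be sketched as below. This assumes a simple linear regression with \(n\) observations, so the critical value \(t^*\) comes from a t-distribution with \(n-2\) degrees of freedom.

\[ \begin{aligned} SE \big( 2 \widehat{\beta_1} \big) = & \sqrt{4 Var \big( \widehat{\beta_1} \big)} = 2 \, SE \big( \widehat{\beta_1} \big) \\ \text{95\% CI}: & \quad 2 \widehat{\beta_1} \pm t^*_{n-2} \times 2 \, SE \big( \widehat{\beta_1} \big) \end{aligned} \]

In other words, both the estimate and its interval are just the slope and the slope’s interval multiplied by 2.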

Muddy Points from 2024

7. What is the relationship between the ANOVA for linear regression and ANOVA for group differences?

8. The limitations of the different tests, like an F statistic vs. t-test

9. unexplained vs the explained

10. Changing the confidence level in `tidy()`

Here is a good site about the input! Looks like we would use the `conf.level` argument (along with `conf.int = TRUE`) to change the 95% confidence interval to some other percent.
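A minimal sketch of what that could look like, assuming the `broom` package is installed and using R’s built-in `mtcars` data purely for illustration:

```r
library(broom)

# Illustrative model on built-in data (not the model from class)
model <- lm(mpg ~ wt, data = mtcars)

# conf.int = TRUE adds conf.low and conf.high; the default level is 95%
ci_95 <- tidy(model, conf.int = TRUE)

# conf.level changes the confidence level, here to 90%
ci_90 <- tidy(model, conf.int = TRUE, conf.level = 0.90)

ci_95
ci_90
```

The estimates are identical in both tables; only the `conf.low` and `conf.high` columns change, with the 90% intervals narrower than the 95% ones.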