Muddy Points
Lesson 8: SLR Model Diagnostics
Muddy Points from Winter 2026
1. The Q-Q plots are still a bit confusing to me. I understand we want them to fall on the line; however, I have a hard time distinguishing between which ones are too abnormal.
I would check out this muddy point more.
2. Why transform in statistical models if it screws up interpretability?
It doesn’t necessarily screw up interpretability! It depends on the transformation. For example, if we log-transform X, then the interpretation of the coefficient becomes “for a one-unit increase in log(X), we expect a \(\beta\) increase in Y.” This is still interpretable, but we need to be careful about how we phrase it.
The more transformations we do, the harder interpretation becomes. For example, if we log-transform both X and Y, then the interpretation becomes “for a one-unit increase in log(X), we expect a \(\beta\) increase in log(Y).” This is still interpretable, but it is more abstract.
I feel that I may have misled us slightly during lecture. Transformations are useful when needed! I have found that newer analysts tend to jump to transformations as a way to fix any slight issue with the LINE assumptions, and I wanted to discourage us from doing that. Transformations should be used when there are clear violations of the LINE assumptions that cannot be fixed with other methods (like adding variables).
3. still feeling confused about the first poll everywhere question - why would we do log(x) and not log(y) in that situation where the graph is looking sort of exponential?
Oh, I think log(Y) is a good option! I would probably start with that, but I think the only answer choices were transformations of X.
4. For a model where we wanted to include the X^3 variable, do we need to keep the X^2 in the model as well, or just the base X(^1)?
You need to include both \(X^2\) and \(X^1\) in the model. This is because the higher-order terms are interpreted conditional on the lower-order terms, so a model with \(X^3\) should also contain \(X^2\) and \(X\).
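As a quick sketch (simulated data of my own choosing, fit by ordinary least squares with NumPy), a "hierarchical" cubic model includes the intercept, \(X\), \(X^2\), and \(X^3\) columns together in the design matrix:

```python
# Fitting a cubic model with ALL lower-order terms included.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=1000)
# True relationship: Y = 1 + 0.5*X - 2*X^2 + 0.8*X^3 + noise
y = 1 + 0.5 * x - 2 * x**2 + 0.8 * x**3 + rng.normal(0, 0.2, size=1000)

# Design matrix with the intercept and x, x^2, x^3 -- not just x^3:
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))  # estimates close to the true (1, 0.5, -2, 0.8)
```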
5. what is the point of trying to fit a linear regression to data that are not linear? or in what situations would we want to do that?
Linear regression is a very flexible tool. Even if the relationship between X and Y is not linear, we can still use “linear” regression: “linear” refers to linearity in the coefficients, not in the variables. By transforming X and/or Y, we can use linear regression to model non-linear relationships.
6. Would we do the transformation on the points? Would the transformation be on the line of best fit?
Transformations are applied to the X and/or Y variables, which changes the observations (points). The line of best fit is then re-estimated on the transformed scale, so it changes too.
Muddy Points from Winter 2025
1. using gladder()
gladder will show you what the transformation of a single variable looks like. We can use it as a visual assessment to determine which transformations we might want to try for a variable.
I showed it for both FLR and LE. Note that I did it for each variable separately! Then I decided LE and FLR, separately, might benefit from a squared or cubed transformation.
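For anyone working outside Stata, here is a rough Python analog of the idea behind gladder() (my own sketch, with simulated left-skewed data rather than the course's LE or FLR variables): apply each rung of the ladder of powers to one variable and compare how symmetric the result is. Skewness closest to 0 suggests the most promising rung.

```python
# Rough analog of gladder(): compare skewness across ladder-of-powers
# transformations of a single variable. Standard library only.
import math
import random
from statistics import mean, stdev

def skewness(xs):
    """Standardized third moment: 0 for symmetric data."""
    m, s = mean(xs), stdev(xs)
    return mean([((v - m) / s) ** 3 for v in xs])

random.seed(4)
# Simulate a left-skewed variable (life-expectancy-like shape):
data = [100 - random.expovariate(1 / 15) for _ in range(1000)]
data = [v for v in data if v > 0]  # keep values valid for log and sqrt

ladder = {
    "cubic":  lambda v: v ** 3,
    "square": lambda v: v ** 2,
    "raw":    lambda v: v,
    "sqrt":   lambda v: math.sqrt(v),
    "log":    lambda v: math.log(v),
}
for name, f in ladder.items():
    print(f"{name:>6}: skew = {skewness([f(v) for v in data]):+.2f}")
```

For left-skewed data like this, going “up” the ladder (square, cubic) pulls the skewness toward 0, which mirrors what the gladder() panels show visually.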
2. When going “up” or “down” the ladder, do we include all the items on the way (i.e., add squared and cubed if we want to get to cubed) or just the one we want in our model?
I suggest trying all the ones on the way. This is why I like gladder(). Instead of making a choice on going “up” or “down,” we can look at all the plots and see how each transformation will help us make the variable more normally distributed.
3. Transformations - in ‘real life’, would you try transforming X alone, Y alone, and X and Y together? Or was that just an example for today’s lesson?
Yes, that is a good order of things! We will talk more about the reasoning for X first when we get to multiple linear regression. The main point is that transforming our covariates (X’s) will not impact the linear relationship between other X’s and the outcome (Y). If we transform Y first, then we need to make sure all X’s have maintained their linear relationship with the transformed Y.
4. Why do we care about transforming data, especially if it is not recommended to use it when explaining to audience?
There are cases where the LINE assumptions are blatantly broken. When there are obvious issues, especially with linearity, then we need to make a transformation.
Some fields typically use transformations because of known properties of the data. For example, gene expression data are often log-transformed. In this case, heteroscedasticity is inherent in the data, and the log transformation stabilizes the variance (restoring homoscedasticity).
5. Are outliers and high leverage points synonymous with one another? I get the general gist that they are values far away from X_bar, but what is the difference between the two?
They are NOT synonymous. Only high-leverage points are observations far from \(\overline{X}\). Outliers are observations that do not follow the general trend of the other observations. This means an outlier can be right at \(\overline{X}\) but have a Y-value that falls very far from the fitted line \(\widehat{Y}\).