2024-04-15
Recognize why the tests we’ve learned so far are not flexible enough for continuous covariates or multiple covariates.
Recognize why linear regression cannot be applied to categorical outcomes with two levels
Identify the simple logistic regression model and define key notation in statistics language
Connect linear and logistic regression to the larger group of models, generalized linear model
Determine coefficient estimates using maximum likelihood estimation (MLE) and apply it in R
Question: Is race/ethnicity and/or age associated with an individual’s diagnosed stage of breast cancer?
You can take a look at the Breast Cancer Research Foundation’s page: Understanding Breast Cancer Racial Disparities
Big contributors to racial disparities include:
Our analysis will not be new, but this kind of work has shed light on the importance of focused research on people of color to better serve people of color who develop breast cancer
Contingency table does not work for…
Goal: model the probability of our outcome (\(\pi(X)\)) with the covariate (\(X_1\))
In simple linear regression, we use the model in its various forms: \[\begin{aligned} Y&=\beta_0+\beta_1X_1+\epsilon \\ E[Y|X] &= \beta_0 + \beta_1X_1 \\ \widehat{Y} &= \widehat{\beta}_0 + \widehat{\beta}_1X_1 \end{aligned}\]
Potential problem? Probabilities can only take values from 0 to 1
Outcome: \(Y\) - binary (two-level) categorical variable
Covariate: \(X_1\)
Probability of outcome for individual with observed covariates
Because the expected value is a weighted average, we can say: \[\begin{aligned} E(Y|X) & = P(Y=1|X) \cdot 1 + P(Y=0|X) \cdot 0 \\ & = P(Y=1|X) \\ & = \pi(X) \end{aligned}\]
The (population) regression model is denoted by:
\[Y = \beta_0 + \beta_1X + \epsilon\]
| Symbol | Meaning |
|---|---|
| \(Y\) | response, outcome, dependent variable |
| \(\beta_0\) | intercept |
| \(\beta_1\) | slope |
| \(X\) | predictor, covariate, independent variable |
| \(\epsilon\) | residuals, error term |
Assumptions of the linear regression model:
Independence: observations are independent
Linearity: linear relationship between \(E[Y|X]\) and \(X\) \[E[Y|X] = \beta_0 + \beta_1 \cdot X\]
Normality and homoscedasticity assumption for residuals (\(\epsilon\)): \(\epsilon \overset{iid}{\sim} N(0, \sigma^2)\)
Which assumptions are violated if dependent variable is categorical?
The relationship between the variables is linear (a straight line):
The independent variable \(X\) can take any value, while \(\pi(X)\) is a probability that should be bounded by [0,1]
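As a quick illustration with hypothetical coefficients (not estimated from our data), suppose \(\beta_0 = 0.2\) and \(\beta_1 = 0.1\). Then an individual with \(X = 10\) would have fitted value \[\widehat{\pi}(10) = 0.2 + 0.1 \cdot 10 = 1.2 > 1,\] an impossible probability.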
In linear regression, \(\text{var}(\epsilon) = \sigma^2\)
When Y is a binary outcome \[\begin{aligned} \text{var}\left(Y\right) & =\pi\left(1-\pi\right)\\ & = \left(\beta_0+\beta_1X\right)\left(1-\beta_0-\beta_1X\right) \end{aligned}\]
Because variance depends on \(X\): no homoscedasticity
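The identity \(\text{var}(Y)=\pi(1-\pi)\) follows directly from \(Y\) being a 0/1 (Bernoulli) variable, since \(Y^2 = Y\): \[\text{var}(Y) = E(Y^2) - [E(Y)]^2 = \pi - \pi^2 = \pi(1-\pi)\]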
Answer: We need to transform the probability of the outcome so that it can be equated to a linear function of the covariates, which can take any real value
The (population) regression model is denoted by:
\[ \text{logit} (\pi) = \beta_0 + \beta_1X\]
| Symbol | Meaning |
|---|---|
| \(\pi\) | probability that the outcome occurs (\(Y=1\)) given \(X\) |
| \(\beta_0\) | intercept |
| \(\beta_1\) | slope |
| \(X\) | predictor, covariate, independent variable |
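The logit is the log-odds transformation; it maps a probability in \((0,1)\) to the whole real line, which is what lets us equate it to a linear predictor: \[\text{logit}(\pi) = \ln\left(\frac{\pi}{1-\pi}\right), \qquad \pi = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]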
Generalized Linear Models are a class of models that includes regression models for continuous and categorical responses
Here we will focus on the GLMs for categorical/count data
Logistic regression is just one type of GLM
Poisson regression – for counts
Log-binomial can be used to focus on risk ratio
Basically, we are just identifying the distribution for our outcome
If Y is binary: assumes a binomial distribution of Y
If Y is count: assumes Poisson or negative binomial distribution of Y
If Y is continuous: assumes a Normal distribution of Y
A GLM has three components: a random component (the distribution of \(Y\)), a systematic component (the linear predictor), and a link function
If \(\mu = E(Y)\), then the link function specifies a function \(g(.)\) that relates \(\mu\) to the linear predictor as: \[g\left(\mu\right)=\beta_0+\beta_1X_1+\ldots+\beta_kX_k\]
The link function connects the random component with the systematic component
Can also think of this as: \[\mu=g^{-1}\left(\beta_0+\beta_1X_1+\ldots+\beta_kX_k\right)\]
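For the outcome distributions mentioned above, the usual (canonical) link functions are: \[\begin{aligned} \text{Normal: } & g(\mu) = \mu && \text{(identity link; linear regression)} \\ \text{Binomial: } & g(\mu) = \ln\left(\frac{\mu}{1-\mu}\right) && \text{(logit link; logistic regression)} \\ \text{Poisson: } & g(\mu) = \ln(\mu) && \text{(log link; Poisson regression)} \end{aligned}\]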
The (population) regression model is denoted by:
\[ \text{logit} (\pi) = \beta_0 + \beta_1X\]
| Symbol | Meaning |
|---|---|
| \(\pi\) | probability that the outcome occurs (\(Y=1\)) given \(X\) |
| \(\beta_0\) | intercept |
| \(\beta_1\) | slope |
| \(X\) | predictor, covariate, independent variable |
Maximum likelihood: yields values for the unknown parameters that maximize the probability of obtaining the observed set of data
Within a dataset with n subjects, for the \(i\)th subject:
if \(Y_i=1\), the contribution to the likelihood function is \(\pi\left(X_i\right)\)
if \(Y_i=0\), the contribution to the likelihood function is \(1-\pi\left(X_i\right)\)
The likelihood function is the product of these contributions:
\[l(\beta_0, \beta_1) = \prod_{i=1}^{n}{\pi(X_i)^{Y_i} (1-\pi(X_i)) ^ {1-Y_i}}\]
Mathematically, it is easier to work with the log likelihood function for maximization
The log likelihood function is: \[\begin{aligned}L\left(\beta_0,\beta_1\right) &=\ln{\left(l\left(\beta_0,\beta_1\right)\right)} \\ & = \sum_{i=1}^{n}\bigg[Y_i\cdot\text{ln}[\pi(X_i)] + (1-Y_i)\cdot\text{ln}[1-\pi(X_i)] \bigg] \end{aligned}\]
To find the \(\beta_0\) and \(\beta_1\) that maximize \(L\left(\beta_0,\beta_1\right)\), we differentiate \(L\) with respect to \(\beta_0\) and \(\beta_1\) and set each derivative equal to zero
Such equations are called likelihood equations.
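For simple logistic regression, the likelihood equations take the form: \[\sum_{i=1}^{n}\left[Y_i-\pi(X_i)\right] = 0, \qquad \sum_{i=1}^{n}X_i\left[Y_i-\pi(X_i)\right] = 0\]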
In logistic regression, there is no “closed form” solution to the above equations
An iterative algorithm, such as iteratively reweighted least squares (IRLS), must be used to find the MLEs for logistic regression
The `glm()` function automatically does MLE for you
Set `family` within `glm()` to `"binomial"`, which will automatically set the logit link
You can explore other algorithms (other than IRLS) to maximize the likelihood
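A minimal, self-contained sketch of fitting a logistic regression with `glm()`; the data here are simulated (variable names and true coefficient values are hypothetical, chosen for illustration):

```r
# Simulate a binary outcome whose log-odds depend linearly on x
set.seed(42)
n <- 5000
x <- rnorm(n)
true_b0 <- -1
true_b1 <- 0.5
p <- 1 / (1 + exp(-(true_b0 + true_b1 * x)))  # inverse-logit gives probabilities
y <- rbinom(n, size = 1, prob = p)

# family = binomial uses the logit link by default; glm() finds the MLEs via IRLS
fit <- glm(y ~ x, family = binomial)
coef(fit)  # estimates should land near the true values (-1, 0.5)
```

Because the data were generated from the model, the estimated intercept and slope should be close to the true values used in the simulation.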
We want to fit: \[\text{logit}(\pi(Age)) = \beta_0 + \beta_1 \cdot Age\]
```
Call:
glm(formula = Late_stage_diag ~ Age_c, family = binomial, data = bc)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.989422   0.023205  -42.64   <2e-16 ***
Age_c        0.056965   0.003204   17.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11861  on 9999  degrees of freedom
Residual deviance: 11510  on 9998  degrees of freedom
AIC: 11514

Number of Fisher Scoring iterations: 4
```
Translate the results back to an equation!
Just going to pull the coefficients so I have a reference as I create the fitted regression model:
```
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -0.9894225  0.0232055 -42.63742 0.000000e+00
Age_c        0.0569645  0.0032039  17.77974 1.014557e-70
```
We will need to reverse the transformation process in slide 24-25 to find the odds ratios
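As a preview of that reversal, exponentiating the slope converts it from the log-odds scale to an odds ratio. The sketch below plugs in the estimates printed above (interpreting the unit of `Age_c` as one year is an assumption):

```r
# Coefficient estimates copied from the fitted model output above
b0 <- -0.9894225  # intercept: log-odds of late-stage diagnosis at Age_c = 0
b1 <-  0.0569645  # slope: log odds ratio per one-unit increase in Age_c

# Odds ratio per one-unit (assumed one-year) increase in centered age
odds_ratio <- exp(b1)
round(odds_ratio, 4)  # about 1.06: each additional unit of Age_c multiplies the odds by ~1.06
```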
Lesson 5: Simple Logistic Regression