Lesson 12: Assessing Model Fit
2024-05-13
Use the Pearson residual statistic to determine if our preliminary final model fits the data well
Use the Hosmer and Lemeshow goodness-of-fit statistic to determine if our preliminary final model fits the data well
Use the ROC-AUC to determine how well the model predicts the binary outcome
Apply AIC and BIC as summary measures to make additional comparisons between potential models
Outcome variable: any fracture in the first year of follow up (FRACTURE: 0 or 1)
Risk factor/variable of interest: history of prior fracture (PRIORFRAC: 0 or 1)
Potential confounder or effect modifier: age (AGE, a continuous variable)
Fitted model with interactions: \[\begin{aligned} \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = \widehat\beta_0 &+ &\widehat\beta_1\cdot I(\text{PF}) & + &\widehat\beta_2\cdot Age& + &\widehat\beta_3 \cdot I(\text{PF}) \cdot Age \\ \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = -1.376 &+ &1.002\cdot I(\text{PF})& + &0.063\cdot Age& -&0.057 \cdot I(\text{PF}) \cdot Age \end{aligned}\]
For individuals 69 years old, the estimated odds of a new fracture for individuals with prior fracture is 2.72 times the estimated odds of a new fracture for individuals with no prior fracture (95% CI: 1.70, 4.35). As seen in Figure 1 (a), the odds ratio of a new fracture when comparing prior fracture status decreases with age, indicating that the effect of prior fracture on new fractures decreases as individuals get older. In Figure 1 (b), it is evident that for both prior fracture statuses, the predicted probability of a new fracture increases as age increases. However, the predicted probability of a new fracture for those without a prior fracture increases at a higher rate than that of individuals with a prior fracture. Thus, the predicted probabilities of a new fracture converge at approximately age 87, where the interaction term cancels out the prior-fracture effect (69 + 1.002/0.057 ≈ 86.6).
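As a quick check of these numbers, here is a minimal R sketch (assuming the rounded fitted coefficients above, with age centered at 69) that computes the odds ratio at a given age and the age at which the two predicted-probability curves cross:

```r
# odds ratio for prior fracture as a function of age, from the fitted
# coefficients above (age centered at 69; values rounded to 3 decimals)
b1 <- 1.002   # coefficient for I(PF)
b3 <- -0.057  # coefficient for the I(PF) * Age interaction
or_at <- function(age) exp(b1 + b3 * (age - 69))
or_at(69)      # ~2.72, matching the estimate above
69 - b1 / b3   # age where OR = 1 and the curves converge (~86.6)
```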
Variable selection is no longer our focus at this stage
Assessing the Goodness of Fit (Model Fit)
Assess how well our fitted logistic regression model predicts/estimates the observed outcomes
Comparison: fitted/estimated outcome vs. observed outcome
We want to measure the absolute fit: not comparing to any models, but determining if the model fits the data well
We want comparable measures of fit: if we have candidate models that are not nested
The model fits the data well if
Summary measures of the distance between the predicted/estimated/fitted and observed Y are small
The contribution of each pair (predicted and observed) to these summary measures is unsystematic and is small relative to the error structure of the model
Need both components
What do we need to calculate them?
Need to define what the fitted outcome is
Need to calculate how close the fitted outcome is to the observed outcome
Summarize across all observations (or individuals’ data)
Two tests of goodness-of-fit
In logistic regression model, we estimate \(\pi(\mathbf{X}) = P(Y=1|\mathbf{X})\)
However, we always observe \(Y=1\) or \(Y=0\)
We can determine the fitted outcome by sampling Y’s from a Bernoulli distribution with the fitted probability
If there are groups of individuals that share the same covariate observations, then we can use the same \(\widehat\pi(\mathbf{X})\)
When the logistic regression model contains only categorical covariates, we can think of the number of covariate patterns
For example: if the model contains two binary covariates (history of fracture and smoking status), there will be 4 unique combinations of these factors
We can then compute the predicted number of individuals with Y=1 in each group, and compare that with the actual observed number of individuals with Y=1 in that group
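A minimal R sketch of this comparison, assuming a model with the two binary covariates just mentioned (`smoke` is an illustrative variable name; the GLOW data are used only for concreteness, and `fracture` is assumed to be a two-level factor as in the code later in this lesson):

```r
# fit a model with two binary covariates, giving 4 covariate patterns
fit2 <- glm(fracture ~ priorfrac + smoke, data = glow, family = binomial)
y    <- as.numeric(glow$fracture) - 1            # observed 0/1 outcomes
pat  <- interaction(glow$priorfrac, glow$smoke)  # the 4 covariate patterns
# observed vs. model-based (expected) counts of Y = 1 in each pattern
data.frame(observed = tapply(y, pat, sum),
           expected = tapply(fitted(fit2), pat, sum))
```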
In a logistic regression model, we can use the Pearson residual as a summary measure of goodness of fit. It uses the fitted value \(\widehat{Y}_j = m_j\widehat{\pi}_j\) from the previous slide.
Pearson residual for the jth covariate pattern is: \[r\left(Y_j,{\hat{\pi}}_j\right)=\frac{Y_j-m_j{\hat{\pi}}_j}{\sqrt{m_j{\hat{\pi}}_j(1-{\hat{\pi}}_j)}}=\frac{Y_j-{\hat{Y}}_j}{\sqrt{{\hat{Y}}_j(1-{\hat{\pi}}_j)}}\]
The summary statistic based on the Pearson residuals is then: \[X^2=\sum_{j=1}^{J}{r\left(Y_j,{\hat{\pi}}_j\right)^2}\]
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses: the same for all goodness-of-fit tests ( \(H_0\): the model fits the data well vs. \(H_A\): the model does not fit the data well)
Calculate the test statistic and p-value
Write a conclusion to the hypothesis test
I do not see this as the main way to determine goodness of fit… for a binary outcome!
Assume the current model has \(p\) covariates and \(J\) covariate patterns…
then \(X^2\) (the sum of squared Pearson residuals) approximately follows a chi-squared distribution with \(J - (p+1)\) degrees of freedom
Only appropriate if the number of covariate patterns is much smaller than the number of observations (a small sketch of the calculation follows)
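A minimal sketch of the calculation, reusing the per-pattern objects from the earlier two-binary-covariate example (`fit2`, `y`, and `pat` are the illustrative names defined there):

```r
Yj  <- tapply(y, pat, sum)               # observed successes per pattern
mj  <- tapply(y, pat, length)            # pattern sizes m_j
pij <- tapply(fitted(fit2), pat, mean)   # fitted probability per pattern
r   <- (Yj - mj * pij) / sqrt(mj * pij * (1 - pij))  # Pearson residuals
X2  <- sum(r^2)                          # summary statistic
J   <- length(Yj)                        # number of covariate patterns
p   <- length(coef(fit2)) - 1            # number of covariates
pchisq(X2, df = J - (p + 1), lower.tail = FALSE)     # approximate p-value
```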
Use the Hosmer-Lemeshow (HL) test if the number of covariate patterns is roughly the same as the number of observations (e.g., when the model contains a continuous covariate)
However, HL test does not work well if the number of covariate patterns is small
Steps to compute the HL test statistic: sort the observations by their fitted probabilities and partition them into \(g\) groups of roughly equal size, typically \(g = 10\) (the “deciles of risk”); a sketch of the calculation follows the definitions below.
The test statistic of the Hosmer-Lemeshow goodness-of-fit test is denoted by \(\widehat{C}\), which is obtained by calculating the Pearson chi-squared statistic from the \(g \times 2\) table of observed and estimated expected frequencies \[\hat{C}=\sum_{k=1}^{g}\frac{\left(o_k-n_k^\prime{\bar{\pi}}_k\right)^2}{n_k^\prime{\bar{\pi}}_k(1-{\bar{\pi}}_k)}\]
Let \(c_k\) be the number of covariate patterns and \(n_k^\prime\) the number of subjects in the \(k\)th decile: \[o_k=\sum_{j=1}^{c_k}y_j\] and \[{\bar{\pi}}_k=\sum_{j=1}^{c_k}\frac{m_j{\hat{\pi}}_j}{n_k^\prime}\]
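A minimal sketch of this computation by deciles of risk, assuming 0/1 outcomes `y` and fitted probabilities `pi_hat` (illustrative names; in practice `hoslem.test()` below handles the details, including ties in the fitted values):

```r
g      <- 10
breaks <- unique(quantile(pi_hat, probs = seq(0, 1, length.out = g + 1)))
decile <- cut(pi_hat, breaks = breaks, include.lowest = TRUE)
o_k    <- tapply(y, decile, sum)        # observed events per group
n_k    <- tapply(y, decile, length)     # group sizes n'_k
pi_bar <- tapply(pi_hat, decile, mean)  # mean fitted probability per group
C_hat  <- sum((o_k - n_k * pi_bar)^2 / (n_k * pi_bar * (1 - pi_bar)))
pchisq(C_hat, df = g - 2, lower.tail = FALSE)  # p-value with g - 2 df
```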
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses: the same for all goodness-of-fit tests
Calculate the test statistic and p-value
Write a conclusion to the hypothesis test
Okay, so let’s look at the interaction model from last class \[\text{logit}\left(\pi(\mathbf{X})\right) = \beta_0 + \beta_1\cdot I(\text{PF}) +\beta_2\cdot Age + \beta_3 \cdot I(\text{PF}) \cdot Age\]
We need to fit the model and use a new command:

# fit the interaction model (priorfrac*age_c expands to the main effects
# plus their interaction, so the formula below is equivalent)
glow_m3 = glm(fracture ~ priorfrac + age_c + priorfrac*age_c,
              data = glow, family = binomial)

library(ResourceSelection)                # provides hoslem.test()
obs_vals = as.numeric(glow$fracture) - 1  # convert the outcome factor to 0/1
fit_vals = fitted(glow_m3)                # fitted probabilities
hoslem.test(obs_vals, fit_vals, g = 10)   # HL test with g = 10 groups
Hosmer and Lemeshow goodness of fit (GOF) test
data: obs_vals, fit_vals
X-squared = 6.778, df = 8, p-value = 0.5608
With a p-value of 0.5608 \(> \alpha = 0.05\), we fail to reject \(H_0\): there is no evidence that the model fits the data poorly.
R may give results for the HL test even if it is not appropriate to use it!
With very large samples, the HL test has “too much” power in hypothesis testing: it may reject \(H_0\) even when departures from fit are negligible. One remedy is to increase the number of groups \(g\).
For example, if one has a sample with \(n = 10{,}000\) (sample size) and \(n_1 = 1{,}000\) (number of events), then \(g = 500\) groups are suggested
For \(n > 25{,}000\), other methods are suggested, such as partitioning the data into a development data set (with smaller \(n\)) and a validation set
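A minimal sketch of such a split, with illustrative names (`big_data`, outcome `y`, covariates `x1` and `x2`) and a 70/30 partition chosen only for the example:

```r
set.seed(123)  # for reproducibility
dev_idx <- sample(nrow(big_data), size = floor(0.7 * nrow(big_data)))
dev <- big_data[dev_idx, ]   # development set: fit the model here
val <- big_data[-dev_idx, ]  # validation set: assess fit here
fit <- glm(y ~ x1 + x2, data = dev, family = binomial)
# HL test on the held-out validation predictions
hoslem.test(val$y, predict(fit, newdata = val, type = "response"), g = 10)
```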
Goodness-of-fit tests should not be used for variable selection
They are not for model comparison
One should not use the p-values from goodness-of-fit tests of different models to decide which model is better than the other
Something like AUC-ROC, AIC, or BIC should be used instead
The ROC (receiver operating characteristic) curve is a plot of sensitivity (true positive rate) versus 1 − specificity (false positive rate) as the classification threshold for the fitted probabilities is varied
True Positive Rate \(= \dfrac{TP}{TP + FN}\)
False Positive Rate \(= \dfrac{FP}{FP + TN}\)
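For a single point on the ROC curve, here is a minimal sketch computing both rates at one threshold (0.5, chosen only for illustration), reusing the `obs_vals` and `fit_vals` objects from the HL test code above:

```r
pred_class <- as.numeric(fit_vals >= 0.5)   # classify at threshold 0.5
TP <- sum(pred_class == 1 & obs_vals == 1)  # true positives
FN <- sum(pred_class == 0 & obs_vals == 1)  # false negatives
FP <- sum(pred_class == 1 & obs_vals == 0)  # false positives
TN <- sum(pred_class == 0 & obs_vals == 0)  # true negatives
TP / (TP + FN)  # true positive rate (sensitivity)
FP / (FP + TN)  # false positive rate (1 - specificity)
```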
Area under the ROC curve (AUC ROC) is a reasonable summary of the overall predictive accuracy of the test
The closer the curve follows the left-hand border and top border of the ROC space, the more accurate the test
The closer the curve comes to the 45-degree diagonal line, the less accurate the test
An AUC = 0.5 represents an unhelpful model (no better than random guessing)
library(pROC)
library(ggplot2)  # ggroc() returns a ggplot object, so ggplot2 is needed below
# fitted probabilities from the interaction model
predicted <- predict(glow_m3, glow, type="response")
# define object to plot and calculate AUC
rocobj <- roc(glow$fracture, predicted)
auc <- round(auc(glow$fracture, predicted), 4)
# create ROC plot
ggroc(rocobj, colour = 'steelblue',
      size = 2, legacy.axes = TRUE) +
  ggtitle(paste0('ROC Curve ', '(AUC = ', auc, ')')) +
  theme(text = element_text(size = 23)) +
  xlab("False Positive Rate (1 - Specificity)") +
  ylab("True Positive Rate (Sensitivity)")
auc
We can compute the AUC for each candidate model and compare them: a good way to pick a model based on predictive power
Interpretation of the AUC: randomly pick one individual from the fractured group and one from the non-fractured group; the AUC is the probability that the model assigns a higher predicted probability to the fractured individual
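A minimal sketch checking this interpretation directly, reusing `obs_vals` and `predicted` from above; tied predictions count one-half, matching the usual c-statistic:

```r
p1 <- predicted[obs_vals == 1]  # fitted probabilities, fractured group
p0 <- predicted[obs_vals == 0]  # fitted probabilities, non-fractured group
# proportion of (fractured, non-fractured) pairs ranked correctly
mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "=="))  # ~equals the AUC
```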
AIC and BIC are two widely used measures, not based on hypothesis testing, that help select a good model
There is no hypothesis test or test-based conclusion in the comparison between two models
| Measure of fit | Equation | R code |
|---|---|---|
| Akaike information criterion (AIC) | \(AIC = -2 \cdot \text{log-likelihood} + 2q\) | AIC(model_name) |
| Bayesian information criterion (BIC) | \(BIC = -2 \cdot \text{log-likelihood} + q\log(n)\) | BIC(model_name) |
Where q is the number of parameters in the model and n is the sample size
Both AIC and BIC can only be used to compare models fitting the same data set
In comparing two models, the model with smaller AIC and/or BIC is preferred
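For example, a minimal sketch comparing the interaction model with a main-effects-only competitor (`glow_m2` is fit here purely for illustration):

```r
# main-effects-only competitor model, fit to the same data
glow_m2 <- glm(fracture ~ priorfrac + age_c, data = glow, family = binomial)
AIC(glow_m2, glow_m3)  # smaller AIC is preferred
BIC(glow_m2, glow_m3)  # smaller BIC is preferred
```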
| Measure of fit | Hypothesis tested? | Equation | R code |
|---|---|---|---|
| Pearson residual | Yes | \(X^2=\sum_{j=1}^{J}{r\left(Y_j,{\hat{\pi}}_j\right)^2}\) | Not given |
| Hosmer-Lemeshow test | Yes | \(\hat{C}=\sum_{k=1}^{g}\frac{\left(o_k-n_k^\prime{\bar{\pi}}_k\right)^2}{n_k^\prime{\bar{\pi}}_k(1-{\bar{\pi}}_k)}\) | hoslem.test() |
| AUC-ROC | Kinda | Not given | auc(observed, predicted) |
| AIC | Only to compare models | \(AIC = -2 \cdot \text{log-likelihood} + 2q\) | AIC(model_name) |
| BIC | Only to compare models | \(BIC = -2 \cdot \text{log-likelihood} + q\log(n)\) | BIC(model_name) |
Special notes:
Use the Hosmer-Lemeshow test over the Pearson residual statistic unless the number of covariate patterns is less than 6
The Pearson residual statistic cannot be used when there is a continuous variable in the model, because nearly every observation then forms its own covariate pattern (\(J \approx n\))
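To see why the Pearson statistic fails here, a minimal sketch counting covariate patterns for the interaction model (using the `glow_m3` fit from earlier):

```r
# count unique covariate patterns (drop the response column)
J <- nrow(unique(model.frame(glow_m3)[, -1, drop = FALSE]))
c(patterns = J, observations = nrow(glow))  # with continuous age, J is close to n
```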
For our interaction model: \[\begin{aligned} \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = \widehat\beta_0 &+ &\widehat\beta_1\cdot I(\text{PF}) & + &\widehat\beta_2\cdot Age& + &\widehat\beta_3 \cdot I(\text{PF}) \cdot Age \\ \text{logit}\left(\widehat\pi(\mathbf{X})\right) & = -1.376 &+ &1.002\cdot I(\text{PF})& + &0.063\cdot Age& -&0.057 \cdot I(\text{PF}) \cdot Age \end{aligned}\]
We can examine the overall model fit using:
Not comparing to any other models: the Hosmer-Lemeshow goodness-of-fit test (the Pearson residual statistic is not appropriate because AGE is continuous)
Can be used to compare to other models: AUC-ROC, AIC, and BIC