Homework 3 Answers

BSTA 513/613

Author

Nicky Wakim

Modified

May 23, 2025

Questions Part 1

Question 1

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

You will need to use some of the mutations implemented in HW 2, Q2, Part d.

Part a

Write down the population equation for the logistic regression model of STA on AGE, CAN, CPR, and INF. How many parameters does this model contain?

Answer:

5 parameters
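
One way to write the population model, using the same indicator notation as the equation given in Question 2, is sketched below. The five parameters are the intercept \(\beta_0\) and the four slopes \(\beta_1\) through \(\beta_4\).

\[\text{logit}(\pi(\textbf{X}))=\beta_0 + \beta_1 \cdot AGE + \beta_2 \cdot I(CAN=\text{``Yes''}) + \beta_3 \cdot I(CPR=\text{``Yes''}) + \beta_4 \cdot I(INF=\text{``Yes''})\]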

Part b

Using glm(), obtain the maximum likelihood estimates of the parameters of the logistic regression model in Part a. Using these estimates, write down the equation with the fitted values.

Answer:

term estimate std.error statistic p.value conf.low conf.high
(Intercept) −3.621 0.790 −4.582 0.000 −5.321 −2.202
AGE 0.028 0.011 2.466 0.014 0.007 0.052
CANYes 0.202 0.611 0.330 0.742 −1.127 1.328
CPRYes 1.637 0.616 2.659 0.008 0.425 2.881
INFYes 0.702 0.378 1.858 0.063 −0.035 1.455

Need to write equation.
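
Plugging the estimates from the table above into the model gives the fitted equation. This is a sketch built directly from the rounded table values, so the coefficients carry only three decimal places.

\[\text{logit}(\hat{\pi}(\textbf{X})) = -3.621 + 0.028 \cdot AGE + 0.202 \cdot I(CAN=\text{``Yes''}) + 1.637 \cdot I(CPR=\text{``Yes''}) + 0.702 \cdot I(INF=\text{``Yes''})\]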

Part c

Assess the significance of the group of coefficients for all variables in the model using the likelihood ratio test. (Hint: part of the ratio in the LRT will be an intercept-only model.)

Likelihood ratio test

Model 1: STA ~ AGE + CAN + CPR + INF
Model 2: STA ~ 1
  #Df   LogLik Df  Chisq Pr(>Chisq)    
1   5  -90.204                         
2   1 -100.080 -4 19.753  0.0005586 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
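
The test statistic in the output can be reproduced by hand from the two log-likelihoods. The sketch below uses only the rounded values printed above (the variable names are ours), so it matches the reported statistic to about two decimal places.

```r
# Log-likelihoods copied from the likelihood ratio test output above
loglik_full <- -90.204   # STA ~ AGE + CAN + CPR + INF (5 parameters)
loglik_null <- -100.080  # STA ~ 1 (1 parameter)

# LRT statistic: twice the difference in log-likelihoods
lrt_stat <- 2 * (loglik_full - loglik_null)

# Degrees of freedom: number of coefficients set to zero under the null
lrt_df <- 5 - 1

# Upper-tail chi-squared probability
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

round(c(statistic = lrt_stat, df = lrt_df, p.value = p_value), 4)
```

Because the p-value is well below 0.05, we reject the null hypothesis that all four slope coefficients are zero: at least one of AGE, CAN, CPR, and INF is associated with vital status.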

Part d

Fit a new model using only CAN and INF as the predictors, including an interaction between CAN and INF. Is there evidence that our model should have an interaction between CAN and INF (Hint: this requires a formal test of the interaction)?

Answer:

term estimate std.error statistic p.value conf.low conf.high
(Intercept) −1.935 0.297 −6.521 0.000 −2.564 −1.391
CANYes 0.731 0.722 1.012 0.311 −0.855 2.067
INFYes 1.081 0.387 2.792 0.005 0.336 1.864
CANYes:INFYes −1.669 1.323 −1.262 0.207 −4.868 0.759
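
The table above does not state a conclusion, so here is one hedged reading. With a single interaction coefficient, the Wald test in the CANYes:INFYes row is one formal test of the interaction (a likelihood ratio test against the model without the interaction would be another option). The sketch reproduces the Wald statistic and p-value from the rounded estimate and standard error; the variable names are ours.

```r
# Interaction row copied from the table above
est_int <- -1.669
se_int  <- 1.323

# Wald z statistic and two-sided p-value under the standard normal
z_int <- est_int / se_int
p_int <- 2 * pnorm(-abs(z_int))

round(c(z = z_int, p.value = p_int), 3)
```

Since the p-value is about 0.21, this test gives no evidence at the 0.05 level that the interaction between CAN and INF is needed.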

Part e

Interpret the odds ratio for the main effects in the model from Part d. Please include the 95% confidence interval.

Not given

Part f

From the above model, fill out the following table for the odds ratios. Note, you will only need to report two odds ratios and you already have one from Part d & e.

| Cancer | Infection probable at ICU intake | Estimated odds ratio | 95% CI |
|---|---|---|---|
| Cancer part of present problem | No | (reference) | |
| | Yes | FILL HERE | FILL HERE |
| Cancer not part of present problem | No | (reference) | |
| | Yes | FILL HERE | FILL HERE |

This is a really good way to report odds ratios for interactions between two categorical predictors! Might want to keep this in mind for your project!!

Answer:

| Cancer | Infection probable at ICU intake | Estimated odds ratio | 95% CI |
|---|---|---|---|
| Cancer part of present problem | No | (reference) | |
| | Yes | 0.556 | Not given |
| Cancer not part of present problem | No | (reference) | |
| | Yes | 2.949 | Not given |
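
The two odds ratios in the answer come from exponentiating the infection effect within each cancer group. The sketch below assumes the coefficient estimates from Part d, rounded to three decimals, so the results can differ from the reported values in the third decimal place. The confidence interval for the cancer group needs the covariance between the INFYes and CANYes:INFYes estimates, which is not shown in this document, so it is left out.

```r
# Coefficients copied from the Part d table
b_inf     <- 1.081   # INFYes
b_can_inf <- -1.669  # CANYes:INFYes

# OR for infection (Yes vs. No) when cancer is NOT part of the present problem
or_inf_no_cancer <- exp(b_inf)

# OR for infection (Yes vs. No) when cancer IS part of the present problem:
# the interaction term is added to the main effect before exponentiating
or_inf_cancer <- exp(b_inf + b_can_inf)

round(c(no_cancer = or_inf_no_cancer, cancer = or_inf_cancer), 3)
```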

Part g

Interpret the odds ratios from the table in Part f. Please include the 95% confidence interval. What do you notice about the odds ratios (Hint: Think back to my last slides in Lesson 10: Interactions)?

Not given

Part h

Compute the predicted probability for a subject for whom cancer is not part of the present problem and infection is not probable at ICU admission. Compute the 95% confidence interval for the predicted probability. Can you use the Normal approximation?

0.126
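
For this subject both indicators are zero, so the linear predictor is just the intercept from Part d. The sketch below reproduces the predicted probability and shows one common way to get an interval: build a Wald interval on the logit scale and back-transform it. This is an assumption about method, not necessarily the approach intended in the course, which may instead apply the Normal approximation on the probability scale. Variable names are ours.

```r
# Intercept and its standard error copied from the Part d table
b0    <- -1.935
se_b0 <- 0.297

# Predicted probability: inverse logit of the linear predictor
p_hat <- plogis(b0)

# 95% Wald interval on the logit scale, then back-transformed
z_crit   <- qnorm(0.975)
ci_logit <- b0 + c(-1, 1) * z_crit * se_b0
ci_prob  <- plogis(ci_logit)

round(c(estimate = p_hat, lower = ci_prob[1], upper = ci_prob[2]), 3)
```

Back-transforming from the logit scale guarantees that both limits stay between 0 and 1, which an interval built directly on the probability scale does not.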

Part i

Interpret the predicted probability from Part h (right above), including the confidence interval.

Not given

Part j

Building off of Part e, fill out the following table for predicted probabilities. What do you notice about the predicted probabilities (Hint: Think back to my last slides in Lesson 10: Interactions)?

| Cancer | Infection probable at ICU intake | Predicted probability | 95% CI |
|---|---|---|---|
| Cancer part of present problem | No | FILL HERE | FILL HERE |
| | Yes | FILL HERE | FILL HERE |
| Cancer not part of present problem | No | FILL HERE | FILL HERE |
| | Yes | FILL HERE | FILL HERE |

Answer:

| Cancer | Infection probable at ICU intake | Predicted probability | 95% CI |
|---|---|---|---|
| Cancer part of present problem | No | | 0.002, 0.460 |
| | Yes | 0.143 | |
| Cancer not part of present problem | No | 0.126 | |
| | Yes | | 0.196, 0.401 |
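
All four predicted probabilities follow from the Part d coefficients by switching the two indicators on and off. The sketch below uses the rounded estimates and names of our own choosing. It covers point estimates only, since the intervals also need the covariance matrix of the estimates.

```r
# Coefficients copied from the Part d table
b0        <- -1.935  # (Intercept)
b_can     <- 0.731   # CANYes
b_inf     <- 1.081   # INFYes
b_can_inf <- -1.669  # CANYes:INFYes

# One row per combination of cancer and infection status (1 = Yes, 0 = No)
grid <- expand.grid(cancer = c(0, 1), infection = c(0, 1))

# Linear predictor, including the product term for the interaction
eta <- b0 + b_can * grid$cancer + b_inf * grid$infection +
  b_can_inf * grid$cancer * grid$infection

# Inverse logit turns the linear predictor into a probability
grid$prob <- round(plogis(eta), 3)
grid
```

The two point estimates reported in the answer, 0.126 and 0.143, are reproduced, and the other two land at approximately the midpoints of the reported intervals 0.002, 0.460 and 0.196, 0.401.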

Question 2

We will continue with the same dataset from Question 1 above.

We will use the model from Question 1, Part a for this question:

\[\begin{aligned}
\text{logit}(\pi(\textbf{X})) = {} & \beta_0 + \beta_1 \cdot AGE + \beta_2 \cdot I(CAN=\text{``Yes''}) + \beta_3 \cdot I(CPR=\text{``Yes''}) \\
& + \beta_4 \cdot I(INF=\text{``Yes''})
\end{aligned}\]

Part a

Assess the fit of the above model. You may use Hosmer-Lemeshow test or Pearson Residual as appropriate. Discuss your choice and interpret.

Model does not fit data well

Part b

Assess your model's ability to discriminate vital status (STA) using AUC.

AUC = 0.6912

Part c

Let’s say a colleague found a different preliminary final model than yours. Using the below model that your colleague found, compare your model to theirs using AIC and BIC.


Call:  glm(formula = STA ~ SYS + AGE + CPR + INF + AGE * CPR, family = "binomial", 
    data = icu2)

Coefficients:
(Intercept)          SYS          AGE       CPRYes       INFYes   AGE:CPRYes  
   -1.47960     -0.01343      0.02340     -3.37369      0.53449      0.08370  

Degrees of Freedom: 199 Total (i.e. Null);  194 Residual
Null Deviance:      200.2 
Residual Deviance: 172.5    AIC: 184.5

Not given
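
No comparison is given here, but the quantities can be sketched from output already shown in this document: the log-likelihood of our model from the likelihood ratio test in Question 1, Part c, and the residual deviance of the colleague's model above. The sketch assumes both models were fit to the same 200 subjects and that, for ungrouped binary data, the residual deviance equals -2 times the log-likelihood. Variable names are ours.

```r
n_obs <- 200  # 199 total degrees of freedom + 1

# Our model: STA ~ AGE + CAN + CPR + INF
loglik_ours <- -90.204
k_ours      <- 5

# Colleague's model: residual deviance = -2 * log-likelihood for 0/1 outcomes
deviance_theirs <- 172.5
k_theirs        <- 6

# AIC = -2 * logLik + 2 * k;  BIC = -2 * logLik + log(n) * k
aic <- c(ours   = -2 * loglik_ours + 2 * k_ours,
         theirs = deviance_theirs + 2 * k_theirs)
bic <- c(ours   = -2 * loglik_ours + log(n_obs) * k_ours,
         theirs = deviance_theirs + log(n_obs) * k_theirs)

round(rbind(AIC = aic, BIC = bic), 1)
```

The colleague's AIC computed this way agrees with the 184.5 printed in their output. Lower values are preferred for both criteria, so on these rounded inputs the colleague's model comes out ahead on both, with a narrower gap on BIC because BIC penalizes its extra parameter more heavily.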

Question 3

This question stems from an example from an online textbook by Dr. Ramzi W. Nahhas. The dataset for this problem includes a subset of individuals from the 2019 National Survey on Drug Use and Health (NSDUH). Overall, our study aims included investigating potential risk factors for lifetime heroin use. Lifetime heroin use is a binary outcome, which we regress on age at first use of alcohol (alc_agefirst), age with 6 categories (demog_age_cat6), and sex assigned at birth (demog_sex).

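library(here)   # here() builds file paths relative to the project root
library(dplyr)  # %>% and select()
library(tidyr)  # drop_na()
# Load the NSDUH subset, keep the model variables, and drop incomplete rows
load(here("data", "nsduh2019_adult_sub_rmph.RData"))
nsduh = nsduh_adult_sub %>% 
  dplyr::select(her_lifetime, alc_agefirst, demog_age_cat6, demog_sex) %>% 
  drop_na()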

Part a

Using the nsduh dataset from the above chunk of code, please run a regression model and present the model summary using lifetime heroin use as our outcome, and age at first use of alcohol, categorical age, and sex assigned at birth as covariates in our model. No need to write out your model; you only need to write the R code to run it.

term estimate std.error statistic p.value conf.low conf.high
(Intercept) −15.061 1,024.209 −0.015 0.988 NA 43.320
alc_agefirst −0.244 0.064 −3.821 0.000 −0.373 −0.121
demog_age_cat626-34 15.539 1,024.209 0.015 0.988 −42.842 NA
demog_age_cat635-49 15.447 1,024.209 0.015 0.988 −47.994 NA
demog_age_cat650-64 15.602 1,024.209 0.015 0.988 −42.779 NA
demog_age_cat665+ 15.363 1,024.209 0.015 0.988 −43.018 NA
demog_sexFemale −1.238 0.653 −1.897 0.058 −2.727 −0.076

Part b

Are we encountering a numerical problem with our regression? If yes, please name the numerical issue. What first clued you into that issue? Provide conclusive evidence of this numerical issue (with a contingency table), and explain which variable(s) are causing this problem.

Yes
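
The pattern in the Part a table, with estimates near 15 in absolute value and standard errors above 1,000 for the intercept and every age-category term, is what a zero cell in the outcome-by-category table typically produces. The sketch below uses made-up toy data (not the NSDUH data) to show how a contingency table exposes the empty cell and how the standard errors blow up in the fitted model. All names and values here are hypothetical.

```r
# Toy data (hypothetical): nobody in group "A" has the outcome
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 20)),
  y     = c(rep(0, 20), rep(c(0, 1), times = c(12, 8)))
)

# Contingency table: look for a zero count
cell_counts <- table(group = toy$group, outcome = toy$y)
cell_counts

# glm() may warn about fitted probabilities of 0 or 1; suppressed for the demo
toy_fit <- suppressWarnings(glm(y ~ group, family = binomial, data = toy))

# Standard errors are very large for the terms tied to the empty cell
toy_se <- summary(toy_fit)$coefficients[, "Std. Error"]
toy_se
```

For the homework data, the analogous check would be a table of lifetime heroin use against the categorical age variable, looking for an age category with no observations in one of the outcome levels.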

Part c

What would you do to “fix” this numerical issue? Please apply your “fix” and rerun the regression.

Not given

Questions Part 2

The following questions are intended to give you practice in connecting concepts that will help you make decisions in real world applications.

Question 4

Using a similar table to the one in Lesson 8, go back through the parts in this homework and determine which test can be run.

| | Wald test/CI | Score test | LRT |
|---|---|---|---|
| Question 1, Part c: testing group of variables | | | |
| Question 1, Part d: testing interaction | | | |
| Question 1, Part e: creating 95% CI for main effects | | | |
| Question 1, Part f: creating 95% CI for odds ratios | | | |

One row filled out:

| | Wald test/CI | Score test | LRT |
|---|---|---|---|
| Question 1, Part c: testing group of variables | No | No | Yes |
| Question 1, Part d: testing interaction | | | |
| Question 1, Part e: creating 95% CI for main effects | | | |
| Question 1, Part f: creating 95% CI for odds ratios | | | |

Question 5

Look back at our slides for interactions, particularly the last slide.

Part a

In the lecture, we investigated an interaction between a binary variable and a continuous variable. In Question 1 of this homework, we investigated an interaction between two binary variables. What about the variable types was different? How did this change the presentation of estimated odds ratios and predicted probabilities?

Part b

How would you present the odds ratios and predicted probabilities if we had an interaction between a 3-category multilevel variable and a continuous variable?

Part c

How would you present the odds ratios and predicted probabilities if we had an interaction between a 4-category multilevel variable and a binary variable?