Homework 4 Answers

BSTA 513/613

Author

Nicky Wakim

Modified

May 23, 2025

Questions Part 1

Question 1

In this problem, we will practice performing model diagnostics in a logistic regression model. You will need to download and source the Logistic_Dx_Functions.R file from the shared Data folder.

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

You will need to use some of the mutations implemented in HW 2, Q2, Part d.

We will use the following model: \[\text{logit}(\pi(\textbf{X}))=\beta_0 + \beta_1 \cdot I(CAN=\text{``Yes"}) + \beta_2 \cdot I(CPR=\text{``Yes"}) + \\ \beta_3 \cdot I(INF=\text{``Yes"})\]

icu = read_csv(here("data", "icu.csv"))
icu1 = icu %>% mutate(STA = as.factor(STA) %>% relevel(ref = "Lived"))
icu2 = icu1 %>% mutate(CAN = as.factor(CAN) %>% relevel(ref = "No"), 
                     CPR = as.factor(CPR) %>% relevel(ref = "No"), 
                     INF = as.factor(INF) %>% relevel(ref = "No"), 
                     LOC = as.factor(LOC) %>% 
                       relevel(ref = "No Coma or Deep Stupor"))

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	−1.924	0.284	−6.764	0.000	−2.521	−1.399
CANYes	0.093	0.614	0.151	0.880	−1.246	1.219
CPRYes	1.514	0.602	2.516	0.012	0.326	2.730
INFYes	0.807	0.371	2.176	0.030	0.084	1.547

Rows: 7
Columns: 16
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, 1, 1
$ CANYes        <dbl> 0, 0, 1, 0, 1, 0, 1
$ CPRYes        <dbl> 0, 0, 0, 1, 0, 1, 1
$ INFYes        <dbl> 1, 0, 1, 0, 0, 1, 1
$ y             <dbl> 17, 12, 1, 1, 3, 6, 0
$ P             <dbl> 0.2465297, 0.1273696, 0.2641879, 0.3989183, 0.1380566, 0…
$ n             <int> 69, 99, 6, 4, 13, 8, 1
$ yhat          <dbl> 17.0105460, 12.6095911, 1.5851272, 1.5956732, 1.7947357,…
$ Pr            <dbl> -0.002945748, -0.183769303, -0.541794689, -0.608232077, …
$ dr            <dbl> -0.002945953, -0.185061578, -0.568559201, -0.627265966, …
$ h             <dbl> 0.013276657, 0.008995739, 0.074100939, 0.093495654, 0.04…
$ sPr           <dbl> -0.0029655, -0.1846015, -0.5630577, -0.6388286, 0.991157…
$ sdr           <dbl> -0.002965706, -0.185899619, -0.590872626, -0.658820007, …
$ dChisq        <dbl> 8.794188e-06, 3.407771e-02, 3.170340e-01, 4.081020e-01, …
$ dDev          <dbl> 8.795411e-06, 3.455867e-02, 3.491305e-01, 4.340438e-01, …
$ dBhat         <dbl> 1.183284e-07, 3.093369e-04, 2.537265e-02, 4.209110e-02, …

Part a

How many covariate patterns does the regression equation have?

Not given

Part b

Plot the change in standardized Pearson residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.

ggplot(dx_icu) + 
  geom_hline(yintercept = 4, col = "red", linetype="dashed", linewidth=1) +
  geom_point(aes(x=P, y=dChisq), size = 3) + 
  xlab("Estimated/Predicted Probability of Death") +
  ylab("Change in Std. Pearson Residual") +
  theme(text = element_text(size = 18)) + 
  xlim(0,1)

Part c

Plot the change in standardized Deviance residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.

Not given

Part d

Plot the change in coefficient estimate by predicted probability. Do you notice any influential points? Explain your reasoning.

Not given

Part e

Plot the leverage by predicted probability. Do you notice any influential points? Explain your reasoning.

Not given

Question 2

In this problem, we will practice fitting and interpreting a log-binomial regression.

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

You will need to use some of the mutations implemented in HW 2, Q2, Part d.

Part a

Write down the population equation for the log-binomial regression model of STA on AGE, CAN, CPR, and INF. How many parameters does this model contain?

This model contains 4 parameters.

Part b

Try using glm() to obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a. Do you run into any issues using glm()? If you try logbin(), does it fix the issues? Explain why and what warnings logbin() gives you.

Hint: keep the glm() function in its own code chunk so you can add #| eval: false. glm() may throw an error, so we want to show the work for glm() even though it’ll break your qmd rendering.

glm() does not work. logbin() has warnings.

Part c

Using logbin(), obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a, but now take out age. Using these estimates, write down the equation with the fitted values.


Call:
logbin(formula = STA ~ CAN + CPR + INF, data = icu2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4845  -0.7570  -0.5297  -0.5197   2.0886  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.0332     0.2374  -8.566  < 2e-16 ***
CANYes       -0.1478     0.4826  -0.306 0.759330    
CPRYes        0.9859     0.2874   3.430 0.000603 ***
INFYes        0.6434     0.2910   2.211 0.027045 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 186.90  on 196  degrees of freedom

  AIC: 194.9 
AIC_c: 195.1 

Number of iterations: 21093 (best: 480)

Part d

Interpret the exponential of the coefficient (risk ratio) estimate for CPR.

Not given

Part e

We were not able to fit a model with age, but let’s just entertain a scenario here. Let’s say we fit the model in Part a and got a coefficient estimate of 0.29 with a 95% confidence interval of 0.23 to 0.35. Using the model is Part a, interpret the exponential of the coefficient (risk ratio) estimate for AGE.

Not given

Questions Part 2

Question 3

For each of the following outcomes, what type of regression would you use? Explain your answer.

Part a

Number of minutes of moderate-to-vigorous physical activity per week (range: 0 to 600+)

Which regression model is most appropriate?

Linear regression
Logistic regression
Log-binomial regression
Poisson regression
Multinomial logistic regression

Linear regression

Part b

Whether the participant meets CDC recommendations for weekly physical activity (Yes/No)

Which regression model is most appropriate?

Linear regression
Logistic regression
Log-binomial regression
Poisson regression
Multinomial logistic regression

Not given

Part c

Number of workouts the person completed in the past week (values: 0, 1, 2, …, 14)

Which regression model is most appropriate?

Linear regression
Logistic regression
Log-binomial regression
Poisson regression
Multinomial logistic regression

Not given

Part d

Self-reported activity level: Sedentary, Moderately active, Highly active

Which regression model is most appropriate?

Linear regression
Logistic regression
Log-binomial regression
Poisson regression
Multinomial logistic regression

Not given