icu = read_csv(here("data", "icu.csv"))
icu1 = icu %>% mutate(STA = as.factor(STA) %>% relevel(ref = "Lived"))
icu2 = icu1 %>% mutate(CAN = as.factor(CAN) %>% relevel(ref = "No"),
CPR = as.factor(CPR) %>% relevel(ref = "No"),
INF = as.factor(INF) %>% relevel(ref = "No"),
LOC = as.factor(LOC) %>%
relevel(ref = "No Coma or Deep Stupor"))Homework 4 Answers
BSTA 513/613
Questions Part 1
Question 1
In this problem, we will practice performing model diagnostics in a logistic regression model. You will need to download and source the Logistic_Dx_Functions.R file from the shared Data folder.
This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).
A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.
You will need to use some of the mutations implemented in HW 2, Q2, Part d.
We will use the following model: \[\text{logit}(\pi(\textbf{X}))=\beta_0 + \beta_1 \cdot I(CAN=\text{``Yes"}) + \beta_2 \cdot I(CPR=\text{``Yes"}) + \\ \beta_3 \cdot I(INF=\text{``Yes"})\]
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | −1.924 | 0.284 | −6.764 | 0.000 | −2.521 | −1.399 |
| CANYes | 0.093 | 0.614 | 0.151 | 0.880 | −1.246 | 1.219 |
| CPRYes | 1.514 | 0.602 | 2.516 | 0.012 | 0.326 | 2.730 |
| INFYes | 0.807 | 0.371 | 2.176 | 0.030 | 0.084 | 1.547 |
Rows: 7
Columns: 16
$ `(Intercept)` <dbl> 1, 1, 1, 1, 1, 1, 1
$ CANYes <dbl> 0, 0, 1, 0, 1, 0, 1
$ CPRYes <dbl> 0, 0, 0, 1, 0, 1, 1
$ INFYes <dbl> 1, 0, 1, 0, 0, 1, 1
$ y <dbl> 17, 12, 1, 1, 3, 6, 0
$ P <dbl> 0.2465297, 0.1273696, 0.2641879, 0.3989183, 0.1380566, 0…
$ n <int> 69, 99, 6, 4, 13, 8, 1
$ yhat <dbl> 17.0105460, 12.6095911, 1.5851272, 1.5956732, 1.7947357,…
$ Pr <dbl> -0.002945748, -0.183769303, -0.541794689, -0.608232077, …
$ dr <dbl> -0.002945953, -0.185061578, -0.568559201, -0.627265966, …
$ h <dbl> 0.013276657, 0.008995739, 0.074100939, 0.093495654, 0.04…
$ sPr <dbl> -0.0029655, -0.1846015, -0.5630577, -0.6388286, 0.991157…
$ sdr <dbl> -0.002965706, -0.185899619, -0.590872626, -0.658820007, …
$ dChisq <dbl> 8.794188e-06, 3.407771e-02, 3.170340e-01, 4.081020e-01, …
$ dDev <dbl> 8.795411e-06, 3.455867e-02, 3.491305e-01, 4.340438e-01, …
$ dBhat <dbl> 1.183284e-07, 3.093369e-04, 2.537265e-02, 4.209110e-02, …
Part a
How many covariate patterns does the regression equation have?
Not given
Part b
Plot the change in standardized Pearson residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.
ggplot(dx_icu) +
geom_hline(yintercept = 4, col = "red", linetype="dashed", linewidth=1) +
geom_point(aes(x=P, y=dChisq), size = 3) +
xlab("Estimated/Predicted Probability of Death") +
ylab("Change in Std. Pearson Residual") +
theme(text = element_text(size = 18)) +
xlim(0,1)Part c
Plot the change in standardized Deviance residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.
Not given
Part d
Plot the change in coefficient estimate by predicted probability. Do you notice any influential points? Explain your reasoning.
Not given
Part e
Plot the leverage by predicted probability. Do you notice any influential points? Explain your reasoning.
Not given
Question 2
In this problem, we will practice fitting and interpreting a log-binomial regression.
This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).
A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.
You will need to use some of the mutations implemented in HW 2, Q2, Part d.
Part a
Write down the population equation for the log-binomial regression model of STA on AGE, CAN, CPR, and INF. How many parameters does this model contain?
This model contains 4 parameters.
Part b
Try using glm() to obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a. Do you run into any issues using glm()? If you try logbin(), does it fix the issues? Explain why and what warnings logbin() gives you.
Hint: keep the glm() function in its own code chunk so you can add #| eval: false. glm() may throw an error, so we want to show the work for glm() even though it’ll break your qmd rendering.
glm() does not work. logbin() has warnings.
Part c
Using logbin(), obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a, but now take out age. Using these estimates, write down the equation with the fitted values.
Call:
logbin(formula = STA ~ CAN + CPR + INF, data = icu2)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4845 -0.7570 -0.5297 -0.5197 2.0886
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.0332 0.2374 -8.566 < 2e-16 ***
CANYes -0.1478 0.4826 -0.306 0.759330
CPRYes 0.9859 0.2874 3.430 0.000603 ***
INFYes 0.6434 0.2910 2.211 0.027045 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 186.90 on 196 degrees of freedom
AIC: 194.9
AIC_c: 195.1
Number of iterations: 21093 (best: 480)
Part d
Interpret the exponential of the coefficient (risk ratio) estimate for CPR.
Not given
Part e
We were not able to fit a model with age, but let’s just entertain a scenario here. Let’s say we fit the model in Part a and got a coefficient estimate of 0.29 with a 95% confidence interval of 0.23 to 0.35. Using the model is Part a, interpret the exponential of the coefficient (risk ratio) estimate for AGE.
Not given
Questions Part 2
Question 3
For each of the following outcomes, what type of regression would you use? Explain your answer.
Part a
Number of minutes of moderate-to-vigorous physical activity per week (range: 0 to 600+)
Which regression model is most appropriate?
- Linear regression
- Logistic regression
- Log-binomial regression
- Poisson regression
- Multinomial logistic regression
- Linear regression
Part b
Whether the participant meets CDC recommendations for weekly physical activity (Yes/No)
Which regression model is most appropriate?
- Linear regression
- Logistic regression
- Log-binomial regression
- Poisson regression
- Multinomial logistic regression
Not given
Part c
Number of workouts the person completed in the past week (values: 0, 1, 2, …, 14)
Which regression model is most appropriate?
- Linear regression
- Logistic regression
- Log-binomial regression
- Poisson regression
- Multinomial logistic regression
Not given
Part d
Self-reported activity level: Sedentary, Moderately active, Highly active
Which regression model is most appropriate?
- Linear regression
- Logistic regression
- Log-binomial regression
- Poisson regression
- Multinomial logistic regression
Not given