Homework 4

BSTA 513/613

Author

Your name here - update this!!!!

Modified

May 23, 2025

Caution

This is ready to go! (5/23)

Purpose

This homework is designed to help you practice the following important skills and knowledge that we covered in Lessons 13 and 15-16:

  • Visually assessing observations for which the model does not fit well
  • Selecting the appropriate regression model based on outcome type
  • Fitting and interpreting log-binomial regressions

Directions

  • Download the .qmd file here.

  • You will need to download the datasets from our shared folder.

  • Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file.

    • Please rename your homework as Lastname_Firstinitial_HW04.qmd. This will help organize the homeworks when the TAs grade them.
  • For each question, make sure to include all code and resulting output in the html file to support your answers.

  • Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file. This is the default setting.

Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your qmd file and rendering frequently helps you catch your errors more quickly.

Questions Part 1

Question 1

In this problem, we will practice performing model diagnostics in a logistic regression model. You will need to download and source the Logistic_Dx_Functions.R file from the shared Data folder.

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

You will need to use some of the mutations implemented in HW 2, Q2, Part d.

We will use the following model: \[\text{logit}(\pi(\textbf{X}))=\beta_0 + \beta_1 \cdot I(CAN=\text{``Yes''}) + \beta_2 \cdot I(CPR=\text{``Yes''}) + \beta_3 \cdot I(INF=\text{``Yes''})\]

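library(tidyverse)  # read_csv(), mutate(), %>%
library(here)       # here() builds file paths relative to the project root

# Read the ICU data and make "Lived" the reference level of the outcome
icu = read_csv(here("data", "icu.csv"))
icu1 = icu %>% mutate(STA = as.factor(STA) %>% relevel(ref = "Lived"))
# Convert the categorical covariates to factors and set their reference levels
icu2 = icu1 %>% mutate(CAN = as.factor(CAN) %>% relevel(ref = "No"), 
                     CPR = as.factor(CPR) %>% relevel(ref = "No"), 
                     INF = as.factor(INF) %>% relevel(ref = "No"), 
                     LOC = as.factor(LOC) %>% 
                       relevel(ref = "No Coma or Deep Stupor"))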

Part a

How many covariate patterns does the regression equation have?
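As a reminder of the idea, a covariate pattern is one distinct combination of covariate values that appears in the data. The sketch below counts patterns on a small made-up data frame; the object `toy` and its values are hypothetical and are not the ICU data, so the count it produces is not the answer to this question.

```r
library(dplyr)

# Hypothetical toy data (NOT the ICU data): three binary covariates
toy <- tibble(
  CAN = c("No", "No", "Yes", "No", "Yes", "No"),
  CPR = c("No", "No", "No", "Yes", "No", "No"),
  INF = c("Yes", "Yes", "No", "No", "No", "Yes")
)

# One row per distinct combination; m_j is the number of subjects in pattern j
patterns <- toy %>% count(CAN, CPR, INF, name = "m_j")
n_patterns <- nrow(patterns)

print(patterns)
print(n_patterns)
```

The same `count()` call should carry over to any set of categorical covariates, as long as only the variables that appear in the model are listed.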

Part b

Plot the change in standardized Pearson residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.

Part c

Plot the change in standardized Deviance residual by predicted probability. Do you notice any potential outliers? Explain your reasoning.
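For orientation only, here is a minimal base R sketch of residual-versus-predicted-probability plots on simulated data. It does not use the course's Logistic_Dx_Functions.R file, whose functions are not shown here. The data frame `sim` and the variables `x1`, `x2`, and `y` are hypothetical. Note that these residuals are computed per observation, whereas diagnostics in the Hosmer and Lemeshow style are usually computed per covariate pattern, so the resulting plots may look different from the ones produced with the course functions.

```r
set.seed(513)  # arbitrary seed so the simulation is reproducible

# Simulated data (hypothetical, not the ICU data)
n <- 200
sim <- data.frame(x1 = rbinom(n, 1, 0.3), x2 = rbinom(n, 1, 0.2))
sim$y <- rbinom(n, 1, plogis(-1.5 + 0.8 * sim$x1 + 1.2 * sim$x2))

fit <- glm(y ~ x1 + x2, data = sim, family = binomial)

p_hat <- fitted(fit)                        # predicted probabilities
r_sp <- rstandard(fit, type = "pearson")    # standardized Pearson residuals
r_sd <- rstandard(fit, type = "deviance")   # standardized deviance residuals

# Squared standardized residuals against predicted probability
op <- par(mfrow = c(1, 2))
plot(p_hat, r_sp^2, xlab = "Predicted probability",
     ylab = "Squared standardized Pearson residual")
plot(p_hat, r_sd^2, xlab = "Predicted probability",
     ylab = "Squared standardized deviance residual")
par(op)
```

Points that sit far above the rest of the cloud in either panel are the ones worth a closer look.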

Part d

Plot the change in coefficient estimate by predicted probability. Do you notice any influential points? Explain your reasoning.

Part e

Plot the leverage by predicted probability. Do you notice any influential points? Explain your reasoning.
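The influence and leverage plots from Parts d and e can be sketched in base R as well. This is again a per-observation illustration on simulated data, not the course's diagnostic functions. The names `sim`, `x1`, `x2`, `y`, and `delta_beta` are hypothetical, and the change-in-coefficient measure is assumed to take the form of the squared standardized Pearson residual times h / (1 - h), where h is the leverage.

```r
set.seed(513)  # arbitrary seed so the simulation is reproducible

# Simulated data (hypothetical, not the ICU data)
n <- 200
sim <- data.frame(x1 = rbinom(n, 1, 0.3), x2 = rbinom(n, 1, 0.2))
sim$y <- rbinom(n, 1, plogis(-1.5 + 0.8 * sim$x1 + 1.2 * sim$x2))

fit <- glm(y ~ x1 + x2, data = sim, family = binomial)

p_hat <- fitted(fit)                       # predicted probabilities
h <- hatvalues(fit)                        # leverage (diagonal of the hat matrix)
r_sp <- rstandard(fit, type = "pearson")   # standardized Pearson residuals

# Assumed form of the change-in-coefficient measure: r_sp^2 * h / (1 - h)
delta_beta <- r_sp^2 * h / (1 - h)

op <- par(mfrow = c(1, 2))
plot(p_hat, delta_beta, xlab = "Predicted probability",
     ylab = "Change in coefficient estimates")
plot(p_hat, h, xlab = "Predicted probability", ylab = "Leverage")
par(op)
```

Under this assumed form, `delta_beta` equals `cooks.distance(fit)` multiplied by the number of coefficients, so Cook's distance gives the same ranking of observations.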

Question 2

In this problem, we will practice fitting and interpreting a log-binomial regression.

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available in our shared folder. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital (survival) status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will build to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

You will need to use some of the mutations implemented in HW 2, Q2, Part d.

Part a

Write down the population equation for the log-binomial regression model of STA on AGE, CAN, CPR, and INF. How many parameters does this model contain?

Part b

Try using glm() to obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a. Do you run into any issues using glm()? If you try logbin(), does it fix the issues? Explain why and what warnings logbin() gives you.

Hint: keep the glm() call in its own code chunk and add #| eval: false to that chunk. glm() may throw an error that would stop your qmd from rendering, and this option lets you show the glm() code without running it.
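As a rough illustration of the two fitting routes, the sketch below fits a log-binomial model to simulated data with a single binary covariate. Everything in it is an assumption made for illustration: the data frame `sim`, the variables `x` and `y`, and the starting values passed to glm(). The logbin() call only runs if the logbin package is installed. This example was chosen so that glm() converges, which will not necessarily be the case for the ICU model.

```r
set.seed(613)  # arbitrary seed so the simulation is reproducible

# Simulated data (hypothetical): true risk ratio for x is exp(0.7)
n <- 500
x <- rbinom(n, 1, 0.4)
y <- rbinom(n, 1, exp(-2 + 0.7 * x))
sim <- data.frame(x = x, y = y)

# Log-binomial fit with glm(): binomial family with a log link.
# tryCatch() keeps a possible error from stopping the script.
fit_lb <- tryCatch(
  glm(y ~ x, data = sim, family = binomial(link = "log"),
      start = c(log(mean(sim$y)), 0)),  # starting values are an assumption
  error = function(e) {
    message("glm() failed: ", conditionMessage(e))
    NULL
  }
)

if (!is.null(fit_lb)) {
  print(exp(coef(fit_lb)))  # exponentiated coefficients
}

# Same model with logbin(), if the package is available
if (requireNamespace("logbin", quietly = TRUE)) {
  fit_logbin <- logbin::logbin(y ~ x, data = sim)
  print(exp(coef(fit_logbin)))
}
```

With one binary covariate the model is saturated, so the estimated risk ratio should match the ratio of the two observed proportions.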

Part c

Using logbin(), obtain the maximum likelihood estimates of the parameters of the log-binomial regression model in Part a, but now take out age. Using these estimates, write down the equation with the fitted values.

Part d

Interpret the exponential of the coefficient (risk ratio) estimate for CPR.
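As a reminder of why the exponentiated coefficient is a risk ratio, consider a generic log-binomial model with a binary covariate \(X_1\) and other covariates \(X_2, \ldots, X_p\) held fixed. The symbols here are generic and are not tied to the ICU variables.

\[\log(\pi(\textbf{X})) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\]

\[\frac{\pi(X_1 = 1, X_2, \ldots, X_p)}{\pi(X_1 = 0, X_2, \ldots, X_p)} = \frac{\exp(\beta_0 + \beta_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}{\exp(\beta_0 + \beta_2 X_2 + \cdots + \beta_p X_p)} = \exp(\beta_1)\]

So \(\exp(\beta_1)\) compares the risk of the outcome when \(X_1 = 1\) to the risk when \(X_1 = 0\), with the other covariates held constant.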

Part e

We were not able to fit a model with age, but let’s just entertain a scenario here. Let’s say we fit the model in Part a and got a coefficient estimate for AGE of 0.29 with a 95% confidence interval of 0.23 to 0.35. Using the model in Part a, interpret the exponential of the coefficient (risk ratio) estimate for AGE.
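The arithmetic for moving from the log scale to the risk ratio scale can be sketched as follows. The numbers are the hypothetical values given in this part, and the object names `log_rr` and `rr` are made up for illustration.

```r
# Hypothetical coefficient estimate and 95% CI limits on the log scale
log_rr <- c(lower = 0.23, estimate = 0.29, upper = 0.35)

# Exponentiate the estimate and both CI limits to get the risk ratio scale
rr <- exp(log_rr)
print(round(rr, 2))
```

Exponentiating the confidence limits directly works because exp() is a monotone increasing transformation.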

Questions Part 2

Question 3

For each of the following outcomes, what type of regression would you use? Explain your answer.
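For reference, the sketch below shows one way each of the listed model types might be fit in R. It uses simulated data and deliberately generic outcome names (`y_cont`, `y_bin`, `y_count`, `y_cat`), all of which are hypothetical, and it does not indicate which model matches which outcome in the parts that follow. The multinomial fit uses multinom() from the nnet package and only runs if that package is installed.

```r
set.seed(16)  # arbitrary seed so the simulation is reproducible

# Simulated data (hypothetical): one binary covariate g, four outcome types
n <- 300
g <- rbinom(n, 1, 0.5)
dat <- data.frame(
  g = g,
  y_cont = 2 + 0.5 * g + rnorm(n),                 # continuous
  y_bin = rbinom(n, 1, 0.2 + 0.2 * g),             # binary
  y_count = rpois(n, exp(0.3 + 0.4 * g)),          # count
  y_cat = factor(sample(c("A", "B", "C"), n, replace = TRUE))  # 3 categories
)

fit_linear <- lm(y_cont ~ g, data = dat)
fit_logistic <- glm(y_bin ~ g, data = dat, family = binomial(link = "logit"))
fit_logbinomial <- glm(y_bin ~ g, data = dat, family = binomial(link = "log"))
fit_poisson <- glm(y_count ~ g, data = dat, family = poisson(link = "log"))

# Multinomial logistic regression, if nnet is available
if (requireNamespace("nnet", quietly = TRUE)) {
  fit_multinom <- nnet::multinom(y_cat ~ g, data = dat, trace = FALSE)
}
```

The logistic and log-binomial fits use the same binary outcome and differ only in the link function, which is what changes the interpretation from odds ratios to risk ratios.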

Part a

Number of minutes of moderate-to-vigorous physical activity per week (range: 0 to 600+)

Which regression model is most appropriate?

  1. Linear regression
  2. Logistic regression
  3. Log-binomial regression
  4. Poisson regression
  5. Multinomial logistic regression

Part b

Whether the participant meets CDC recommendations for weekly physical activity (Yes/No)

Which regression model is most appropriate?

  1. Linear regression
  2. Logistic regression
  3. Log-binomial regression
  4. Poisson regression
  5. Multinomial logistic regression

Part c

Number of workouts the person completed in the past week (values: 0, 1, 2, …, 14)

Which regression model is most appropriate?

  1. Linear regression
  2. Logistic regression
  3. Log-binomial regression
  4. Poisson regression
  5. Multinomial logistic regression

Part d

Self-reported activity level: Sedentary, Moderately active, Highly active

Which regression model is most appropriate?

  1. Linear regression
  2. Logistic regression
  3. Log-binomial regression
  4. Poisson regression
  5. Multinomial logistic regression