Homework 5

BSTA 513/613

Due: Thursday June 6, 2024 at 11pm

Author

Your name here - update this!!!!

Published

June 6, 2024

Modified

June 4, 2024

Directions

Download the .qmd file here.
You will need to download the datasets. Use this link to download the homework datasets needed in this assignment. If you do not want to make changes to the paths set in this document, then make sure the files are stored in a folder named “data” that is housed in the same location as this homework .qmd file.
Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file
- Please rename you homework as Lastname_Firstinitial_HW5.qmd. This will help organize the homeworks when the TAs grade them.
- Please also add the following line under subtitle: "BSTA 513/613": author: First-name Last-name with your first and last name so it is attached to the viewable document.
For each question, make sure to include all code and resulting output in the html file to support your answers.
Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file. This is the default setting.
If you are computing something by hand, you may take a picture of your work and insert the image in this file. You may also use LaTeX to write it inline.
Write all answers in complete sentences as if communicating the results to a collaborator. This means including a sentence summarizing results in the context of the research study.

Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your qmd file and rendering frequently helps you catch your errors more quickly.

Questions

Question 1

This question stems from an example from an online textbook by Dr. Ramzi W. Nahhas. The dataset for this problem includes a subset of individuals from the 2019 National Survey on Drug Use and Health (NSDUH). Overall, our study aims included investigating potential risk factors for lifetime heroin use. Lifetime heroin use is a binary outcome, which we regress on age at first use of alcohol (alc_agefirst), age with 6 categories (demog_age_cat6), and sex assigned at birth (demog_sex).

load(here("data", "nsduh2019_adult_sub_rmph.RData"))
nsduh = nsduh_adult_sub %>% 
  dplyr::select(her_lifetime, alc_agefirst, demog_age_cat6, demog_sex) %>% 
  drop_na()

Part a

Using the nsduh dataset from the above chunk of code, please run a regression model and present the model summary using lifetime heroin use as our outcome, and age at first use of alcohol, categorical age, and sex assigned at birth as covariates in our model. No need to write out your model, you just need to write the R code to run it.

Part b

Are we encountering a numerical problem with our regression? If yes, please name the numerical issue. What first clued you into that issue? Provide conclusive evidence of this numerical issue (with a contingency table), and explain which variable(s) are causing this problem.

Part c

What would you do to “fix” this numerical issue? Please apply your “fix” and rerun the regression

Question 2

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available within Course Materials. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE. We will be building to a multivariable logistic regression model while adjusting for cancer part of the present problem (CAN), CPR prior to ICU admission (CPR), infection probable at ICU admission (INF), and level of consciousness at ICU admission (LOC).

We will use the model from Homework 4 Question 1a for this question: \[\text{logit}(\pi(\textbf{X}))=\beta_0 + \beta_1 \cdot AGE + \beta_2 \cdot CAN + \beta_3 \cdot CPR + \\ \beta_4 \cdot INF\]

icu = read_csv(here("data", "icu.csv"))
icu1 = icu %>% mutate(STA = as.factor(STA) %>% relevel(ref = "Lived"))
icu2 = icu1 %>% mutate(CAN = as.factor(CAN) %>% relevel(ref = "No"), 
                     CPR = as.factor(CPR) %>% relevel(ref = "No"), 
                     INF = as.factor(INF) %>% relevel(ref = "No"), 
                     LOC = as.factor(LOC) %>% 
                       relevel(ref = "No Coma or Deep Stupor"))

Part a

Assess the fit of the above model. You may use Hosmer-Lemeshow test, or Pearson Residual as appropriate. Discuss your choice and interpret.

Part b

Assess the your models ability to discriminate vital status (STA) using AUC.

Part c

Let’s say a colleague found a different preliminary final model than yours. Using the below model that your colleague found, compare your model to theirs using AIC and BIC.

coll_model = glm(STA ~ SYS + AGE + CPR + INF + AGE*CPR, 
                 data = icu2, family = "binomial")
coll_model


Call:  glm(formula = STA ~ SYS + AGE + CPR + INF + AGE * CPR, family = "binomial", 
    data = icu2)

Coefficients:
(Intercept)          SYS          AGE       CPRYes       INFYes   AGE:CPRYes  
   -1.47960     -0.01343      0.02340     -3.37369      0.53449      0.08370  

Degrees of Freedom: 199 Total (i.e. Null);  194 Residual
Null Deviance:      200.2 
Residual Deviance: 172.5    AIC: 184.5