Homework 2

BSTA 513/613

Due: Thursday April 25, 2024 at 11pm

Author

Your name here - update this!!!!

Published

April 25, 2024

Modified

April 1, 2024

Purpose

This homework is designed to help you practice the following important skills and knowledge that we covered in Lessons 3-6:

Calculate and interpret the estimated risk difference, relative risk, and odds ratios, and their confidence intervals
Expand work on contingency tables to evaluate the agreement or reproducibility using Cohen’s Kappa
Understand important differences between linear regression and logistic regression
Construct a simple logistic regression model
Test a covariate for significance using the Wald test and LRT

Directions

Download the .qmd file here.
You will need to download the datasets. Use this link to download the homework datasets needed in this assignment. If you do not want to make changes to the paths set in this document, then make sure the files are stored in a folder named “data” that is housed in the same location as this homework .qmd file.
Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file
- Please rename you homework as Lastname_Firstinitial_HW1.qmd. This will help organize the homeworks when the TAs grade them.
- Please also add the following line under subtitle: “BSTA 513/613”: author: First-name Last-name with your first and last name so it is attached to the viewable document.
For each question, make sure to include all code and resulting output in the html file to support your answers
Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file. This is the default setting.
Write all answers in complete sentences as if communicating the results to a collaborator.

Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your qmd file and rendering frequently helps you catch your errors more quickly.

Questions

Question 1

A study looked at the effects of oral contraceptive (OC) use on heart disease in women 40 to 44 years of age. The researchers prospectively tracked whether or not the women developed a myocardial infarction (MI) over a 3-year period. The table below summarizes their results with columns indicating whether or not women developed MI and rows indicating their OC use.

	Yes	No
OC users	13	4987
Non-OC users	7	9993

Part a

Compute the estimated risk difference comparing OC users to non-OC users. Include a 95% CI for the estimate and interpretation of the estimated value.

Part b

Compute the estimated relative risk comparing OC users to non-OC users. Include a 95% CI for the estimate and interpretation of the estimated value.

Part c

Compute the estimated odds ratio comparing OC users to non-OC users. Include a 95% CI for the estimate and interpretation of the estimated value.

Part d

Is the OR a good approximation of the RR? Explain why or why not.

Question 2

One important aspect of medical diagnosis is its reproducibility. Suppose that two different doctors examine 100 patients for dyspnea in a respiratory-disease clinic, and that doctor A diagnosed 15 patients as having dyspnea (while doctor B did not), doctor B diagnosed 10 patients as having dyspnea (while doctor A did not), and both doctor A and doctor B diagnosed 7 patients as having dyspnea.

Part a

Construct a two-way contingency table to summarize the dyspnea diagnoses from doctor A and B.

Part b

Compute the Cohen’s kappa, 95% confidence interval, and interpret the results. What level of agreement does your kappa indicate?

Question 3

This question is meant to emphasize the differences between linear regression and logistic regression. Each part will ask about different aspects of the two regression models. If the question has multiple choice answers, then you must write 1-2 sentences justifying your answer.

Part a

In linear regression, what type of variable is our response/outcome variable?

binary
continuous
count
ordinal

Part b

In logistic regression, what type of variable is our response/outcome variable?

binary
continuous
count
normal

Part c

What assumptions in linear regression? Please state the assumption name and characteristics of that assumption.

Part d

Which assumptions of linear regression are violated if we try to fit a binary response using linear regression? You may choose more than one answer.

Independence
Linearity
Normality
Homoscedasticity

Part e

Please use our notes on generalized linear models (GLMs) to answer this question. What is the random component used in linear regression? What is the random component used in logistic regression? Name the specific variable type and distribution for it.

Part f

Please use our notes on generalized linear models (GLMs) to answer this question. What link function do we use in linear regression? What link function do we use in logistic regression? Name the link and write out the function.

Part g

Please use our notes on generalized linear models (GLMs) to answer this question. What is the systematic component used in simple linear regression? What is the systematic component used in simple logistic regression? Please write out the function for the systematic component for a single covariate.

Part h

How do we determine our coefficient estimates (estimates of the parameter values) in linear regression? You may choose more than one answer.

Ordinary least squares
Maximum likelihood estimation

Part i

How do we determine our coefficient estimates (estimates of the parameter values) in logistic regression? You may choose more than one answer.

Ordinary least squares
Maximum likelihood estimation

Question 4

This question is taken from the Hosmer and Lemeshow textbook. The ICU study data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The dataset should be available within Course Materials. The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients. In this question, the primary outcome variable is vital status at hospital discharge, STA. Clinicians associated with the study felt that a key determinant of survival was the patient’s age at admission, AGE.

A code sheet for the variables to be considered is displayed in Table 1.5 below (from the Hosmer and Lemeshow textbook, pg. 23). We refer to this data set as the ICU data.

Part a

Write down the equation for the logistic regression model of STA on AGE. What characteristic of the outcome variable, STA, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between STA and AGE?

Part b

Write down an expression for the log-likelihood for the logistic regression model in Part a. This will me a mathematical expression. Please do not use generic expressions like \(\pi(X)\), instead replace \(X\) with the specific variables in this question.

Part c

Using the glm() function, obtain the maximum likelihood estimates of the coefficient parameters of the logistic regression model in Part a. Using these estimates, write down fitted logistic regression model.

Part d

Use the Wald test to test whether or not the intercept (\(\beta_0\)) of the logistic regression model is significantly different from 0. Make sure to include your: hypothesis test, code/work leading to the computed test statistic, output including the test statistic and p-value, and conclusion. Please refer to the Additional Tips to guide you on what a complete/correct answer contains.

Part e

Use the Likelihood Ratio test to test whether or not the coefficient for age (\(\beta_1\)) of the logistic regression model is significantly different from 0. Make sure to include your: hypothesis test, code/work leading to the computed test statistic, output including the test statistic and p-value, and conclusion. Please refer to the Additional Tips to guide you on what a complete/correct answer contains. You do not need to include an interpretation of the coefficient since we have not covered this.