Homework 2

BSTA 512/612

Author

Your name here - update this!!!!

Modified

January 26, 2026

Directions

Download the .qmd file here.
You will need to download the datasets. Use this link to download the datasets needed in this assignment. If you do not want to make changes to the paths set in this document, then make sure the files are stored in a folder named “data” that is housed in the same location as your homework .qmd file.
Please upload your homework to Sakai. Upload both your .qmd code file and the rendered .html file
For each question, make sure to include all code and resulting output in the html file to support your answers
Show the work of your calculations using R code within a code chunk. Make sure that both your code and output are visible in the rendered html file. This is the default setting.
Write all answers in complete sentences as if communicating the results to a collaborator.

Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your .qmd file and rendering frequently helps you catch your errors more quickly.

Questions

Question 1

This homework assignment is based on data collected as part of an observational study of patients who suffered from stroke.

Dataset: The main goal was to study various psychological factors: optimism, fatalism, depression, spirituality, and their relationship with stroke severity and other health outcomes among the study participants. Data were collected using questionnaires during a baseline interview and also medical chart review. More information about this study can be found in the article Fatalism, optimism, spirituality, depressive symptoms and stroke outcome: a population based analysis.

The dataset that you will work with is called completedata.sas7bdat. It is SIMILAR but does not exactly match the data in the article. It contains information on complete cases (i.e. excludes participants who had missing data on one or more variables of interest) who suffered a stroke. The two variables we are interested in are:

Covariate: Fatalism (larger values indicate that the individual feels less control of their life)
- Scores range from 8 to 40
Outcome: Depression (larger values imply increased depression)
- Scores range from 0 to 27

For our homework purposes we will assume they are continuous.

fatal_dep = read_sas(here("./data/completedata.sas7bdat"))

Part a

Plot the data, with title and axis labels, for Depression (y-axis) vs. Fatalism (x-axis). Comment on what you see.

Part b

Fit a linear regression model to estimate the association between the predictor Fatalism and the outcome Depression.

Interpret the slope and intercept. Does the intercept make sense?

Note

Make sure to include the confidence interval. Whenever asked to interpret coefficients, you must include confidence intervals. Also, the “units” for fatalism and depression are scores.

Part c

In your dataset, make a new variable FatalismC, equal to Fatalism centered at its median (C is for centered).

\[ \text{FatalismC} = \text{Fatalism} - \text{median of Fatalism} \]

This is one way of centering a variable, and can be used when the intercept estimate does not make sense. (Hint: the mutate() function will work well here!)

Plot the data, with title and axis labels, for Depression (y-axis) vs. FatalismC (x-axis).

Part d

Re-run the regression from Part b using this new variable for FatalismC. Interpret the new slope and intercept. Which of the following are the same as Part b: intercept, slope?

Note

Make sure to include the confidence interval. Whenever asked to interpret coefficients, you must include confidence intervals. Also, the “units” for the centered fatalism is still the score.

Part e

From the above interpretations, what would be the equivalent conclusion from a hypothesis test for the association between Depression and Fatalism?

Note

You do not need to go through the whole process for the hypothesis test. You only need to state whether it is rejected or not and site the confidence interval as evidence.

Question 2

For this problem we will be using the penguins dataset from the palmerpenguins R package. We will look at the association between flipper length of penguins (measured in mm) and specific species of penguins.

Description from help file:

Includes measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.

More info about the data.

# first install the palmerpenguins package
# install.packages("palmerpenguins")
library(palmerpenguins)


Attaching package: 'palmerpenguins'

The following objects are masked from 'package:datasets':

    penguins, penguins_raw

data(penguins)

# run the command below to learn more about the variables in the penguins dataset
# ?penguins

Part a

Calculate the average flipper lengths stratified by each of the penguin species.

Part b

Make a scatterplot (with jitter) of flipper lengths by species, and include diamond-shape points for the averages of the flipper lengths for each of the species.

Part c

Write out the fitted regression equation that models the flipper length by penguin species. Use LaTeX math markup or insert an image of your equation. Do not yet insert values for the regression coefficients, i.e. use the generic coefficients \(\widehat{\beta}_0, \widehat{\beta}_1\). Use Adelie as the reference level.

Part d

Run the linear regression of flipper lengths vs. species in R, and display the regression table output. Which species did R choose as the reference level, and how did you determine this?

Part e

How do we interpret each of the regression coefficients for this model? Write out a separate interpretation for each of the coefficients.

Part f

Calculate the mean flipper length (and the 95% CI) of penguins in the Chinstrap and Gentoo species using the predict() function.