Homework 5

Lessons 7-9

Author

Your name here - update this!!!!

Modified

July 11, 2025

Directions

Please turn in this homework on Sakai. This homework must be submitted using a Quarto document. Please keep it rendered as an html! I know past homeworks said pdf, but all Quarto docs should be rendered as html for this class!

You can download the .qmd file for this assignment from Github

Tip

It is a good idea to try rendering your document from time to time as you go along! Note that rendering automatically saves your Qmd file and rendering frequently helps you catch your errors more quickly.

0.1 Complete the mid-quarter feedback

https://forms.office.com/Pages/ResponsePage.aspx?id=V3lz4rj6fk2U9pvWr59xWFMopmPUjRtDiHLswhgacNhURVc3WkpWQVk5OVQ1OUhSMkg0UUZLUjZJOC4u

Book exercises

1.24 Income and education in US counties

1.28 Mix-and-match

1.36 Associations

4.2 Heights of adults

Please note that the wording of this problem is a little confusing. The textbook uses the phrase “sample distribution,” which is very close to the phrasing we’ve used to discuss the “sampling distribution” to refer to the sampling distribution of the sample means. However, they are using “sample distribution” to refer to the distribution of the sample of 507 active individuals.

Important

Try using LaTeX code to write out any math work that you would use in part c and part e!! You can insert an image of your written if you would like.

  • Hint for part a and b: point estimates are sample statistics
  • Hint for part c: It would be helpful to calculate the z-score
  • Hint for part d: Would it be exactly the same?
  • Hint for part e: Using the standard deviation from our sample, can we get an estimate of the standard error for the sampling distribution of the sample means? Compute that standard error.

4.3 Hen eggs

1 R exercises

1.1 Load all the packages you need below here

1.2 R1: NHANES

  • Below you will be using the dataset called NHANES from the NHANES R package.
  • Install and load the NHANES package using the code below.
    • This loads the dataset also called NHANES that is within the NHANES package.
install.packages("NHANES")
library(NHANES)
data("NHANES")

The National Health and Nutrition Examination Survey (NHANES) is a survey conducted annually by the US National Center for Health Statistics (NCHS). While the original data uses a survey design that oversamples certain subpopulations, the data have been reweighted to undo oversampling effects and can be treated as if it were a simple random sample from the American population.

  • To view the complete list of study variables and their descriptions, access the NHANES documentation page with ?NHANES.
    • You must first install the NHANES package to see the help files.
Warning
  • For most of the summary statistic base R commands (such as mean(), sd(), median(), etc.), you will get NA as the result if there are missing values.
  • In order for R to compute the statistic using the values in the data set, you need to tell R to remove the missing values using the na.rm = TRUE option within the parentheses of the command: mean(dataset$variablename, na.rm = TRUE).

1.2.1 What are the dimensions and column names of the dataset?

Hint: Use functions covered in the R lesson on Basics in R (part 2)

1.2.2 How many unique ID identifiers are in the dataset? Compare this to the number of rows in the dataset. What is the explanation for these two different numbers?

This will require a new function called unique(). For example, if I want the unique ages (from variable Age) from the dataset, I can use unique(NHANES$Age)

Then I can use the function, length() to see how long the list of unique IDs is. length(unique(NHANES$Age))

1.2.3 Using numerical summaries and data visualization, describe the distribution of ages among study participants.

You don’t need to need to use exact stat verbage for this one. Think: is it evenly distributed? Does it trail off? Does it seem like most ages are represented equally? Is there a reason why there’s more 80yo’s than 79 yo’s?

1.2.4 Using numerical and graphical summaries, describe the distribution of heights among study participants.

For this one, we learned a few more terms and phrases to describe data. Try them out!

1.2.5 Calculate the median and interquartile range of the distribution of the variable Poverty

Write a sentence explaining the median and IQR in the context of these data. Make sure to look up what Poverty means in this dataset so you can give the appropriate context!

1.2.6 Investigate at which age people generally reach their adult height.

You can use whatever data visualization tool to look at this. Hint: age and height are both numeric variables!

1.2.7 Investigate the relationship between trouble sleeping and hours slept.

This may require you to use a few options to visualize the data! Also, hours slept is numeric, but there’s only 11 unique values. It might be interesting to try out the visualization methods for two categorical variables.