Homework 5 Answers

EPI 525

Modified

October 31, 2024

Warning

To see my math equations properly, you need to download the HTML file and then open it. OneDrive does not show the math correctly!

Book exercises

1.24 Income and education in US counties

The scatterplot below shows the relationship between per capita income (in thousands of dollars) and percent of population with a bachelor’s degree in 3,143 counties in the US in 2010.

a

What are the explanatory and response variables?

b

Describe the relationship between the two variables. Make sure to discuss unusual observations, if any.

c

Can we conclude that having a bachelor’s degree increases one’s income?

1.28 Mix-and-match

Describe the distribution in the histograms below and match them to the box plots.

a) 2

b) 3

c) 1

1.36 Associations

Indicate which of the plots show

a: Positive Association

b: Negative Association

Plot 4

c: No association

4.2 Heights of adults

Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender, for 507 physically active individuals. The histogram below shows the sample distribution of heights in centimeters.

(a)

What is the point estimate for the average height of active individuals?

171.1 cm

(b)

What is the point estimate for the standard deviation of the heights of active individuals? What about the IQR?

$s=9.4$

(c)

Is a person who is 1m 80cm (180 cm) tall considered unusually tall? And is a person who is 1m 55cm (155cm) considered unusually short? Explain your reasoning.

Not considered unusually tall (or short)
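One way to make this reasoning concrete is to convert each height to a Z-score using the point estimates from parts (a) and (b). The sketch below assumes the common rule of thumb that observations more than 2 standard deviations from the mean are unusual; that cutoff is an assumption here, not something stated in the exercise.

```r
# Point estimates from parts (a) and (b)
x_bar <- 171.1  # sample mean height (cm)
s     <- 9.4    # sample standard deviation (cm)

heights <- c(tall = 180, short = 155)

# Z-score: how many standard deviations each height is from the mean
z <- (heights - x_bar) / s
round(z, 2)

# Assumed rule of thumb: |z| > 2 counts as unusual
abs(z) > 2
```

Both heights fall within 2 standard deviations of the mean, which is consistent with the answer above.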

(d)

The researchers take another random sample of physically active individuals. Would you expect the mean and the standard deviation of this new sample to be the ones given above? Explain your reasoning.

(e)

The sample means obtained are point estimates for the mean height of all active individuals, if the sample of individuals is equivalent to a simple random sample. What measure do we use to quantify the variability of such an estimate? Compute this quantity using the data from the original sample under the condition that the data are a simple random sample.

$SE =0.417$
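The value above can be reproduced with the standard error formula $SE = s/\sqrt{n}$, using $s = 9.4$ from part (b) and the sample size of 507 given in the exercise. A minimal sketch:

```r
s <- 9.4   # sample standard deviation (cm), from part (b)
n <- 507   # sample size, from the exercise description

# Standard error of the sample mean
se <- s / sqrt(n)
round(se, 3)
```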

4.3 Hen eggs

The distribution of the number of eggs laid by a certain species of hen during their breeding period is on average, 35 eggs, with a standard deviation of 18.2. Suppose a group of researchers randomly samples 45 hens of this species, counts the number of eggs laid during their breeding period, and records the sample mean. They repeat this 1,000 times, and build a distribution of sample means.

a

What is this distribution called?

b

Would you expect the shape of this distribution to be symmetric, right skewed, or left skewed? Explain your reasoning.

Symmetric

c

Calculate the variability of this distribution and state the appropriate term used to refer to this value.

$2.713$
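This value is the standard error of the sample mean, $SE = \sigma/\sqrt{n}$, with $\sigma = 18.2$ and $n = 45$ taken from the exercise. A minimal sketch of the calculation:

```r
sigma <- 18.2  # population standard deviation (eggs)
n     <- 45    # hens per sample

# Standard error of the sample mean
se <- sigma / sqrt(n)
round(se, 3)
```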

d

Suppose the researchers’ budget is reduced and they are only able to collect random samples of 10 hens. The sample mean of the number of eggs is recorded, and we repeat this 1,000 times, and build a new distribution of sample means. How will the variability of this new distribution compare to the variability of the original distribution?
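Because the standard error is $\sigma/\sqrt{n}$, a smaller sample size gives a larger standard error. The sketch below compares the two sample sizes from the exercise; the labels `original` and `reduced` are only illustrative names.

```r
sigma <- 18.2                          # population standard deviation (eggs)
n     <- c(original = 45, reduced = 10)  # hens per sample

# Standard error of the sample mean for each sample size
se <- sigma / sqrt(n)
round(se, 3)

# How many times larger the variability is with samples of 10 hens
ratio <- se[["reduced"]] / se[["original"]]
round(ratio, 2)
```

The ratio equals $\sqrt{45/10}$, so the new distribution of sample means is a little more than twice as variable as the original one.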

1 R exercises

1.1 Load all the packages you need below here

1.2 R1: NHANES

Below you will be using the dataset called NHANES from the NHANES R package.
Install and load the NHANES package using the code below.
- This loads the dataset also called NHANES that is within the NHANES package.
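A minimal sketch of the install-and-load step, assuming an internet connection for the first install. The CRAN mirror URL is one common choice, not a requirement.

```r
# Install the package only if it is not already available
if (!requireNamespace("NHANES", quietly = TRUE)) {
  install.packages("NHANES", repos = "https://cloud.r-project.org")
}

library(NHANES)  # attach the package
data(NHANES)     # make the NHANES data frame available in the workspace
```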

The National Health and Nutrition Examination Survey (NHANES) is a survey conducted annually by the US National Center for Health Statistics (NCHS). While the original data uses a survey design that oversamples certain subpopulations, the data have been reweighted to undo oversampling effects and can be treated as if they were a simple random sample from the American population.

To view the complete list of study variables and their descriptions, access the NHANES documentation page with ?NHANES.
- You must first install the NHANES package to see the help files.

Warning

For most of the summary statistic base R commands (such as mean(), sd(), median(), etc.), you will get NA as the result if there are missing values.
In order for R to compute the statistic using the values in the data set, you need to tell R to remove the missing values using the na.rm = TRUE option within the parentheses of the command: mean(dataset$variablename, na.rm = TRUE).
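A small illustration of this behavior, using a made-up vector `x` rather than a variable from the dataset:

```r
# Hypothetical values with one missing observation
x <- c(12, 15, NA, 20)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of the three observed values
sd(x, na.rm = TRUE)      # the same option works for sd()
median(x, na.rm = TRUE)  # and for median()
```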

1.2.1 What are the dimensions and column names of the dataset?

Hint: Use functions covered in the R lesson on Basics in R (part 2)

10,000 rows and 76 columns
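One possible way to get these numbers is sketched below. `toy` is a hypothetical stand-in data frame that shows what each function returns; the guarded part runs the same calls on NHANES only if the package is installed.

```r
# Hypothetical stand-in data frame, used only to show what each call returns
toy <- data.frame(ID = c(1, 1, 2), Age = c(34, 34, 51))

dim(toy)    # number of rows, then number of columns
nrow(toy)   # rows only
ncol(toy)   # columns only
names(toy)  # column names

# The same calls on the real data, if the NHANES package is available
if (requireNamespace("NHANES", quietly = TRUE)) {
  data("NHANES", package = "NHANES")
  print(dim(NHANES))
  print(names(NHANES))
}
```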

1.2.2 How many unique ID identifiers are in the dataset? Compare this to the number of rows in the dataset. What is the explanation for these two different numbers?

This will require a new function called unique(). For example, if I want the unique ages (from variable Age) from the dataset, I can use unique(NHANES$Age)

Then I can use the function length() to see how long the list of unique values is: length(unique(NHANES$Age))

6,779 unique IDs
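The comparison can be sketched on a small made-up data frame. `toy` and its values are hypothetical, and the identifier column in NHANES is assumed here to be named `ID`.

```r
# Hypothetical stand-in: six rows, but only three distinct participants
toy <- data.frame(
  ID  = c(101, 101, 102, 103, 103, 103),
  Age = c(34, 34, 51, 8, 8, 8)
)

n_rows <- nrow(toy)                # number of rows
n_ids  <- length(unique(toy$ID))   # number of distinct IDs
c(rows = n_rows, unique_ids = n_ids)

# How often each ID appears, and how many rows repeat an earlier ID
table(toy$ID)
n_repeats <- sum(duplicated(toy$ID))
n_repeats

# The same comparison on the real data, if the NHANES package is available
if (requireNamespace("NHANES", quietly = TRUE)) {
  data("NHANES", package = "NHANES")
  print(c(rows = nrow(NHANES), unique_ids = length(unique(NHANES$ID))))
}
```

When the number of unique IDs is smaller than the number of rows, some IDs must appear on more than one row.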

1.2.3 Using numerical summaries and data visualization, describe the distribution of ages among study participants.

You don’t need to use exact statistical terminology for this one. Think: is it evenly distributed? Does it trail off? Does it seem like most ages are represented equally? Is there a reason why there are more 80-year-olds than 79-year-olds?

Data visualization options:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
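The message above is what ggplot2 prints when a histogram is drawn without choosing a bin size. A minimal sketch of setting `binwidth` explicitly is below; it assumes the ggplot2 package is installed, uses a hypothetical stand-in data frame `toy`, and the 5-year bin width is an arbitrary choice to experiment with.

```r
library(ggplot2)

# Hypothetical stand-in ages; replace `toy` with NHANES to plot the real data
set.seed(525)
toy <- data.frame(Age = sample(0:80, size = 500, replace = TRUE))

# Numerical summary of the variable
summary(toy$Age)

# Histogram with an explicit bin width instead of the default 30 bins
p <- ggplot(toy, aes(x = Age)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white")
p
```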

1.2.4 Using numerical and graphical summaries, describe the distribution of heights among study participants.

For this one, we learned a few more terms and phrases to describe data. Try them out!

Data visualization options:

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.2.5 Calculate the median and interquartile range of the distribution of the variable `Poverty`

Write a sentence explaining the median and IQR in the context of these data. Make sure to look up what Poverty means in this dataset so you can give the appropriate context!

Median:

[1] 2.7

IQR:

[1] 3.47
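A minimal sketch of how values like these can be computed. The helper `median_iqr()` and the vector `toy_poverty` are hypothetical names introduced for illustration; the guarded part applies the same helper to `NHANES$Poverty` only if the package is installed.

```r
# Hypothetical helper: median and IQR, ignoring missing values
median_iqr <- function(x) {
  c(median = median(x, na.rm = TRUE), IQR = IQR(x, na.rm = TRUE))
}

# Made-up values with one missing observation
toy_poverty <- c(0.5, 1.2, 2.7, NA, 4.1, 5.0)
result <- median_iqr(toy_poverty)
result

# The same helper on the real variable, if the NHANES package is available
if (requireNamespace("NHANES", quietly = TRUE)) {
  data("NHANES", package = "NHANES")
  print(median_iqr(NHANES$Poverty))
}
```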

1.2.6 Investigate at which age people generally reach their adult height.

You can use whatever data visualization tool to look at this. Hint: age and height are both numeric variables!
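Since both variables are numeric, a scatterplot is one natural starting point. The sketch below assumes the ggplot2 package is installed and uses made-up values in a hypothetical data frame `toy`; the column names `Age` and `Height` match the variables named in this question.

```r
library(ggplot2)

# Hypothetical stand-in values; replace `toy` with NHANES for the real data
toy <- data.frame(
  Age    = c(2, 5, 10, 15, 18, 20, 30, 40, NA),
  Height = c(88, 110, 140, 168, 175, 176, 175, NA, 170)
)

# Scatterplot of height against age; na.rm drops incomplete rows quietly
p <- ggplot(toy, aes(x = Age, y = Height)) +
  geom_point(alpha = 0.3, na.rm = TRUE) +
  labs(x = "Age (years)", y = "Height (cm)")
p
```

With the real data, look for the age at which the cloud of points stops rising and levels off.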

1.2.7 Investigate the relationship between trouble sleeping and hours slept.

This may require you to use a few options to visualize the data! Also, hours slept is numeric, but there are only 11 unique values. It might be interesting to try out the visualization methods for two categorical variables.

Options to look at the relationship:
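Two possible approaches are sketched below in base R, one treating hours slept as numeric and one treating it as categorical. The data frame `toy` and its values are made up for illustration; the column names `SleepHrsNight` and `SleepTrouble` are assumed to match the NHANES variables.

```r
# Hypothetical stand-in data
toy <- data.frame(
  SleepHrsNight = c(5, 6, 6, 7, 7, 8, 8, 9, NA),
  SleepTrouble  = factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No"))
)

# Option 1: compare the median hours slept within each group
meds <- tapply(toy$SleepHrsNight, toy$SleepTrouble, median, na.rm = TRUE)
meds

# Option 2: treat hours as categorical and build a two-way table
counts <- table(hours = toy$SleepHrsNight, trouble = toy$SleepTrouble)
counts

# Proportion with and without trouble sleeping within each number of hours
round(prop.table(counts, margin = 1), 2)
```

The same table can feed a bar plot, for example with `barplot(t(counts))`, if a visual comparison is preferred.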
