TB sections 4.3, 5.1
2024-11-06
Research question in a generic form: Is there evidence to support that the population mean is different from \(\mu\)?
Two approaches to answer this question:
Confidence interval
Hypothesis test
Do these confidence intervals include \(\mu\)?
Assuming the population mean is \(\mu\), what is the probability that we observe \(\overline{x}\) or a more extreme sample mean?
Case 1: We know the population standard deviation
\[\overline{x}\ \pm\ z^*\times \text{SE}\]
\(z^*\) = qnorm(p = 0.975) \(=1.96\) (for a 95% CI)
Case 2: We do not know the population sd
\[\overline{x}\ \pm\ t^*\times \text{SE}\]
\(t^*\) = qt(p = 0.975, df = n-1) (for a 95% CI)
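As a minimal sketch, the two critical values can be computed side by side. The sample size `n <- 130` is borrowed from the body temperature example used later in these notes; `conf_level` and `alpha` are names introduced here for illustration.

```r
# Critical values for a 95% confidence interval
conf_level <- 0.95
alpha <- 1 - conf_level

# Case 1: population sd known, so use the standard normal distribution
z_star <- qnorm(p = 1 - alpha / 2)

# Case 2: population sd unknown, so use the t-distribution with n - 1 df
n <- 130  # assumed sample size (body temperature example)
t_star <- qt(p = 1 - alpha / 2, df = n - 1)

round(c(z_star = z_star, t_star = t_star), 3)
```

The t critical value is slightly larger than the normal one, which makes the t-based interval slightly wider for the same standard error.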
Case 1: We know the population standard deviation
We use a test statistic from a Normal distribution: \[z_{\overline{x}} = \dfrac{\overline{x} - \mu}{SE}\]
with \(\text{SE} = \frac{\sigma}{\sqrt{n}}\) and \(\sigma\) is the population standard deviation
Case 2: We do not know the population sd
We use a test statistic from a Student’s t-distribution: \[t_{\overline{x}} = \dfrac{\overline{x} - \mu}{SE}\]
with \(\text{SE} = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation
Question: based on the 1992 JAMA data, is there evidence to support that the population mean body temperature is different from 98.6°F?
Two approaches to answer this question:
Confidence interval
Hypothesis test
\[\overline{x} = 98.25,~s=0.733,~n=130\]
CI for \(\mu\): \[\begin{aligned} \overline{x} &\pm t^*\cdot\frac{s}{\sqrt{n}}\\ 98.25 &\pm 1.979\cdot\frac{0.733}{\sqrt{130}}\\ 98.25 &\pm 0.127\\ (98.123&, 98.377) \end{aligned}\]
Used \(t^*\) = qt(.975, df=129) = 1.979
Conclusion: We are 95% confident that the (population) mean body temperature is between 98.123°F and 98.377°F. Because this interval does not include 98.6°F, the population mean is discernibly different from 98.6°F.
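The interval above can be reproduced from the summary statistics alone. This is a sketch that uses the rounded values reported in these notes (not the raw data); the variable names are chosen here for illustration.

```r
# Summary statistics reported for the body temperature sample
x_bar <- 98.25   # sample mean
s     <- 0.733   # sample standard deviation
n     <- 130     # sample size
mu_0  <- 98.6    # value being compared against

se     <- s / sqrt(n)              # standard error of the mean
t_star <- qt(0.975, df = n - 1)    # critical value for a 95% CI

# 95% CI: x_bar -/+ t_star * se
ci <- x_bar + c(-1, 1) * t_star * se
round(ci, 3)

# Does the interval contain 98.6?
mu_0 >= ci[1] & mu_0 <= ci[2]
```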
From before:
Unlike a confidence interval, a hypothesis test does not give us a range of plausible values for the population mean \(\mu\).
Instead, we calculate a test statistic and p-value
How do we calculate a test statistic and p-value?
1. Check the assumptions
2. Set the level of significance \(\alpha\)
3. Specify the null (\(H_0\)) and alternative (\(H_A\)) hypotheses
4. Calculate the test statistic
5. Calculate the p-value based on the observed test statistic and its sampling distribution
6. Write a conclusion to the hypothesis test
The assumptions to run a hypothesis test on a sample are:
In our example, we would check the assumptions with a statement:
Before doing a hypothesis test, we set a cut-off for how small the \(p\)-value should be in order to reject \(H_0\).
It is important to specify how rare or unlikely an event must be in order to represent sufficient evidence against the null hypothesis.
We call this the significance level, denoted by the Greek symbol alpha ( \(\alpha\) )
This parallels the confidence level of a confidence interval: \(\alpha = 0.05\) corresponds to a 95% confidence interval.
In statistics, a hypothesis is a statement about the value of an unknown population parameter.
A hypothesis test consists of a test between two competing hypotheses:
Example of hypotheses in words:
\[\begin{aligned} H_0 &: \text{The population mean body temperature is 98.6°F}\\ \text{vs. } H_A &: \text{The population mean body temperature is not 98.6°F} \end{aligned}\]
Notation for hypotheses
Hypotheses test for example
We call \(\mu_0\) the null value (hypothesized population mean from \(H_0\))
\(H_A: \mu \neq \mu_0\)
\(H_A: \mu < \mu_0\)
\(H_A: \mu > \mu_0\)
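The choice among these three alternatives determines which tail(s) of the sampling distribution count as "extreme" when the p-value is computed. The sketch below assumes an observed t statistic of -5.45 with 129 degrees of freedom, taken from the body temperature example in these notes. In `t.test()`, the same choice is made through the `alternative` argument, which accepts `"two.sided"`, `"less"`, or `"greater"`.

```r
t_obs <- -5.45  # observed test statistic (body temperature example)
df    <- 129    # degrees of freedom, n - 1

# H_A: mu != mu_0  (both tails)
p_two_sided <- 2 * pt(-abs(t_obs), df = df)

# H_A: mu < mu_0   (lower tail only)
p_less <- pt(t_obs, df = df)

# H_A: mu > mu_0   (upper tail only)
p_greater <- pt(t_obs, df = df, lower.tail = FALSE)

c(two_sided = p_two_sided, less = p_less, greater = p_greater)
```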
Case 1: We know the population standard deviation
We use a test statistic from a Normal distribution: \[z_{\overline{x}} = \dfrac{\overline{x} - \mu_0}{SE}\]
with \(\text{SE} = \frac{\sigma}{\sqrt{n}}\) and \(\sigma\) is the population standard deviation
Statistical theory tells us that \(z_{\overline{x}}\) follows a Standard Normal distribution \(N(0,1)\)
Case 2: We do not know the population sd
We use a test statistic from a Student’s t-distribution: \[t_{\overline{x}} = \dfrac{\overline{x} - \mu_0}{SE}\]
with \(\text{SE} = \frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation
Statistical theory tells us that \(t_{\overline{x}}\) follows a Student’s t distribution with degrees of freedom (df) = \(n-1\)
\(\overline{x}\) = sample mean, \(\mu_0\) = hypothesized population mean from \(H_0\),
\(\sigma\) = population standard deviation, \(s\) = sample standard deviation,
\(n\) = sample size
From our example: Recall that \(\overline{x} = 98.25\), \(s=0.733\), and \(n=130\)
The test statistic is:
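\[t_{\overline{x}} = \frac{\overline{x} - \mu_0}{\frac{s}{\sqrt{n}}} = \frac{98.25 - 98.6}{\frac{0.733}{\sqrt{130}}} \approx -5.45\]
(The value \(-5.45\) comes from the unrounded sample statistics, as in the t.test output; the rounded values shown here give about \(-5.44\).)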
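A minimal sketch of the same calculation in R, using the rounded summary statistics from these notes. The variable names are illustrative.

```r
x_bar <- 98.25   # sample mean
mu_0  <- 98.6    # null value from H_0
s     <- 0.733   # sample standard deviation
n     <- 130     # sample size

se     <- s / sqrt(n)          # standard error of the mean
t_stat <- (x_bar - mu_0) / se  # about -5.44 with these rounded inputs
t_stat
```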
The p-value is the probability of obtaining a test statistic as extreme as or more extreme than the observed test statistic, assuming the null hypothesis \(H_0\) is true.
Calculate the p-value using the Student’s t-distribution with \(df = n-1 = 130-1=129\):
\[\text{p-value}=P(T \leq -5.45) + P(T \geq 5.45) = 2.410889 \times 10^{-7}\]
[1] 2.410889e-07
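The line above is R console output. One way to compute a two-sided p-value of this kind is sketched below; it assumes the less-rounded test statistic -5.4548 reported by `t.test()` later in these notes, which may not be the exact input used for the output shown.

```r
t_obs <- -5.4548   # observed test statistic (assumed input)
df    <- 130 - 1   # degrees of freedom

# Sum of both tail probabilities
p_value <- pt(-abs(t_obs), df = df) +
  pt(abs(t_obs), df = df, lower.tail = FALSE)

# Equivalent shortcut, since the t-distribution is symmetric
p_value_sym <- 2 * pt(-abs(t_obs), df = df)

p_value
```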
If \(\text{p-value} < \alpha\), reject the null hypothesis
If \(\text{p-value} \geq \alpha\), fail to reject the null hypothesis
Recall the \(p\)-value = \(2.410889 \times 10^{-7}\)
Need to compare back to our selected \(\alpha = 0.05\)
Do we reject or fail to reject \(H_0\)?
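The decision rule can be written as a small helper. This is an illustrative sketch; `decide()` is a hypothetical function defined here, not part of base R or any package.

```r
# Hypothetical helper: apply the decision rule for a given alpha
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

p_value  <- 2.410889e-07  # p-value from the body temperature example
decision <- decide(p_value, alpha = 0.05)
decision
```

Because the p-value is far below 0.05, the rule returns "reject H0" for this example.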
Conclusion statement:
BodyTemperatures.csv
library(here)   # first install this package
library(dplyr)  # provides glimpse()
# location: look in the "data" folder for the file "BodyTemperatures.csv"
BodyTemps <- read.csv(here::here("data", "BodyTemperatures.csv"))
glimpse(BodyTemps)
Rows: 130
Columns: 3
$ Temperature <dbl> 96.3, 96.7, 96.9, 97.0, 97.1, 97.1, 97.1, 97.2, 97.3, 97.4…
$ Gender <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ HeartRate <int> 70, 71, 74, 80, 73, 75, 82, 64, 69, 70, 68, 72, 78, 70, 75…
t.test: base R’s function for testing one mean
We named the dataset BodyTemps when we loaded it
(temps_ttest <- t.test(x = BodyTemps$Temperature,
                       alternative = "two.sided", # default setting
                       mu = 98.6))
One Sample t-test
data: BodyTemps$Temperature
t = -5.4548, df = 129, p-value = 2.411e-07
alternative hypothesis: true mean is not equal to 98.6
95 percent confidence interval:
98.12200 98.37646
sample estimates:
mean of x
98.24923
Note that the test output also gives the 95% CI using the t-distribution.
tidy() the t.test output
Use the tidy() function from the broom package for briefer output in table format that’s stored as a tibble
Combined with the gt() function from the gt package, we get a nice table
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
98.24923 | -5.454823 | 2.410632e-07 | 129 | 98.122 | 98.37646 | One Sample t-test | two.sided |
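A sketch of how a table like this is produced and how single values can be extracted from it. To stay self-contained, the example uses simulated temperatures as a stand-in (they are not the JAMA measurements, so the numbers will differ from the table above); `sim_temps`, `sim_ttest`, and `sim_tidy` are names introduced here. It assumes the broom and dplyr packages are installed.

```r
library(broom)  # tidy()
library(dplyr)  # pull() and the %>% pipe

# Simulated stand-in data (NOT the real body temperature data)
set.seed(11)
sim_temps <- rnorm(n = 130, mean = 98.25, sd = 0.733)

sim_ttest <- t.test(x = sim_temps, mu = 98.6)
sim_tidy  <- tidy(sim_ttest)  # one-row tibble with the columns shown above

# Extract single values by column name
p_val   <- sim_tidy %>% pull(p.value)
ci_low  <- sim_tidy %>% pull(conf.low)
ci_high <- sim_tidy %>% pull(conf.high)

c(p_value = p_val, conf_low = ci_low, conf_high = ci_high)
```

With the real data, the same calls applied to `temps_ttest` would return the values in the table.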
Since the tidy() output is a tibble, we can easily pull() specific values from it.
CIs and hypothesis testing for different scenarios:
Lesson | Section | Population parameter | Symbol (pop) | Point estimate | Symbol (sample) |
---|---|---|---|---|---|
11 | 5.1 | Pop mean | \(\mu\) | Sample mean | \(\overline{x}\) |
12 | 5.2 | Pop mean of paired diff | \(\mu_d\) or \(\delta\) | Sample mean of paired diff | \(\overline{x}_{d}\) |
13 | 5.3 | Diff in pop means | \(\mu_1-\mu_2\) | Diff in sample means | \(\overline{x}_1 - \overline{x}_2\) |
15 | 8.1 | Pop proportion | \(p\) | Sample prop | \(\widehat{p}\) |
15 | 8.2 | Diff in pop prop’s | \(p_1-p_2\) | Diff in sample prop’s | \(\widehat{p}_1-\widehat{p}_2\) |
Example of hypothesis test based on the 1992 JAMA data
Is there evidence to support that the population mean body temperature is different from 98.6°F?
Hypotheses:
\[\begin{aligned} H_0 &: \mu = 98.6\\ \text{vs. } H_A&: \mu \neq 98.6 \end{aligned}\]
4-5. Test statistic and p-value:
temps_ttest <- t.test(x = BodyTemps$Temperature, mu = 98.6)
tidy(temps_ttest) %>% gt() %>% tab_options(table.font.size = 36)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
98.24923 | -5.454823 | 2.410632e-07 | 129 | 98.122 | 98.37646 | One Sample t-test | two.sided |
Lesson 11 Slides