TB sections 8.1-8.2
2024-11-25
Remind ourselves of the Normal approximation of the binomial distribution and define the sampling distribution of a sample proportion
Run a hypothesis test for a single proportion and interpret the results.
Construct and interpret confidence intervals for a single proportion.
Understand how CLT applies to a difference in binomial random variables
Run a hypothesis test for a difference in proportions and interpret the results.
Construct and interpret confidence intervals for a difference in proportions.
Previously, we have discussed methods of inference for numerical data
Categorical data arise frequently in medical research
Binomial random variable
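\(X\) is a binomial random variable if it represents the number of successes in \(n\) independent replications (or trials) of an experiment where each trial has two possible outcomes (success or failure) and the probability of success \(p\) is the same for every trial.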
A binomial random variable takes on values \(0, 1, 2, \dots, n\).
If a r.v. \(X\) is modeled by a Binomial distribution, then we write in shorthand \(X \sim \text{Binom}(n,p)\)
Quick example: The number of heads in 3 tosses of a fair coin is a binomial random variable with parameters \(n = 3\) and \(p = 0.5\).
Distribution of a Binomial random variable
Let \(X\) be the total number of successes in \(n\) independent trials, each with probability \(p\) of a success. Then the probability of observing exactly \(x\) successes in \(n\) independent trials is
\[P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, x= 0, 1, 2, \dots, n \]
The parameters of a binomial distribution are \(p\) and \(n\).
Mean and variance of a Binomial r.v
If \(X\) is a binomial r.v. with probability of success \(p\), then \(E(X) = np\) and \(\text{Var}(X)=np(1-p)\)
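As a quick check of the formulas above, here is a minimal R sketch for the coin-toss example (\(n = 3\), \(p = 0.5\)). The variable names are illustrative only and are not from the slides.

```r
# Coin-toss example: X ~ Binom(n = 3, p = 0.5)
n <- 3
p <- 0.5
x <- 0:n

# PMF from the formula: choose(n, x) * p^x * (1 - p)^(n - x)
pmf_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# Same probabilities from R's built-in binomial PMF
pmf_dbinom <- dbinom(x, size = n, prob = p)

# Mean and variance computed directly from the PMF
mean_x <- sum(x * pmf_formula)
var_x  <- sum((x - mean_x)^2 * pmf_formula)

print(rbind(x = x, pmf = pmf_formula))
print(c(mean = mean_x, np = n * p, var = var_x, npq = n * p * (1 - p)))
```

The mean and variance computed from the PMF should agree with \(np\) and \(np(1-p)\).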
Also known as: Sampling distribution of \(\widehat{p}\)
If \(X\sim \text{Binomial}(n,p)\) and \(np>10\) and \(nq = n(1-p) > 10\)
THEN approximately \[X\sim \text{Normal}\big(\mu_X = np, \sigma_X = \sqrt{np(1-p)} \big)\]
Continuity Correction: Applied to account for the fact that the binomial distribution is discrete, while the normal distribution is continuous
\[X \sim N\Big(\mu = np, \sigma = \sqrt{np(1-p)} \Big)\]
\[\hat{p} \sim N\Big(\mu_{\hat{p}} = p, \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \Big)\]
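To see the Normal approximation and the continuity correction in action, the sketch below compares an exact binomial tail probability with its Normal approximations. The setting (\(n = 52\), \(p = 0.30\), and the cutoff 21) is borrowed from the melanoma example used later in these slides; treating it as an illustration here is an assumption.

```r
# Illustrative setting: X ~ Binom(n = 52, p = 0.30)
n <- 52
p <- 0.30

# Rule-of-thumb check: np and n(1 - p) should both exceed 10
print(c(np = n * p, nq = n * (1 - p)))

mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

# Exact binomial probability P(X >= 21) = P(X > 20)
exact <- pbinom(20, size = n, prob = p, lower.tail = FALSE)

# Normal approximation without and with the continuity correction
approx_plain <- pnorm(21,   mean = mu, sd = sigma, lower.tail = FALSE)
approx_cc    <- pnorm(20.5, mean = mu, sd = sigma, lower.tail = FALSE)

print(c(exact = exact, normal = approx_plain, normal_cc = approx_cc))
```

The continuity correction moves the cutoff half a unit (21 to 20.5) so that the whole bar for \(X = 21\) is included in the Normal area.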
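Population parameter
Population proportion: \(p\), \(\pi\) (“pi”)
Sample statistic (point estimate)
Sample proportion: \(\hat{p}\), \(\hat{\pi}\) (“pi-hat”)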
Calculate CI for the proportion \(p\):
\[\hat{p} \pm z^* \cdot SE_{\hat{p}} = \hat{p} \pm z^* \cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
Run a hypothesis test:
Hypotheses
\[\begin{align} H_0:& p = p_0 \\ H_A:& p \neq p_0 \\ (or&~ <, >) \end{align}\]
Test statistic
\[ z_{\hat{p}} = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0\cdot(1-p_0)}{n}}} \]
Arguments of prop.test():
x: Counts of successes (can have one x or a vector of multiple x’s)
n: Number of trials (can have one n or a vector of multiple n’s)
p: Null value that we think the population proportion is
alternative: Whether the alternative hypothesis is \(\neq\), \(<\), or \(>\)
conf.level: Confidence level (\(1-\alpha\)), with \(\alpha\) typically 0.05
correct: Continuity correction, whether we should use it or not; the default is TRUE (Nicky says keep it this way!)
Looking for therapies that trigger an immune response to advanced melanoma
In a study, 52 patients were treated concurrently with two new therapies, nivolumab and ipilimumab; 21 of them had an immune response
Outcome: whether or not each person has an immune response
Questions that can be addressed with inference…
What is the estimated population probability of immune response following concurrent therapy with nivolumab and ipilimumab? (calculate \(\hat{p}\))
What is the 95% confidence interval for the estimated population probability of immune response following concurrent therapy with nivolumab and ipilimumab? (95% CI of \(p\))
In previous studies, the proportion of patients responding to one of these agents was 30% or less. Do these results suggest that the probability of response to concurrent therapy is better than 0.30? (Hypothesis test of null of 0.3)
Check the assumptions
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic.
Calculate the p-value based on the observed test statistic and its sampling distribution
Write a conclusion to the hypothesis test
The sampling distribution of \(\hat{p}\) is approximately normal when
The sample observations are independent, and
At least 10 successes and 10 failures are expected in the sample: \(np_0 \geq 10\) and \(n(1-p_0) \geq 10\).
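The success-failure condition can be checked numerically. This is a small sketch for the melanoma example (\(n = 52\), \(p_0 = 0.30\)); independence has to be judged from the study design and is not something the code can verify.

```r
n  <- 52    # sample size from the melanoma example
p0 <- 0.30  # null value

expected_successes <- n * p0
expected_failures  <- n * (1 - p0)

# Both expected counts should be at least 10
conditions_met <- expected_successes >= 10 && expected_failures >= 10

print(c(successes = expected_successes, failures = expected_failures))
print(conditions_met)
```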
Before doing a hypothesis test, we set a cut-off for how small the \(p\)-value should be in order to reject \(H_0\).
Typically choose \(\alpha = 0.05\)
Notation for hypotheses (for a single proportion)
Hypothesis test for example
We call \(p_0\) the null value (hypothesized population proportion from \(H_0\))
\(H_A: p \neq p_0\)
\(H_A: p < p_0\)
\(H_A: p > p_0\)
Null and alternative hypotheses in words and in symbols.
One sample test
\(H_0\): For individuals who have advanced melanoma and received a treatment of nivolumab and ipilimumab, the population proportion of immune response is 0.30
\(H_A\): For individuals who have advanced melanoma and received a treatment of nivolumab and ipilimumab, the population proportion of immune response is NOT 0.30
\[\begin{align} H_0:& p = 0.30\\ H_A:& p \neq 0.30\\ \end{align}\]
Sampling distribution of \(\hat{p}\) if we assume \(H_0: p=p_0\) is true:
\[\hat{p} \sim N\left(\mu_{\hat{p}} = p, \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \right) \sim N\left( \mu_{\hat{p}}=p_0, \sigma_{\hat{p}}=\sqrt{\frac{p_0\cdot(1-p_0)}{n}} \right)\]
Test statistic for a one sample proportion test:
\[ \begin{aligned} \text{test stat} = & \frac{\text{point estimate}-\text{null value}}{SE}\\ z_{\hat{p}} = & \frac{\hat{p} - p_0}{\sqrt{\frac{p_0\cdot(1-p_0)}{n}}} \end{aligned} \]
From our example: Recall that \(\hat{p} = \dfrac{21}{52}= 0.4038\), \(n=52\), and \(p_0 = 0.30\)
The test statistic is:
\[ \begin{align} z_{\hat{p}} &= \frac{\hat{p} - p_0}{\sqrt{\frac{p_0\cdot(1-p_0)}{n}}} = \frac{21/52 - 0.30}{\sqrt{\frac{0.30\cdot(1-0.30)}{52}}} = 1.6341143 \end{align} \]
The p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true.
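The test statistic and a two-sided p-value can be computed by hand in R as sketched below. This is the uncorrected z-test from the formula above; `prop.test()` applies a continuity correction by default and reports a chi-squared statistic, so its p-value will not match this one exactly.

```r
x  <- 21    # patients with an immune response
n  <- 52    # patients treated
p0 <- 0.30  # null value

p_hat <- x / n
se0   <- sqrt(p0 * (1 - p0) / n)   # standard error under H0
z     <- (p_hat - p0) / se0

# Two-sided p-value: area in both tails beyond |z|
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

print(c(p_hat = p_hat, z = z, p_value = p_value))
```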
prop.test()
1-sample proportions test with continuity correction
data: 21 out of 52, null probability 0.3
X-squared = 2.1987, df = 1, p-value = 0.1381
alternative hypothesis: true p is not equal to 0.3
95 percent confidence interval:
0.2731269 0.5487141
sample estimates:
p
0.4038462
prop.test()
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
0.4038462 | 2.198718 | 0.1381256 | 1 | 0.2731269 | 0.5487141 | 1-sample proportions test with continuity correction | two.sided |
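The output above is consistent with a call along the following lines. The arguments spelled out here are the defaults apart from `p`; the object name `res` is an assumption. The one-row table resembles what `broom::tidy()` returns, but that is a guess and the sketch sticks to base R.

```r
# One-sample proportion test: 21 responses out of 52, null value 0.30
res <- prop.test(x = 21, n = 52, p = 0.30,
                 alternative = "two.sided",
                 conf.level = 0.95,
                 correct = TRUE)
print(res)

# Individual pieces of the result
res$statistic  # X-squared (continuity-corrected)
res$p.value
res$conf.int
res$estimate
```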
\[\begin{align} H_0:& p = 0.30\\ H_A:& p \neq 0.30\\ \end{align}\]
Conclusion statement:
Confidence interval conditions
\[n\hat{p} \ge 10, \ \ n(1-\hat{p})\ge 10\]
Hypothesis test conditions
\[n p_0 \ge 10, \ \ n(1-p_0)\ge 10\]
What to use for SE in CI formula?
\[\hat{p} \pm z^* \cdot SE_{\hat{p}}\]
Sampling distribution of \(\hat{p}\):
\[\hat{p} \sim N\left(\mu_{\hat{p}} = p, \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \right)\]
Problem: We don’t know what \(p\) is - it’s what we’re estimating with the CI.
Solution: approximate \(p\) with \(\hat{p}\):
\[SE_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
95% CI for population proportion \(p\):
\[\begin{align} \hat{p} &\pm z^* \cdot SE_{\hat{p}}\\ \hat{p} &\pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\ 0.404 &\pm 1.96\cdot \sqrt{\frac{0.404(1-0.404)}{52}} \\ 0.404 &\pm 1.96\cdot 0.068\\ 0.404 &\pm 0.133\\ (0.27&, 0.537) \end{align}\]
Used \(z^*\) = qnorm(0.975) = 1.96
“By hand” Conclusion:
We are 95% confident that the (population) proportion of individuals with an immune response is between 0.27 and 0.537.
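The by-hand interval can be reproduced in a few lines of R. This is a sketch of the formula \(\hat{p} \pm z^* \cdot SE_{\hat{p}}\) with no continuity correction; the variable names are illustrative.

```r
x <- 21
n <- 52

p_hat  <- x / n
se     <- sqrt(p_hat * (1 - p_hat) / n)  # SE uses p_hat, not p0
z_star <- qnorm(0.975)                   # critical value for 95% confidence

# CI conditions use the observed counts of successes and failures
print(c(successes = n * p_hat, failures = n * (1 - p_hat)))

ci <- p_hat + c(-1, 1) * z_star * se
print(round(ci, 3))
```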
1-sample proportions test with continuity correction
data: 21 out of 52, null probability 0.5
X-squared = 1.5577, df = 1, p-value = 0.212
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2731269 0.5487141
sample estimates:
p
0.4038462
R Conclusion:
We are 95% confident that the (population) proportion of individuals with an immune response is between 0.273 and 0.549.
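The R interval differs slightly from the by-hand one because, according to R's documentation, `prop.test()` builds its one-sample interval by inverting the score test and applies a continuity correction by default, instead of using the \(\hat{p} \pm z^* \cdot SE_{\hat{p}}\) formula. A sketch comparing the three intervals (object names are illustrative):

```r
x <- 21
n <- 52

# Interval from prop.test() with the default continuity correction
ci_cc <- prop.test(x, n, correct = TRUE)$conf.int

# Same method without the continuity correction
ci_no_cc <- prop.test(x, n, correct = FALSE)$conf.int

# By-hand interval: p_hat +/- z* x SE
p_hat   <- x / n
ci_wald <- p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)

print(rbind(with_cc = ci_cc, without_cc = ci_no_cc, by_hand = ci_wald))
```

The continuity-corrected interval is the widest of the three.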
Population parameter
Population 1 proportion: \(p_1\), \(\pi_1\) (“pi”)
Population 2 proportion: \(p_2\), \(\pi_2\) (“pi”)
Sample statistic (point estimate)
Sample 1 proportion: \(\hat{p}_1\), \(\hat{\pi}_1\) (“pi”)
Sample 2 proportion: \(\hat{p}_2\), \(\hat{\pi}_2\) (“pi”)
\[\hat{p}_1 - \hat{p}_2 \sim N \left(\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, ~~ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{ \frac{p_1\cdot(1-p_1)}{n_1} + \frac{p_2\cdot(1-p_2)}{n_2}} \right)\]
Calculate CI for the proportion difference \(p_1 - p_2\):
\[\hat{p}_1 - \hat{p}_2 \pm z^* \cdot SE_{\hat{p}_1 - \hat{p}_2}\]
Run a hypothesis test:
Hypotheses
\[\begin{align} H_0:& p_1 - p_2 = 0 \\ H_A:& p_1 - p_2 \neq 0 \\ (or&~ <, >) \end{align}\]
Test statistic
\[ z_{\hat{p}_1 - \hat{p}_2} = \frac{\hat{p}_1 - \hat{p}_2}{SE_{pool}} \]
A 30-year study to investigate the effectiveness of mammograms versus a standard non-mammogram breast cancer exam was conducted in Canada with 89,835 participants. Each person was randomized to receive either annual mammograms or standard physical exams for breast cancer over a 5-year screening period.
By the end of the 25-year follow-up period, 1,005 people died from breast cancer. The results are summarized in the following table.
Group | Death from breast cancer: Yes | Death from breast cancer: No | Total |
---|---|---|---|
Control Group | 505 | 44405 | 44910 |
Mammogram Group | 500 | 44425 | 44925 |
Total | 1005 | 88830 | 89835 |
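For working with these counts in R, one option is to enter the table as a matrix. This is only a sketch; the object and dimension names are assumptions.

```r
# 2 x 2 table of counts: rows are groups, columns are outcomes
counts <- matrix(c(505, 44405,
                   500, 44425),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Group = c("Control", "Mammogram"),
                                 Death = c("Yes", "No")))

# Add row and column totals
print(addmargins(counts))

# Sample proportion who died from breast cancer in each group
print(prop.table(counts, margin = 1)[, "Yes"])
```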
Check the assumptions
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic.
Calculate the p-value based on the observed test statistic and its sampling distribution
Write a conclusion to the hypothesis test
\[\text{pooled proportion} = \hat{p}_{pool} = \dfrac{\text{total number of successes} }{ \text{total number of cases}} = \frac{x_1+x_2}{n_1+n_2}\]
\[\hat{p}_{pool} = \frac{x_1+x_2}{n_1+n_2} = \frac{500 + 505}{(500 + 44425) + (505 + 44405)} = 0.01119\]
Conditions:
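One common version of the success-failure condition for the two-sample test checks that each group has at least 10 expected successes and 10 expected failures under the pooled proportion. Using that version here is an assumption, since the slide's own list of conditions is not reproduced. A sketch of the check:

```r
x <- c(mamm = 500,   ctrl = 505)     # deaths from breast cancer
n <- c(mamm = 44925, ctrl = 44910)   # group sizes

# Pooled proportion: total successes / total cases
p_pool <- sum(x) / sum(n)

expected_successes <- n * p_pool
expected_failures  <- n * (1 - p_pool)

print(round(p_pool, 5))
print(rbind(expected_successes, expected_failures))

# TRUE when every expected count is at least 10
print(all(expected_successes >= 10, expected_failures >= 10))
```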
Two samples test
\(H_0\): The difference in population proportions of deaths from breast cancer among people who received annual mammograms and annual physical check-ups is 0.
\(H_A\): The difference in population proportions of deaths from breast cancer among people who received annual mammograms and annual physical check-ups is not 0.
\[\begin{align} H_0:& p_{mamm} - p_{ctrl} = 0\\ H_A:& p_{mamm} - p_{ctrl} \neq 0\\ \end{align}\]
Sampling distribution of \(\hat{p}_1 - \hat{p}_2\): \[\hat{p}_1 - \hat{p}_2 \sim N \left(\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, ~~ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{ \frac{p_1\cdot(1-p_1)}{n_1} + \frac{p_2\cdot(1-p_2)}{n_2}} \right)\]
Since we assume \(H_0: p_1 - p_2 = 0\) is true, we “pool” the proportions of the two samples to calculate the SE:
\[\text{pooled proportion} = \hat{p}_{pool} = \dfrac{\text{total number of successes} }{ \text{total number of cases}} = \frac{x_1+x_2}{n_1+n_2}\]
Test statistic:
\[ \text{test statistic} = z_{\hat{p}_1 - \hat{p}_2} = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}(1-\hat{p}_{pool})}{n_2}}} \]
From our example: Recall that \(\hat{p}_1 = \dfrac{500}{44925}= 0.0111\), \(\hat{p}_2 = \dfrac{505}{44910}= 0.0112\), \(n_1=44925\), \(n_2=44910\), and \(\hat{p}_{pool} = 0.01119\)
The test statistic is:
\[ \begin{align} z_{\hat{p}_1 - \hat{p}_2} = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\frac{\hat{p}_{pool}\cdot(1-\hat{p}_{pool})}{n_1} + \frac{\hat{p}_{pool}\cdot(1-\hat{p}_{pool})}{n_2}}} = \frac{0.0111 -0.0112}{\sqrt{\frac{0.01119\cdot(1-0.01119)}{44925} + \frac{0.01119\cdot(1-0.01119)}{44910}}} = -0.163933 \end{align} \]
The p-value is the probability of obtaining a test statistic just as extreme or more extreme than the observed test statistic assuming the null hypothesis \(H_0\) is true.
Calculate the p-value:
\[\begin{align} & 2 \cdot P(\hat{p}_1 - \hat{p}_2< 0.0111 - 0.0112) \\ &= 2 \cdot P\left(Z_{\hat{p}_1 - \hat{p}_2} < \frac{0.0111 - 0.0112}{\sqrt{\frac{0.01119\cdot(1-0.01119)}{44925} + \frac{0.01119\cdot(1-0.01119)}{44910}}}\right) \\ &= 2 \cdot P(Z_{\hat{p}_1 - \hat{p}_2} < -0.164)\\ &= 0.8697839 \end{align}\]
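The pooled test statistic and two-sided p-value can be reproduced in R as sketched below, using the unrounded sample proportions (the rounded values 0.0111 and 0.0112 shown in the formulas would give a noticeably different \(z\)). Variable names are illustrative.

```r
x1 <- 500; n1 <- 44925   # mammogram group
x2 <- 505; n2 <- 44910   # control group

p1 <- x1 / n1
p2 <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)

# Pooled standard error, valid under H0: p1 - p2 = 0
se_pool <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
z <- (p1 - p2 - 0) / se_pool

# Two-sided p-value: twice the lower-tail area below -|z|
p_value <- 2 * pnorm(-abs(z))

print(c(z = z, p_value = p_value))
```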
prop.test()
2-sample test for equality of proportions with continuity correction
data: c(505, 500) out of c(44910, 44925)
X-squared = 0.01748, df = 1, p-value = 0.8948
alternative hypothesis: two.sided
95 percent confidence interval:
-0.001282751 0.001512853
sample estimates:
prop 1 prop 2
0.01124471 0.01112966
prop.test()
estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|
0.01124471 | 0.01112966 | 0.01747975 | 0.8948174 | 1 | -0.001282751 | 0.001512853 | 2-sample test for equality of proportions with continuity correction | two.sided |
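The output above is consistent with a call like the one sketched here, with the control group listed first as in the `data:` line. The object names are assumptions. Because `prop.test()` applies a continuity correction by default, its p-value differs a little from the by-hand z-test.

```r
deaths <- c(505, 500)       # control, mammogram (order shown in the output)
totals <- c(44910, 44925)

# Two-sample test for equality of proportions (defaults: two-sided, correct = TRUE)
res2 <- prop.test(x = deaths, n = totals)
print(res2)

res2$estimate   # prop 1 = control, prop 2 = mammogram
res2$conf.int   # interval for prop 1 - prop 2
```

Note that the reported interval is for prop 1 minus prop 2, so swapping the order of the groups flips its sign.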
\[\begin{align} H_0:& p_{mamm} - p_{ctrl} = 0\\ H_A:& p_{mamm} - p_{ctrl} \neq 0\\ \end{align}\]
Conclusion statement:
Confidence interval conditions
Hypothesis test conditions
What to use for SE in CI formula?
\[\hat{p}_1 - \hat{p}_2 \pm z^* \cdot SE_{\hat{p}_1 - \hat{p}_2}\]
SE in sampling distribution of \(\hat{p}_1 - \hat{p}_2\)
\[\sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{ \frac{p_1\cdot(1-p_1)}{n_1} + \frac{p_2\cdot(1-p_2)}{n_2}} \]
Problem: We don’t know what \(p_1\) and \(p_2\) are - their difference is what we’re estimating with the CI.
Solution: approximate \(p_1\), \(p_2\) with \(\hat{p}_1\), \(\hat{p}_2\):
\[SE_{\hat{p}_1 - \hat{p}_2} = \sqrt{ \frac{\hat{p}_1\cdot(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2\cdot(1-\hat{p}_2)}{n_2}}\]
95% CI for the difference in population proportions \(p_1 - p_2\):
\[\begin{align} \hat{p}_1 - \hat{p}_2 &\pm z^* \cdot SE_{\hat{p}_1 - \hat{p}_2}\\ \hat{p}_1 - \hat{p}_2 &\pm z^* \cdot \sqrt{ \frac{\hat{p}_1\cdot(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2\cdot(1-\hat{p}_2)}{n_2}} \\ 0.01113 - 0.01124 &\pm 1.96 \cdot \sqrt{\frac{0.01113\cdot(1-0.01113)}{44925} + \frac{0.01124\cdot(1-0.01124)}{44910}}\\ -0.00011 &\pm 1.96\cdot 0.0007\\ -0.00011 &\pm 0.0014\\ (-0.0015&, 0.0013) \end{align}\]
Used \(z^*\) = qnorm(0.975) = 1.96
Interpretation:
We are 95% confident that the difference in (population) proportions of deaths due to breast cancer comparing people who received annual mammograms to annual physical check-ups (mammogram minus control) is between -0.0015 and 0.0013.
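The by-hand interval for the difference (mammogram minus control) can be computed in R as sketched below, using the unpooled standard error and unrounded proportions. Variable names are illustrative.

```r
x1 <- 500; n1 <- 44925   # mammogram group
x2 <- 505; n2 <- 44910   # control group

p1 <- x1 / n1
p2 <- x2 / n2

# Unpooled SE: each group uses its own sample proportion
se     <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_star <- qnorm(0.975)

ci <- (p1 - p2) + c(-1, 1) * z_star * se

print(signif(c(diff = p1 - p2, se = se), 3))
print(round(ci, 4))
```

Keeping several significant digits matters here: the proportions are close to 0.011, so rounding the standard error to three decimal places distorts the interval.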
2-sample test for equality of proportions with continuity correction
data: c(505, 500) out of c(44910, 44925)
X-squared = 0.01748, df = 1, p-value = 0.8948
alternative hypothesis: two.sided
95 percent confidence interval:
-0.001282751 0.001512853
sample estimates:
prop 1 prop 2
0.01124471 0.01112966
R Conclusion:
We are 95% confident that the difference in (population) proportions of deaths due to breast cancer comparing people who received annual physical check-ups to annual mammograms (prop 1 minus prop 2 in the output, i.e. control minus mammogram) is between -0.0013 and 0.0015.
Lesson 15 Slides