TB sections 8.3-8.4
2024-12-02
Understand the Chi-squared test and how expected cell counts are calculated under the null hypothesis.
Determine if two categorical variables are associated with one another using the Chi-squared test.
What happens when we want to compare two or more groups’ proportions?
Knowing the age of a patient provides important information about the likelihood of hypertension
While the probability of hypertension of a randomly chosen adult is 0.29…
Question: Is there an association between age group and hypertension?
Age Group | Hypertension | No Hypertension | Total |
---|---|---|---|
18-39 yrs | 8836 | 112206 | 121042 |
40-59 yrs | 42109 | 88663 | 130772 |
60+ yrs | 39917 | 21589 | 61506 |
Total | 90862 | 222458 | 313320 |
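As a quick illustration of why age looks informative, the proportions with hypertension can be compared overall and within each age group. This is only a sketch: the object name `hyp` is hypothetical, and the counts are typed in by hand from the table above.

```r
# Observed counts from the table above (hypothetical object name `hyp`)
# Rows: age group, columns: hypertension status
hyp <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

# Overall proportion with hypertension, ignoring age
overall_prop <- sum(hyp[, "Hypertension"]) / sum(hyp)

# Proportion with hypertension within each age group (row proportions)
by_age_prop <- prop.table(hyp, margin = 1)[, "Hypertension"]

round(overall_prop, 2)
round(by_age_prop, 3)
```

The within-group proportions differ noticeably from the overall proportion, which is what motivates a formal test of association.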
General wording for hypotheses
Test of “association” wording
\(H_0\): There is no association between the two variables
\(H_A\): There is an association between the two variables
Test of “independence” wording
\(H_0\): The variables are independent
\(H_A\): The variables are not independent
Hypotheses for the example
Test of “association” wording
\(H_0\): There is no association between age and hypertension
\(H_A\): There is an association between age and hypertension
Test of “independence” wording
\(H_0\): The variables age and hypertension are independent
\(H_A\): The variables age and hypertension are not independent
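If two events \(A\) and \(B\) are independent, then
\[P(A \cap B)=P(A)P(B)\]
Under \(H_0\), this holds for every combination of age group and hypertension status: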
\[\begin{align} P(\text{18-39} \cap \text{hyp}) &= P(\text{18-39})P(\text{hyp})\\ P(\text{18-39} \cap \text{no hyp}) &= P(\text{18-39})P(\text{no hyp})\\ P(\text{40-59} \cap \text{hyp}) &= P(\text{40-59})P(\text{hyp})\\ P(\text{40-59} \cap \text{no hyp}) &= P(\text{40-59})P(\text{no hyp})\\ P(\text{60+} \cap \text{hyp}) &= P(\text{60+})P(\text{hyp})\\ P(\text{60+} \cap \text{no hyp}) &= P(\text{60+})P(\text{no hyp}) \end{align}\]
Age Group | Hypertension | No Hypertension | Total |
---|---|---|---|
18-39 yrs | 8836 | 112206 | 121042 |
40-59 yrs | 42109 | 88663 | 130772 |
60+ yrs | 39917 | 21589 | 61506 |
Total | 90862 | 222458 | 313320 |
\[\begin{align} P(\text{18-39} \cap \text{hyp}) &= \frac{121042}{313320}\cdot\frac{90862}{313320}\\ &\;\;\vdots\\ P(\text{60+} \cap \text{no hyp}) &=\frac{61506}{313320}\cdot\frac{222458}{313320} \end{align}\]
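The six products above can be computed in one step. The sketch below assumes the observed counts are stored in a matrix with the hypothetical name `hyp`; `outer()` multiplies every row proportion by every column proportion.

```r
# Observed counts (hypothetical object name `hyp`), typed in from the table above
hyp <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

n     <- sum(hyp)           # table total
row_p <- rowSums(hyp) / n   # estimated P(age group)
col_p <- colSums(hyp) / n   # estimated P(hypertension status)

# Joint probability of each cell if the two variables were independent
indep_p <- outer(row_p, col_p)
round(indep_p, 4)
```

Because the row proportions and the column proportions each sum to 1, the six cell probabilities also sum to 1.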
With these probabilities, we calculate the expected count for each cell of the table under \(H_0\), the hypothesis that the variables are independent. Each expected count is the cell's probability under independence multiplied by the table total.
Expected count of 40-59 years old and hypertension:
\[\begin{align} \text{expected count} &= \dfrac{\text{column total}\cdot \text{row total}}{\text{table total}} \\ &= \dfrac{90862\cdot 130772}{313320} \\ &\approx 37923.55 \end{align}\]
Observed:
Age Group | Hypertension | No Hypertension |
---|---|---|
18-39 yrs | 8836 | 112206 |
40-59 yrs | 42109 | 88663 |
60+ yrs | 39917 | 21589 |
Expected:
Age Group | Hypertension | No Hypertension |
---|---|---|
18-39 yrs | 35101.87 | 85940.13 |
40-59 yrs | 37923.55 | 92848.45 |
60+ yrs | 17836.58 | 43669.42 |
Expected count for cell \(i,j\) :
\[\textrm{Expected Count}_{\textrm{row } i,\textrm{ col }j}=\frac{(\textrm{row}~i~ \textrm{total})\cdot(\textrm{column}~j~ \textrm{total})}{\textrm{table total}}\]
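The general formula translates directly into base R. This is a minimal sketch, assuming the observed counts sit in a matrix with the hypothetical name `hyp`; it uses `outer()` on the row and column totals rather than a dedicated package function.

```r
# Observed counts (hypothetical object name `hyp`)
hyp <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

# (row i total) * (column j total) / (table total), for every cell at once
expected_counts <- outer(rowSums(hyp), colSums(hyp)) / sum(hyp)
round(expected_counts, 2)
```

The rounded values should agree with the table of expected counts shown above.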
R calculates expected cell counts using the expected() function in the epitools package.
Make sure the dataset is in matrix form using as.matrix().
Hypertension No_Hypertension
18-39 yrs 8836 112206
40-59 yrs 42109 88663
60+ yrs 39917 21589
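A matrix like the one printed above could be produced as follows. The slides do not show how the data were stored, so the data frame `hyp_df` and the matrix name `hyp_mat` are hypothetical stand-ins.

```r
# Hypothetical data frame of observed counts, one row per age group
hyp_df <- data.frame(
  Hypertension    = c(8836, 42109, 39917),
  No_Hypertension = c(112206, 88663, 21589),
  row.names       = c("18-39 yrs", "40-59 yrs", "60+ yrs")
)

# Convert to a numeric matrix; row and column names are carried over
hyp_mat <- as.matrix(hyp_df)
hyp_mat

# With the epitools package installed, the expected counts mentioned above
# could then be requested with:
# epitools::expected(hyp_mat)
```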
1. Check the assumptions
2. Set the level of significance \(\alpha\)
3. Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
4. Calculate the test statistic.
5. Calculate the p-value based on the observed test statistic and its sampling distribution
6. Write a conclusion to the hypothesis test
Hypotheses for the example
Test of “association” wording
\(H_0\): There is no association between age and hypertension
\(H_A\): There is an association between age and hypertension
Test of “independence” wording
\(H_0\): The variables age and hypertension are independent
\(H_A\): The variables age and hypertension are not independent
Test statistic for a test of association (independence):
\[\chi^2 = \sum_{\textrm{all cells}} \frac{(\textrm{observed} - \text{expected})^2}{\text{expected}}\]
\[\begin{align} \chi^2 &= \sum\frac{(O-E)^2}{E} \\ &= \frac{(8836-35101.87)^2}{35101.87} + \frac{(112206-85940.13)^2}{85940.13} + \\ &\quad \ldots + \frac{(21589-43669.42)^2}{43669.42} \\ &\approx 66831 \end{align}\]
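The same sum can be reproduced by hand in R. This is a sketch assuming the observed counts are in a matrix with the hypothetical name `hyp`; the expected counts are rebuilt from the row and column totals so the block runs on its own.

```r
# Observed counts (hypothetical object name `hyp`)
hyp <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

# Expected counts under independence
expected_counts <- outer(rowSums(hyp), colSums(hyp)) / sum(hyp)

# Contribution of each cell: (observed - expected)^2 / expected
cell_contrib <- (hyp - expected_counts)^2 / expected_counts
round(cell_contrib, 2)

# Test statistic: sum over all six cells
chi_sq <- sum(cell_contrib)
round(chi_sq)
```

Looking at `cell_contrib` also shows which cells deviate most from what independence would predict.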
Is this value big? Big enough to reject \(H_0\)?
Observed:
Age Group | Hypertension | No Hypertension |
---|---|---|
18-39 yrs | 8836 | 112206 |
40-59 yrs | 42109 | 88663 |
60+ yrs | 39917 | 21589 |
Expected:
Age Group | Hypertension | No Hypertension |
---|---|---|
18-39 yrs | 35101.87 | 85940.13 |
40-59 yrs | 37923.55 | 92848.45 |
60+ yrs | 17836.58 | 43669.42 |
The \(\chi^2\) distribution shape depends on its degrees of freedom
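For a table with \(r\) rows and \(c\) columns, the degrees of freedom for this test are \((r-1)(c-1)\). The sketch below applies that to the 3 x 2 age-by-hypertension table and looks up the upper-tail probability of the test statistic reported in this lesson. The comparison against a critical value assumes \(\alpha = 0.05\), the level used in this lesson's example.

```r
chi_sq <- 66831              # test statistic reported for the example
df     <- (3 - 1) * (2 - 1)  # (rows - 1) * (columns - 1) for a 3 x 2 table

# p-value: probability of a statistic at least this large under H0
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value  # so small that it underflows to (or very near) 0

# Critical value for alpha = 0.05: statistics above this lead to rejecting H0
crit <- qchisq(1 - 0.05, df = df)
crit
chi_sq > crit
```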
With the data in matrix or table form (using as.matrix() or table()), run chisq.test() in R.
Recall the hypotheses to our \(\chi^2\) test:
\(H_0\): There is no association between age and hypertension
\(H_A\): There is an association between age and hypertension
Conclusion statement:
Warning!!
If we fail to reject \(H_0\), we DO NOT say the variables are independent! We can only say that we have insufficient evidence that there is an association.
Hypertension No_Hypertension
18-39 yrs 35101.87 85940.13
40-59 yrs 37923.55 92848.45
60+ yrs 17836.58 43669.42
All expected cell counts are greater than 5.
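This check can also be done programmatically instead of by eye. A minimal sketch, assuming the observed counts are in a matrix with the hypothetical name `hyp`, and using the `expected` component that `chisq.test()` returns:

```r
# Observed counts (hypothetical object name `hyp`)
hyp <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

# Expected counts as computed internally by chisq.test()
expected_counts <- chisq.test(hyp)$expected

# TRUE only if every expected cell count exceeds 5
all(expected_counts > 5)

# Smallest expected count, for reference
min(expected_counts)
```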
\(\alpha = 0.05\)
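Hypotheses: \(H_0\): There is no association between age and hypertension; \(H_A\): There is an association between age and hypertension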
4-5. Calculate the test statistic and p-value for Chi-squared test in R
Pearson's Chi-squared test
data: hyp_data2
X-squared = 66831, df = 2, p-value < 2.2e-16
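Output like the above can be produced with a call along these lines. The object name `hyp_data2` is taken from the output; how it was built is not shown in the slides, so the `matrix()` call here is an assumption.

```r
# Observed counts; the name hyp_data2 follows the output above,
# the construction is an assumption
hyp_data2 <- matrix(
  c( 8836, 112206,
    42109,  88663,
    39917,  21589),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("18-39 yrs", "40-59 yrs", "60+ yrs"),
    c("Hypertension", "No_Hypertension")
  )
)

# Pearson's Chi-squared test of association
result <- chisq.test(hyp_data2)
result

# Individual pieces of the result
result$statistic   # X-squared
result$parameter   # degrees of freedom
result$p.value     # p-value
```

Storing the result in an object makes it easy to pull out the statistic, degrees of freedom, and p-value when writing the conclusion.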
We reject the null hypothesis that age group and hypertension are not associated (\(p<2.2\cdot10^{-16}\)). There is sufficient evidence that age group and hypertension are associated.
Lesson 16 Slides