2024-04-03
Recognize the motivation for and focus of our course (categorical responses)
Recall features of the Binomial distribution for categorical data analysis and utilize the normal approximation
Determine if a single proportion differs from a population value
Determine if two proportions differ from each other
Display data from two categorical variables, each with 2 or more categories, using R X C contingency tables
Determine if a nominal response and nominal explanatory variable are associated with one another using the Chi-squared test
Determine if a nominal response and nominal explanatory variable are associated with one another using the Fisher Exact test
Determine if a binary, nominal response and an ordinal explanatory variable are associated with one another using the Cochran-Armitage test
Determine if an ordinal response and ordinal explanatory variable are associated with one another using the Mantel-Haenszel test
In BSTA 512/612 (linear regression), we focused on continuous responses/outcomes
We included categorical variables only as covariates (aka predictors, independent variables, explanatory variables)
Examples from 512/612: life expectancy (in years), IAT score (ranging from -2 to 2)
Categorical data analysis focuses on the statistical methods for categorical responses/outcomes
Strategies for assessing association between a categorical response variable and one explanatory variable
Statistical modeling strategies for assessing association between the categorical response variable and a set of explanatory variables
Logistic regression
Poisson regression
Consider a sample of \(n\) independent trials, each of which can have only two possible outcomes (“success” and “failure”)
For each trial: \[\begin{align} P( \text{success}) & = p \\ P( \text{failure}) & = 1- p = q \end{align}\]
Binomial distribution: distribution of the number of successes in n independent trials
The probability mass function for the binomial distribution is: \[P(X=k) = {n \choose k} p^k q^{n-k}, \text{ for } k = 0, 1, ..., n\]
R commands with their input and output:
R code | What does it return? |
---|---|
rbinom() | returns a sample of random variables with the specified binomial distribution |
dbinom() | returns the probability of getting a certain number of successes |
pbinom() | returns the cumulative probability of getting a certain number of successes or fewer |
qbinom() | returns the number of successes corresponding to the desired quantile |
Example
If the probability that one white blood cell is a lymphocyte is 0.2, compute the probability of 2 lymphocytes out of 10 white blood cells
\[P(X=2) = {10 \choose 2} 0.2^2 (1-0.2)^{10-2} = 0.3020\]
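The lymphocyte example can be reproduced with the binomial functions listed earlier. This is a minimal sketch: the seed and the number of simulated samples are arbitrary choices made for illustration and are not part of the original example.

```r
n <- 10   # white blood cells
p <- 0.2  # probability that one cell is a lymphocyte

# P(X = 2) written out from the probability mass function
p_formula <- choose(n, 2) * p^2 * (1 - p)^(n - 2)

# Same probability using dbinom()
p_dbinom <- dbinom(2, size = n, prob = p)

# P(X <= 2): cumulative probability of 2 or fewer lymphocytes
p_at_most_2 <- pbinom(2, size = n, prob = p)

# Median number of lymphocytes (the 0.5 quantile)
median_count <- qbinom(0.5, size = n, prob = p)

# Five simulated samples of 10 cells each (seed chosen arbitrarily)
set.seed(513)
simulated <- rbinom(5, size = n, prob = p)

round(c(formula = p_formula, dbinom = p_dbinom, cdf = p_at_most_2), 4)
```

The first two entries of the printed vector should agree with each other and with the hand calculation of 0.3020.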
Normal approximation of the binomial distribution (also known as: Sampling distribution of \(\widehat{p}\))
IF \(X\sim \text{Binomial}(n,p)\) and \(np>10\) and \(nq = n(1-p) > 10\)
THEN approximately \(X\sim \text{Normal}\big(\mu_X = np, \sigma_X = \sqrt{np(1-p)} \big)\)
Pretty good video behind the intuition of this (Watch 00:00 - 05:40)
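To see how close the approximation is, one can compare an exact binomial probability with its normal counterpart. This is only a sketch: the sample size of 100 cells and the cutoff of 25 are hypothetical values chosen so that the conditions hold, and the continuity correction (adding 0.5) is a refinement that the statement above does not mention.

```r
n <- 100  # hypothetical number of cells (assumption, not from the example)
p <- 0.2  # probability of success, as in the lymphocyte example

# Rule of thumb from above: np > 10 and n(1 - p) > 10
conditions_met <- (n * p > 10) && (n * (1 - p) > 10)

# Mean and standard deviation of the approximating normal distribution
mu_X <- n * p
sigma_X <- sqrt(n * p * (1 - p))

# Exact binomial probability P(X <= 25)
exact <- pbinom(25, size = n, prob = p)

# Normal approximation; the + 0.5 is a continuity correction (added refinement)
approx <- pnorm(25 + 0.5, mean = mu_X, sd = sigma_X)

c(exact = exact, approx = approx)
```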
\[ \widehat{p} = \dfrac{\# \text{successes}}{\# \text{successes} + \# \text{failures}} \]
Use the sampling distribution of \(\widehat{p}\) to construct the confidence interval:
\[\begin{align} \widehat{p} &\pm z^*_{(1-\alpha/2)} \cdot SE_{\hat{p}} \\ \widehat{p} &\pm z^*_{(1-\alpha/2)} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \end{align}\]
Smoking status example
A cross-sectional study of 8681 patients was conducted to evaluate the nature of smoking status among people. Of the 8681 people, 4840 were nonsmokers and 3841 were smokers.
Needed steps:
\[ \widehat{p} \pm z^*_{0.975} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]
1-sample proportions test with continuity correction
data: 3841 out of 8681, null probability 0.5
X-squared = 114.73, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.4319827 0.4529896
sample estimates:
p
0.4424605
The estimated proportion of smokers is 0.442 (95% CI: 0.432, 0.453).
Additional interpretation of CI: We are 95% confident that the (population) proportion of smokers is between 0.432 and 0.453.
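The interval for the smoking example can be sketched in two ways: directly from the formula, and with `prop.test()`. Note that `prop.test()` builds a score-based interval with a continuity correction rather than the Wald interval in the formula, so the limits differ slightly in later decimal places. The variable names below are chosen for illustration.

```r
x <- 3841  # number of smokers
n <- 8681  # total number of patients
p_hat <- x / n

# Wald interval from the formula: p_hat +/- z* x SE
z_star <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- p_hat + c(-1, 1) * z_star * se

# prop.test() defaults: null probability 0.5, continuity correction applied
test_result <- prop.test(x = x, n = n)

round(wald_ci, 3)
round(test_result$conf.int, 3)
```

Both intervals round to (0.432, 0.453) here because the sample is large.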
Use the sampling distribution of \(\widehat{p}_1\) and \(\widehat{p}_2\) to construct the confidence interval:
\[\begin{align} \widehat{p}_1 - \widehat{p}_2 &\pm z^*_{(1-\alpha/2)} \cdot SE_{\hat{p}_1 - \hat{p}_2} \\ \widehat{p}_1 - \widehat{p}_2 &\pm z^*_{(1-\alpha/2)} \cdot \sqrt{ \frac{\hat{p}_1\cdot(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2\cdot(1-\hat{p}_2)}{n_2}} \end{align}\]
The Strong Heart Study is an ongoing study of American Indians residing in 13 tribal communities in three geographic areas (AZ, OK, and SD/ND) to study prevalence and incidence of cardiovascular disease and to identify risk factors. We will be examining the 4-year cumulative incidence of diabetes with one risk factor, glucose tolerance.
Impaired glucose: normal or impaired glucose tolerance at baseline visit (between 1988 and 1991)
Diabetes: Indicator of diabetes at follow-up visit (roughly four years after baseline) according to two-hour oral glucose tolerance test
There is a total of 1664 American Indians in the dataset, with the following distribution of folks with diabetes and glucose tolerance:
Glucose tolerance | Not diabetic | Diabetic | Total |
---|---|---|---|
Impaired | 334 | 198 | 532 |
Normal | 1004 | 128 | 1132 |
Total | 1338 | 326 | 1664 |
Strong Heart Study
What is the difference in the proportion of American Indians who develop diabetes, comparing individuals with impaired vs. normal glucose tolerance?
Needed steps:
Estimate the difference in proportions \[ \widehat{p}_1 -\widehat{p}_2 = \dfrac{198}{532} - \dfrac{128}{1132} = 0.2591\]
Check that each cell has at least 10 individuals
2-sample test for equality of proportions with continuity correction
data: table(SHS$Glucose, SHS$Diabetes)
X-squared = 152.6, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
0.2126963 0.3055162
sample estimates:
prop 1 prop 2
0.3721805 0.1130742
The estimated difference in the proportion of American Indians who developed diabetes, comparing those with impaired glucose tolerance to those with normal glucose tolerance, is 0.259 (95% CI: 0.213, 0.306).
Additional interpretation of CI: We are 95% confident that the difference in (population) proportions who developed diabetes, comparing American Indians with impaired glucose tolerance to those with normal glucose tolerance, is between 0.213 and 0.306.
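The steps for the Strong Heart Study can be sketched as follows. The `SHS` data frame is not available here, so the 2 x 2 table is rebuilt from the published counts; `SHS_table` and the other variable names are assumptions made for illustration. `prop.test()` treats the first column as successes and applies a continuity correction, which is why its interval is slightly wider than the Wald interval from the formula.

```r
# Rebuild the 2 x 2 table from the counts (first column = "successes")
SHS_table <- matrix(c(198, 334,
                      128, 1004),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Glucose = c("Impaired", "Normal"),
                                    Diabetes = c("Diabetic", "Not diabetic")))

# Estimated proportions and their difference (impaired minus normal)
n_group <- rowSums(SHS_table)
p_hat <- SHS_table[, "Diabetic"] / n_group
diff_hat <- unname(p_hat["Impaired"] - p_hat["Normal"])

# Check that each cell has at least 10 individuals
cells_ok <- all(SHS_table >= 10)

# Wald interval from the formula
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n_group))
wald_ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_diff

# Interval from prop.test() (continuity correction by default)
test_result <- prop.test(SHS_table)

round(diff_hat, 4)
round(test_result$conf.int, 3)
```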
What is a matched-pairs study?
McNemar's test is the categorical test that parallels the “paired t-test”
R packages and functions
mcnemar.test() in the built-in stats package
mcnemar.exact() in the exact2x2 package
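As a minimal sketch of `mcnemar.test()` from the built-in stats package, consider a matched-pairs table in which each pair contributes one count. The counts and the "Before"/"After" labels below are made up purely for illustration and do not come from any study in these notes.

```r
# Hypothetical matched-pairs counts (made-up numbers for illustration only)
paired_counts <- matrix(c(30, 15,
                           5, 50),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Before = c("Yes", "No"),
                                        After = c("Yes", "No")))

# McNemar's test uses only the discordant pairs (off-diagonal cells)
mcnemar_result <- mcnemar.test(paired_counts)

mcnemar_result$statistic  # chi-squared statistic with continuity correction
mcnemar_result$p.value
```

With the default continuity correction the statistic is \((|15 - 5| - 1)^2 / (15 + 5)\), which depends only on the two discordant cells.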
Can we expand this to ask a more general question about association between a response and explanatory variable?
R X C contingency tables
Contains information for two discrete variables: one has R categories and the other has C categories.
Refers to the number of rows (R) and number of columns (C) in the table
For two proportions, we focused on 2 X 2 contingency tables
We now expand our contingency tables to variables with 2 or more categories
Let’s say we are interested in learning the association between the development of breast cancer and age at first birth. Our first step is typically to present the observed data:
If both variables are nominal, a test of general association will be sufficient
Test of general association is the same regardless of R and C
The test used for a 2x2 contingency table is the same as the test used for a 5x3 contingency table
We will cover: the Chi-squared test and the Fisher Exact test
If one or both variables are ordinal, a test of trend may be of interest
Treats ordinal variables as quantitative rather than qualitative (nominal scale)
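When the association follows a monotone trend, the test of trend has greater power than the test of general association
We will cover: the Cochran-Armitage test and the Mantel-Haenszel test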
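Translated to a hypothesis test:
\(H_0\): The response and explanatory variables are not associated
\(H_A\): The response and explanatory variables are associated
We have two options for testing general association: the Chi-squared test and the Fisher Exact test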
Main question: Do American Indians with impaired glucose tolerance have a different incidence of diabetes?
We have two variables, and both variables have two nominal categories
Answer research question with a test of general association
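Hypothesis:
\(H_0\): Glucose tolerance and diabetes are not associated
\(H_A\): Glucose tolerance and diabetes are associated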
Requirements to conduct Chi-squared test (expected cell counts)
For 2 x 2 contingency table:
For contingency table with 3x2, 3x3, 4x4, etc.:
Check that the expected cell counts threshold is met
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic and p-value for Chi-squared test in R
Write a conclusion to the hypothesis test
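Use the expected() function in the epitools package to get the expected cell counts
Observed counts:
Glucose | Diabetic | Not diabetic |
---|---|---|
Impaired | 198 | 334 |
Normal | 128 | 1004 |
Expected counts:
Glucose | Diabetic | Not diabetic |
---|---|---|
Impaired | 104.226 | 427.774 |
Normal | 221.774 | 910.226 |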
All expected cell counts are greater than 5.
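The expected counts can also be computed in base R without the epitools package, since each expected count is (row total x column total) / grand total. The sketch below rebuilds the table from the published counts; `SHS_table` and `expected_counts` are names chosen for illustration.

```r
# Observed counts from the Strong Heart Study table
SHS_table <- matrix(c(198, 334,
                      128, 1004),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Glucose = c("Impaired", "Normal"),
                                    Diabetes = c("Diabetic", "Not diabetic")))

# Expected count in each cell: row total * column total / grand total
expected_counts <- outer(rowSums(SHS_table), colSums(SHS_table)) / sum(SHS_table)

round(expected_counts, 3)

# Check the expected cell count requirement
all(expected_counts > 5)
```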
\(\alpha = 0.05\)
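Hypothesis test:
\(H_0\): Glucose tolerance and diabetes are not associated
\(H_A\): Glucose tolerance and diabetes are associated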
Pearson's Chi-squared test with Yates' continuity correction
data: SHS_table
X-squared = 152.6, df = 1, p-value < 2.2e-16
We reject the null hypothesis that glucose tolerance and diabetes are not associated (\(p<2.2\cdot10^{-16}\)). There is sufficient evidence that glucose tolerance and diabetes incidence are associated among American Indians.
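The test statistic and p-value shown above can be reproduced with `chisq.test()`, which applies Yates' continuity correction to 2 x 2 tables by default. The table is rebuilt from the published counts, and the object names are assumptions made for illustration.

```r
SHS_table <- matrix(c(198, 334,
                      128, 1004),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Glucose = c("Impaired", "Normal"),
                                    Diabetes = c("Diabetic", "Not diabetic")))

alpha <- 0.05

# Pearson's Chi-squared test (Yates' correction is the default for 2 x 2)
chisq_result <- chisq.test(SHS_table)

chisq_result$statistic
chisq_result$parameter  # degrees of freedom: (R - 1) * (C - 1)
chisq_result$p.value

# Decision at the chosen significance level
reject_null <- chisq_result$p.value < alpha
reject_null
```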
\[P(a, b, c, d) = \dfrac{(a+b)!\cdot(c+d)!\cdot(a+c)!\cdot(b+d)!}{n!\cdot a!\cdot b!\cdot c!\cdot d!}\]
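The formula above is the hypergeometric probability of one particular 2 x 2 table with fixed margins. As a sketch, assume \(a, b\) are the cell counts in the first row, \(c, d\) the counts in the second row, and \(n\) their sum (this labeling is an assumption, since the notes do not define it). The factorials are evaluated on the log scale because they overflow for counts as large as those in the Strong Heart Study.

```r
# Cell counts of the Strong Heart Study table (assumed labeling a, b / c, d)
cell_a <- 198; cell_b <- 334
cell_c <- 128; cell_d <- 1004
n_total <- cell_a + cell_b + cell_c + cell_d

# log P(a, b, c, d) from the factorial formula
log_p_formula <- lfactorial(cell_a + cell_b) + lfactorial(cell_c + cell_d) +
  lfactorial(cell_a + cell_c) + lfactorial(cell_b + cell_d) -
  lfactorial(n_total) -
  lfactorial(cell_a) - lfactorial(cell_b) - lfactorial(cell_c) - lfactorial(cell_d)

# Same quantity from the hypergeometric distribution:
# drawing (a + c) individuals from (a + b) "impaired" and (c + d) "normal"
log_p_hyper <- dhyper(cell_a, m = cell_a + cell_b, n = cell_c + cell_d,
                      k = cell_a + cell_c, log = TRUE)

c(formula = log_p_formula, dhyper = log_p_hyper)
```

The Fisher Exact test p-value sums such probabilities over all tables with the same margins that are at least as extreme as the observed one.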
Check the expected cell counts
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic and p-value for Fisher Exact test in R
Write a conclusion to the hypothesis test
We’re going to pretend they are less than 5.
\(\alpha = 0.05\)
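Hypothesis test:
\(H_0\): Glucose tolerance and diabetes are not associated
\(H_A\): Glucose tolerance and diabetes are associated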
Fisher's Exact Test for Count Data
data: SHS_table
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
3.576595 6.048639
sample estimates:
odds ratio
4.644825
We reject the null hypothesis that glucose tolerance and diabetes are not associated (\(p<2.2\cdot10^{-16}\)). There is sufficient evidence that glucose tolerance and diabetes incidence are associated among American Indians.
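The output above can be reproduced with `fisher.test()` from the built-in stats package. The table is rebuilt from the published counts, and the object names are assumptions made for illustration. For 2 x 2 tables the function also reports a conditional maximum likelihood estimate of the odds ratio, which is why it differs slightly from the sample odds ratio.

```r
SHS_table <- matrix(c(198, 334,
                      128, 1004),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Glucose = c("Impaired", "Normal"),
                                    Diabetes = c("Diabetic", "Not diabetic")))

alpha <- 0.05

# Fisher's Exact test for a 2 x 2 table
fisher_result <- fisher.test(SHS_table)

fisher_result$p.value
fisher_result$estimate  # conditional MLE of the odds ratio
fisher_result$conf.int

# Sample (cross-product) odds ratio, for comparison
sample_or <- (198 * 1004) / (334 * 128)

reject_null <- fisher_result$p.value < alpha
```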
If one or both variables are ordinal, a test of trend may be of interest
Two tests of trend that we will learn:
Cochran-Armitage test
Mantel-Haenszel test
The Cochran-Armitage test assesses the trend of the proportions across the levels of the ordinal variable
Null Hypothesis (\(H_0\))
The proportions of successes are the same across all C ordinal values of the explanatory variable. \[p_1 = p_2 = ... = p_C\]
Alternative Hypothesis (\(H_A\))
The proportions of successes tend to increase as the ordinal value of the explanatory variable increases
\[p_1 \leq p_2 \leq ... \leq p_C\]
OR
The proportions of successes tend to decrease as ordinal value of the explanatory variable increases
\[p_1 \geq p_2 \geq ... \geq p_C\]
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic and p-value for Cochran-Armitage test in R
Write a conclusion to the hypothesis test
We are interested in learning the association between the development of breast cancer and age at first birth among people who have given birth
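# Counts by age at first birth (<20, 20-24, 25-29, 30-34, >=35)
Cancer <- c(320, 1206, 1011, 463, 220)
No_Cancer <- c(1422, 4432, 2893, 1092, 406)
# Build a 2 x 5 table: one row per outcome, one column per age group
bscancer <- matrix(c(Cancer, No_Cancer), nrow = 2, byrow = TRUE)
rownames(bscancer) <- c("Cancer", "No Cancer")
colnames(bscancer) <- c("<20", "20-24", "25-29", "30-34", ">=35")
bscancer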
<20 20-24 25-29 30-34 >=35
Cancer 320 1206 1011 463 220
No Cancer 1422 4432 2893 1092 406
\(\alpha = 0.05\)
Hypothesis test:
\(H_0\): \(p_1 = p_2 = ... = p_5\)
\(H_A\): \(p_1 \leq p_2 \leq ... \leq p_5\)
Cochran-Armitage test for trend
data: bscancer
Z = 11.358, dim = 5, p-value < 2.2e-16
alternative hypothesis: two.sided
We reject the null hypothesis that the proportions of breast cancer are the same for all levels of age at first birth (\(p<2.2\cdot10^{-16}\)). There is sufficient evidence that the proportion of breast cancer increases as the age at first birth increases.
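The output above appears to come from a Cochran-Armitage function in an add-on package that the notes do not name. As a base-R sketch, `prop.trend.test()` from the stats package performs a chi-squared test for trend in proportions; with equally spaced scores 1 to 5 (an assumption about how the age groups are scored), the square root of its statistic should match the Z value reported above.

```r
# Counts by age at first birth (<20, 20-24, 25-29, 30-34, >=35)
Cancer <- c(320, 1206, 1011, 463, 220)
No_Cancer <- c(1422, 4432, 2893, 1092, 406)

alpha <- 0.05

# Observed proportion with breast cancer in each age group
prop_cancer <- Cancer / (Cancer + No_Cancer)
round(prop_cancer, 3)

# Chi-squared test for trend in proportions (1 degree of freedom)
trend_result <- prop.trend.test(x = Cancer, n = Cancer + No_Cancer, score = 1:5)

# Square root of the chi-squared statistic, comparable to the Z statistic
z_equivalent <- sqrt(unname(trend_result$statistic))
z_equivalent

reject_null <- trend_result$p.value < alpha
```

Printing the observed proportions alongside the test helps confirm the direction of the trend, since the chi-squared statistic itself is unsigned.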
When both variables are ordinal, we can conduct the Mantel-Haenszel test of trend for linear association
The Mantel-Haenszel test for linear trend is suitable for any R x C contingency table with two ordinal variables
Hypothesis test:
Null Hypothesis (\(H_0\))
There is no correlation between the two variables \[ \rho = 0\]
Alternative Hypothesis (\(H_A\))
There is a correlation between the two variables
\[ \rho \neq 0\]
Set the level of significance \(\alpha\)
Specify the null ( \(H_0\) ) and alternative ( \(H_A\) ) hypotheses
Calculate the test statistic and p-value for Mantel-Haenszel test in R
Write a conclusion to the hypothesis test
A water treatment company is studying water additives and investigating how they affect clothes washing (through measurements of abrasions, wearing, and color loss).
The treatments studied were no treatment (plain water), the standard treatment, and a double dose of the standard treatment, called super. Washability was measured as low, medium, and high.
Are levels of washability associated with treatment?
# Rows: treatment (ordinal); columns: washability (ordinal)
water <- matrix(c(27, 14, 5, 10, 17, 26, 5, 12, 50), nrow = 3, byrow = TRUE)
rownames(water) <- c("plain", "standard", "super")
colnames(water) <- c("low", "medium", "high")
water
low medium high
plain 27 14 5
standard 10 17 26
super 5 12 50
\(\alpha = 0.05\)
Hypothesis test:
\(H_0\): \(\rho = 0\)
\(H_A\): \(\rho \neq 0\)
Mantel-Haenszel Chi-Square
data: water
X-squared = 50.602, df = 1, p-value = 1.132e-12
We reject the null hypothesis that there is no correlation between washability and water treatment (\(p = 1.13 \cdot 10^{-12} < 0.05\)). There is sufficient evidence that level of water treatment is associated with washability.
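The statistic above can be checked by hand in base R. The sketch below assumes the usual form of the Mantel-Haenszel statistic for linear trend, \(M^2 = (n-1)r^2\), where \(r\) is the Pearson correlation between row and column scores, and it assumes equally spaced scores 1, 2, 3 for both variables. It is a hand calculation for illustration, not the function that produced the output above.

```r
water <- matrix(c(27, 14, 5, 10, 17, 26, 5, 12, 50), nrow = 3, byrow = TRUE,
                dimnames = list(c("plain", "standard", "super"),
                                c("low", "medium", "high")))

# Assumed scores: 1, 2, 3 for treatment (rows) and washability (columns)
row_score <- as.vector(row(water))
col_score <- as.vector(col(water))
counts <- as.vector(water)

# Expand the table to one (row score, column score) pair per observation
x <- rep(row_score, times = counts)
y <- rep(col_score, times = counts)

# Mantel-Haenszel statistic: (n - 1) * r^2, chi-squared with 1 df
n <- sum(water)
r <- cor(x, y)
M2 <- (n - 1) * r^2
p_value <- pchisq(M2, df = 1, lower.tail = FALSE)

c(r = r, M2 = M2, p_value = p_value)
```

The positive correlation indicates that higher treatment levels go with higher washability, which the chi-squared statistic alone does not show.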
For a refresher or review of one proportion and differences in proportions
And their power calculations
From Meike’s BSTA 511 course (see Day 12!)
For a refresher or review of Chi-squared test or Fisher’s Exact test
Lesson 2: Introduction to Categorical Data Analysis