Lesson 7: Prediction and Visualization in Simple Logistic Regression

Nicky Wakim

2024-04-22

Learning Objectives

  1. Make transformation between logistic regression and estimated/predicted probability.

  2. Construct confidence interval for predicted probability.

  3. Visualize the predicted probability (and its confidence intervals).

Recall our example: Late stage breast cancer diagnosis

  • Recall that we fitted a simple logistic regression for late stage breast cancer diagnosis using the predictor, age:
bc_reg = glm(Late_stage_diag ~ Age_c, data = bc, family = binomial)
tidy(bc_reg, conf.int=T) %>% gt() %>% tab_options(table.font.size = 38) %>%
  fmt_number(decimals = 3)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) −0.989 0.023 −42.637 0.000 −1.035 −0.944
Age_c 0.057 0.003 17.780 0.000 0.051 0.063

 

  • Fitted logistic regression model: \[\text{logit}(\widehat{\pi}(Age)) = -0.989 + 0.057 \cdot Age\]

Recall our example: Late stage breast cancer diagnosis

  • Fitted logistic regression model: \[\text{logit}(\widehat{\pi}(Age)) = -0.989 + 0.057 \cdot Age\]

     

  • Now we want to caclulate the predicted/estimated probability from the above fitted model

  • We will need to calculate the predicted probability and its confidence interval

    • Then we will visualize the fitted probability

Learning Objectives

  1. Make transformation between logistic regression and estimated/predicted probability.
  1. Construct confidence interval for predicted probability.

  2. Visualize the predicted probability (and its confidence intervals).

Predicted Probability

  • We may be interested in predicting probability of having a late stage breast cancer diagnosis for a specific age.

  • The predicted probability is the estimated probability of having the event for given values of covariate(s)

  • In simple logistic regression, the fitted model is:\[\text{logit}(\widehat{\pi}(X)) = \hat{\beta}_0 +{\hat{\beta}}_1X \]

  • We can convert it to the predicted probability: \[\hat{\pi}\left(X\right)=\frac{\exp({\hat{\beta}}_0+{\hat{\beta}}_1X)}{1+\exp({\hat{\beta}}_0+{\hat{\beta}}_1X)}\]

    • This is an inverse logit calculation
  • We can calculate this using the the predict() function like in BSTA 512

    • Another option: taking inverse logit of fitted values from augment() function

Reference: Inverse logit

  • If we have \(\text{logit}(a) = b\), then \[\begin{aligned} \text{logit}(a) & = b \\ \text{log}\left(\dfrac{a}{1-a}\right) & = b \\ \exp \left[ \text{log}\left(\dfrac{a}{1-a}\right) \right] & = \exp[b] \\ \dfrac{a}{1-a} & = \exp[b] \\ a & = \exp[b]\cdot(1-a) \\ a & = \exp[b] - a\cdot \exp[b] \\ a + a\cdot \exp[b]& = \exp[b] \\ a\cdot ( 1 + \exp[b] )& = \exp[b] \\ a& = \dfrac{\exp[b]}{1 + \exp[b]} \\ \end{aligned}\]

Learning Objectives

  1. Make transformation between logistic regression and estimated/predicted probability.
  1. Construct confidence interval for predicted probability.
  1. Visualize the predicted probability (and its confidence intervals).

Confidence Interval of Predicted Probability

  • Not as easy to construct
  • I have searched around for a function that does this for us, but I cannot find one
  • So we have to construct the confidence interval “by hand”

 

There are a two ways to do this:

  1. Construct the 95% confidence interval in the logit scale, then convert to probability scale
  2. Use Normal approximation (if appropriate) to construct confidence interval in probability scale

Option 1: 95% confidence interval in logit scale (1/2)

  • Recall our our fitted simple logistic regression model with a continuous predictor \[\text{logit}(\widehat{\pi}(X)) = \widehat{\beta}_0 + \widehat{\beta}_1 \cdot X\]

  • We can first find the predicted \(\text{logit}(\widehat{\pi}(X))\) and then find the 95% confidence interval around it: \[\text{logit}(\widehat{\pi}(X)) \pm 1.96 \cdot SE_{\text{logit}(\widehat{\pi}(X))}\]

  • We’ll call this 95% CI: \[\left(\text{logit}(\widehat{\pi}(X)) - 1.96 \cdot SE_{\text{logit}(\widehat{\pi}(X))}, \ \text{logit}(\widehat{\pi}(X)) + 1.96 \cdot SE_{\text{logit}(\widehat{\pi}(X))} \right)\] \[\left(\text{logit}_{L}, \ \text{logit}_{U} \right)\]

Option 1: 95% confidence interval in logit scale (2/2)

  • Then we need to convert to the probability scale

  • To convert from \(\text{logit}(\widehat{\pi}(X))\) to \(\widehat{\pi}(X)\), we take the inverse logit

  • Thus, 95% CI in the probability scale is: \[\left(\dfrac{\exp\left[\text{logit}_{L}\right]}{1 + \exp\left[\text{logit}_{L}\right]}, \ \dfrac{\exp\left[\text{logit}_{U}\right]}{1 + \exp\left[\text{logit}_{U}\right]} \right)\]

Option 2: Using Normal approximation

  • If we meet the Normal approximation criteria, we can construct our confidence interval directly in the probability scale

    • We can use the Normal approximation if:

      • \(\widehat{p}n = \widehat{\pi}(X)\cdot n > 10\) and
      • \((1-\widehat{p})n = (1-\widehat{\pi}(X))\cdot n > 10\)

 

  • We can first find the predicted \(\widehat{\pi}(X)\) and then find the 95% confidence interval around it: \[\widehat{\pi}(X) \pm 1.96 \cdot SE_{\widehat{\pi}(X)}\]

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

Needed steps:

  1. Calculate probability prediction

  2. Check if we can use Normal approximation

  3. Calculate confidence interval

    1. Using logit scale then converting
    2. Using Normal approximation
  4. Interpret results

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

  1. Calculate probability prediction
bc_reg = glm(Late_stage_diag ~ Age_c, data = bc, family = binomial)
newdata = data.frame(Age_c = 60 - mean_age)
pred1 = predict(bc_reg, newdata = newdata, se.fit = T, type = "response")
pred1
$fit
        1 
0.2522616 

$se.fit
          1 
0.004709743 

$residual.scale
[1] 1

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

  1. Check if we can use Normal approximation

We can use the Normal approximation if: \(\widehat{p}n = \widehat{\pi}(X)\cdot n > 10\) and \((1-\widehat{p})n = (1-\widehat{\pi}(X))\cdot n > 10\).

n = nobs(bc_reg)
p = pred1$fit
n*p
       1 
2522.616 
n*(1-p)
       1 
7477.384 

We can use the Normal approximation!

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

3a. Calculate confidence interval (Option 1: logit scale, we could skip previous step)

pred1 = predict(bc_reg, newdata = newdata, se.fit = T, type = "link")
LL_CI1 = pred1$fit - qnorm(1-0.05/2) * pred1$se.fit
UL_CI1 = pred1$fit + qnorm(1-0.05/2) * pred1$se.fit
pred_link = c(Pred = pred1$fit, LL_CI1, UL_CI1)

(exp(pred_link)/(1+exp(pred_link))) %>% round(., digits=3)
Pred.1      1      1 
 0.252  0.243  0.262 
inv.logit(pred_link) %>% round(., digits=3)
Pred.1      1      1 
 0.252  0.243  0.262 

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

3b. Calculate confidence interval (Option 2: with Normal approximation)

pred = predict(bc_reg, newdata = newdata, se.fit = T, type = "response")

LL_CI = pred$fit - qnorm(1-0.05/2) * pred$se.fit
UL_CI = pred$fit + qnorm(1-0.05/2) * pred$se.fit

c(Pred = pred$fit, LL_CI, UL_CI) %>% round(digits=3)
Pred.1      1      1 
 0.252  0.243  0.261 

Example: Late stage breast cancer diagnosis

Predicting probability of late stage breast cancer diagnosis

For someone 50 years old, what is the predicted probability for late stage breast cancer diagnosis (with confidence intervals)?

  1. Interpret results

For someone who is 60 years old, the predicted probability of late stage breast cancer diagnosis is 0.252 (95% CI: 0.243, 0.261).

Predicted/Estimated probability

  • Predicted probability is NOT our predicted outcome

    • We cannot interpret it as the predicted \(Y\) for individuals with certain covariate values

    • Example: our predicted probability does not tell us that one individual will or will not be diagnosed with late stage breast cancer

 

  • The predicted probability is the estimate of the mean (i.e., proportion) of individuals at a certain age who are diagnosed with late stage breast cancer

  • We can use the predicted/estimated probability to predict the outcome

Predicted outcome

  • Typically, the predicted probability is the most important thing to use in a clinical setting

 

  • If you ever need to predict the outcome itself (from logistic regression with binary outcome):

    • Remember that the predicted probability can be used in a Bernoulli (or Binomial with \(n=1\)) distribution to find the predicted outcome
  • If outcome is something like counts, then we would use a Poisson distribution

 

  • By putting it back through a Bernoulli/binomial distribution, we are re-introducing the random component of our observed outcome
set.seed(8392)
rbinom(n=1, size=1, prob = pred$fit)
[1] 0
rbinom(n=10, size=1, prob = pred$fit)
 [1] 0 0 0 0 1 0 0 0 1 0

Learning Objectives

  1. Make transformation between logistic regression and estimated/predicted probability.

  2. Construct confidence interval for predicted probability.

  1. Visualize the predicted probability (and its confidence intervals).

We can also make a plot of all the predicted probabilities (1/2)

bc_reg = glm(Late_stage_diag ~ Age_c, data = bc, family = binomial)
bc_aug = augment(bc_reg)
  • Then we plot the fitted values from the fitted model
library(boot) # for inv.logit()
prob_stage = ggplot(data = bc_aug, aes(x=Age_c, y = inv.logit(.fitted))) + 
  # geom_point(size = 4, color = "#70AD47", shape = 1) +
  geom_smooth(size = 4, color = "#70AD47") +
  labs(x = "Age centered (yrs)", 
       y = "Estimated probability of \n Late stage BC diagnosis")  + 
  theme_classic() +
  theme(axis.title = element_text(size = 30), 
        axis.text = element_text(size = 25), 
        title = element_text(size = 30)) +
  ylim(0, 1)

We can also make a plot of all the predicted probabilities (2/2)

  • If we are interested in seeing all the predicted probabilities across the sample’s age range

  • Note that the probabilities do not need to fill the full range of 0 to 1.

We can add the confidence intervals (1/3)

newdata2 = data.frame(Age_c = seq(min(bc$Age_c), max(bc$Age_c), by = 0.1))
pred2 = predict(bc_reg, newdata = newdata2, se.fit = T, type = "link")
LL_CI1 = pred2$fit - qnorm(1-0.05/2) * pred2$se.fit
UL_CI1 = pred2$fit + qnorm(1-0.05/2) * pred2$se.fit

with_CI = data.frame(Age_c = newdata2$Age_c, 
                     pred = inv.logit(pred2$fit), 
                     LL = inv.logit(LL_CI1), 
                     UL = inv.logit(UL_CI1))

We can add the confidence intervals (2/3)

prob_stage_CI = ggplot(data = with_CI, aes(x = Age_c)) +
  geom_ribbon(aes(ymin = LL, ymax = UL), fill = "grey") +
  geom_smooth(aes(x=Age_c, y = pred), size = 1, color = "#70AD47") +
  labs(x = "Age centered (yrs)", 
       y = "Estimated probability of \n Late stage BC diagnosis")  + 
  theme_classic() +
  theme(axis.title = element_text(size = 30), 
        axis.text = element_text(size = 25), 
        title = element_text(size = 30)) +
  ylim(0, 0.6)

We can add the confidence intervals (3/3)

Poll Everywhere Question

Visualization of observed outcome and fitted model

\[\text{logit}(\widehat{\pi}(Age)) = -0.989 + 0.057 \cdot Age\]

\[\widehat{\pi}(Age) = \dfrac{ \exp \left[-0.989 + 0.057 \cdot Age \right]}{1+\exp \left[-0.989 + 0.057 \cdot Age \right]}\]

Visualization of odds ratios?

  • We will discuss this more on Wednesday when we look at interpretations of ORs

Learning Objectives

  1. Make transformation between logistic regression and estimated/predicted probability.

  2. Construct confidence interval for predicted probability.

  3. Visualize the predicted probability (and its confidence intervals).