-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlogistic_regression.R
121 lines (94 loc) · 5.59 KB
/
logistic_regression.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
## Regression with binary outcomes
## ═════════════════════════════════
setwd("C:/Users/greed/OneDrive/Documents/Springboard/logistic_regression/logistic_regression")
## Logistic regression
## ───────────────────────
## This far we have used the `lm' function to fit our regression models.
## `lm' is great, but limited–in particular it only fits models for
## continuous dependent variables. For categorical dependent variables we
## can use the `glm()' function.
## For these models we will use a different dataset, drawn from the
## National Health Interview Survey. From the [CDC website]:
## The National Health Interview Survey (NHIS) has monitored
## the health of the nation since 1957. NHIS data on a broad
## range of health topics are collected through personal
## household interviews. For over 50 years, the U.S. Census
## Bureau has been the data collection agent for the National
## Health Interview Survey. Survey results have been
## instrumental in providing data to track health status,
## health care access, and progress toward achieving national
## health objectives.
## Load the National Health Interview Survey data:
NH11 <- readRDS("dataSets/NatHealth2011.rds")
labs <- attributes(NH11)$labels
## [CDC website] http://www.cdc.gov/nchs/nhis.htm
## Logistic regression example
## ───────────────────────────────
## Let's predict the probability of being diagnosed with hypertension
## based on age, sex, sleep, and bmi
str(NH11$hypev) # check stucture of hypev
levels(NH11$hypev) # check levels of hypev
# collapse all missing values to NA
NH11$hypev <- factor(NH11$hypev, levels=c("2 No", "1 Yes"))
# run our regression model
hyp.out <- glm(hypev~age_p+sex+sleep+bmi,
data=NH11, family="binomial")
coef(summary(hyp.out))
## Logistic regression coefficients
## ────────────────────────────────────
## Generalized linear models use link functions, so raw coefficients are
## difficult to interpret. For example, the age coefficient of .06 in the
## previous model tells us that for every one unit increase in age, the
## log odds of hypertension diagnosis increases by 0.06. Since most of us
## are not used to thinking in log odds this is not too helpful!
## One solution is to transform the coefficients to make them easier to
## interpret
hyp.out.tab <- coef(summary(hyp.out))
hyp.out.tab[, "Estimate"] <- exp(coef(hyp.out))
hyp.out.tab
## Generating predicted values
## ───────────────────────────────
## In addition to transforming the log-odds produced by `glm' to odds, we
## can use the `predict()' function to make direct statements about the
## predictors in our model. For example, we can ask "How much more likely
## is a 63 year old female to have hypertension compared to a 33 year old
## female?".
# Create a dataset with predictors set at desired levels
predDat <- with(NH11,
expand.grid(age_p = c(33, 63),
sex = "2 Female",
bmi = mean(bmi, na.rm = TRUE),
sleep = mean(sleep, na.rm = TRUE)))
# predict hypertension at those levels
cbind(predDat, predict(hyp.out, type = "response",
se.fit = TRUE, interval="confidence",
newdata = predDat))
## This tells us that a 33 year old female has a 13% probability of
## having been diagnosed with hypertension, while and 63 year old female
## has a 48% probability of having been diagnosed.
## Packages for computing and graphing predicted values
## ─────────────────────────────────────────────────────────
## Instead of doing all this ourselves, we can use the effects package to
## compute quantities of interest for us (cf. the Zelig package).
library(effects)
plot(allEffects(hyp.out))
## Exercise: logistic regression
## ───────────────────────────────────
## Use the NH11 data set that we loaded earlier.
## 1. Use glm to conduct a logistic regression to predict ever worked
## (everwrk) using age (age_p) and marital status (r_maritl).
worked.out <- glm(everwrk~age_p+r_maritl,
data=NH11, family="binomial")
summary(worked.out)
## 2. Predict the probability of working for each level of marital
## status.
predmar <- with(NH11, expand.grid(r_maritl = c("1 Married - spouse in household",
"2 Married - spouse not in household",
"4 Widowed", "5 Divorced", "6 Separated",
"7 Never married", "8 Living with partner",
"9 Unknown marital status"),
age_p = mean(age_p, na.rm = TRUE)))
cbind(predmar, predict(worked.out, type = "response", se.fit = TRUE, interval = "confidence", newdata = predmar))
## Note that the data is not perfectly clean and ready to be modeled. You
## will need to clean up at least some of the variables before fitting
## the model.