---
title: "Inhale, Exhale, Analyze: BMI's Imprint on Impulse Oscillometry Outcomes"
subtitle: UWF STA 6257 Capstone Project on Linear Mixed Models (LMMs)
format:
  clean-revealjs:
    self-contained: true
    preview-links: true
    slide-number: false
    code-line-numbers: true
    logo: images/logo.png
    css: styles.css
author:
  - name: Joshua J. Cook, M.S., ACRP-PM, CCRC
    orcid: 0000-0003-3508-7065
    email: [email protected]
  - name: Syed Ahzaz H. Shah, B.S.
    email: [email protected]
  - name: Jacob Hernandez, B.S.
    email: [email protected]
  - name: Sara Basili, M.S.
    email: [email protected]
date: last-modified
bibliography: references.bib
csl: asa.csl
---
```{r}
#| include: false
# Install any packages that are missing, then load everything.
# requireNamespace() checks one package at a time, so test each name separately.
pkgs <- c("tidyverse", "lme4", "nlme", "Matrix", "gt", "RefManageR",
          "DataExplorer", "gtsummary", "car", "reshape2")
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
library(tidyverse)
library(lme4)
library(nlme)
library(gt)
library(gtsummary)
library(RefManageR)
library(DataExplorer)
library(Matrix)
library(car)
library(reshape2)
```
# Introduction to Linear Mixed Models (LMMs) {background-color="#40666e"}
## Introduction
### Understanding Linear Mixed-Effects Models (LMMs)
- **Linear mixed-effects models** are advanced statistical tools designed to handle complex data structures.
- These models are essential when dealing with **hierarchical organization**, **repeated measures**, and **random effects** in datasets.
- LMMs are particularly useful when traditional **ANOVA or regression assumptions**—like independence of observations, homoscedasticity, and normality of residuals—**are not met.**
## Software Tools and Resources for LMMs
### Tools for Implementing LMMs
- The development and use of LMMs are supported by several software packages and programming languages.
- Key resources include the `lme4` package in **R**, detailed by Bates et al. (2015), which simplifies the fitting of mixed models, especially those with crossed random effects.
- For Python users, `Pymer4`, developed by Jolly (2018), integrates **Python** with R's `lme4` package, broadening accessibility to these advanced methods.
## Applications of LMMs Across Disciplines
### Broad Applications of LMMs
- LMMs find **diverse applications** across various scientific domains, addressing unique analytical challenges.
- In healthcare, LMMs model pandemic-related mortality changes (Verbeeck et al., 2023) and analyze longitudinal data in clinical trials (Touraine et al., 2023).
- In ecology, studies by Harrison et al. (2018) and Bolker et al. (2009) discuss their use in analyzing complex ecological data.
- In psychology and neuroscience, LMMs tackle the complexities of repeated measures and nested data structures (Magezi, 2015; Aarts et al., 2015).
# Methods - Mathematical Foundations {background-color="#40666e"}
## Linear Algebra {.smaller}
### Foundations
LMMs build on **linear algebra**. Here we explain the mathematical concepts for a **two-level longitudinal random-intercepts model.** Index *i* denotes the participant and index *t* denotes the time point of the observation.
$$
Y=X\beta + Zu+ \epsilon
$$
Equation 1: the base linear mixed model.
- **Y** is the [response vector]{.underline}. Shape N x 1, where N is the total number of repeated measures
- **X** is the design [matrix for fixed effects]{.underline}. Shape N x p, where p is the number of regression coefficients
- **β** is the [vector of regression coefficients.]{.underline} Shape p x 1
- **Z** is the design [matrix for random effects]{.underline}. Shape N x J, where J is the number of subjects
- ***u*** is the [vector of random effects.]{.underline} Shape J x 1
- **ϵ** is the [vector of residual errors]{.underline}. Shape N x 1
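As a rough illustration of the shapes listed above, the sketch below builds **X** and **Z** for a toy random-intercepts design. The data frame `toy`, its column names, and the sizes (J = 3 subjects with 2 time points each, so N = 6) are made up for illustration and are not taken from the study data.

```r
# Toy longitudinal layout: J = 3 subjects, 2 time points each (N = 6).
# All names and values here are hypothetical.
toy <- data.frame(
  subject = factor(rep(c("s1", "s2", "s3"), each = 2)),
  time    = rep(c(0, 1), times = 3)
)

# Fixed-effects design matrix X: intercept + time, so p = 2.
X <- model.matrix(~ time, data = toy)

# Random-intercepts design matrix Z: one indicator column per subject.
Z <- model.matrix(~ 0 + subject, data = toy)

dim(X)  # N x p
dim(Z)  # N x J
```

Each row of `Z` contains a single 1 marking the subject that the observation belongs to, which is how `Zu` adds the subject-specific intercept to every repeated measure.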
## Assumptions {.smaller}
1. The relationship between the **predictors and response** variable is assumed to be **linear**, within each level of random effects.
2. **Random effects** **(*u*)** are assumed to follow a **normal distribution** with mean zero and variance-covariance matrix G.
$u \sim N(0,G)$
3. **Residual errors (ϵ )** are assumed to follow a **normal distribution** with mean zero and variance-covariance matrix R.
$\epsilon \sim N(0,R)$
4. **Random effects (*u*) and residual errors (ϵ ) are assumed to be independent.**
5. **Homoscedasticity** is assumed for the residuals across all levels of the independent variables.
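Taken together, assumptions 2 to 4 imply a marginal distribution for the response in Equation 1. The first display is the standard result. The second is a common special case that additionally assumes independent, equal-variance random intercepts and residuals; the symbols $\sigma_u^2$ and $\sigma^2$ are introduced here for illustration and do not appear elsewhere in these slides.

$$
Y \sim N\left(X\beta,\; ZGZ^{\top} + R\right)
$$

$$
G = \sigma_u^2 I_J, \quad R = \sigma^2 I_N \quad\Rightarrow\quad \operatorname{Var}(Y) = \sigma_u^2 ZZ^{\top} + \sigma^2 I_N
$$

In this special case, two observations from the same subject share covariance $\sigma_u^2$, while observations from different subjects are uncorrelated.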
## Implementation in R {.smaller}
- Data is loaded from a CSV file using the `read.csv()` function.
- Fitting data to LMMs
    - The **lme()** function from the `nlme` package has parameters to specify the random effects structure and the estimation method.
    - The **lmer()** function from the `lme4` package has similar syntax to `lme()` but differs in how random effects are specified.
- Hypothesis testing
    - Evaluated using **F-tests, likelihood ratio tests, and Shapiro-Wilk tests**
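The workflow above can be sketched on a small public dataset. This example uses `Orthodont`, a dataset bundled with `nlme`, rather than the capstone data, and the model formulas are chosen only for illustration. In `lme4`, the same random intercept would be written as `(1 | Subject)` inside the model formula.

```r
library(nlme)  # ships with standard R installations

# Orthodont: built-in nlme data with repeated distance measurements per Subject.
# Fit by ML so that models with different fixed effects can be compared.
m0 <- lme(distance ~ age, random = ~ 1 | Subject,
          data = Orthodont, method = "ML")
m1 <- lme(distance ~ age + Sex, random = ~ 1 | Subject,
          data = Orthodont, method = "ML")

anova(m1)       # F-tests for each fixed-effect term
anova(m0, m1)   # likelihood ratio test of the nested models

# Shapiro-Wilk test of normality on the within-group residuals.
shapiro.test(resid(m1))
```

The two models are fit with `method = "ML"` here because likelihood ratio tests between models that differ in their fixed effects are not meaningful under REML.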
# The Capstone Project Data {background-color="#40666e"}
## Dataset Overview
- Key attributes and measurements in the dataset.
- Categorical and numerical variables.
- Presence of **missing values**, especially in the `Fres_PP` variable.
## Why Linear Mixed Models (LMMs)?
- Suitability of LMMs for the dataset.
- **Multiple observations over time** for the same participants.
- Handling **unbalanced groups**, as observed in participant dropout over time.
## EDA - Categorical Variables {.smaller}
![](images/Frequency_Plots.jpg)
## EDA - Numerical Variables {.smaller}
![](images/qq_plots.jpg)
## Outlier Detection and Summary Statistics {.smaller}
- Presence of **outliers** in variables and their implications.
![](images/box_plot.jpg)
## Participant Dropout Analysis {.smaller}
- **Significance of participant dropout over time.**
- Ability of LMMs to **handle unbalanced groups**
![](images/countplot.jpg)
# Analysis & Results {background-color="#40666e"}
## The Initial Model
### One Random Effect
In this dataset:
- Measures of airway resistance and reactance are the [**variables of interest**]{.underline}: `R5Hz_PP`, `R20Hz_PP`, `X5Hz_PP`, `Fres_PP`.
- Controlled variables such as `Group`, `Age`, `Weight`, `Height`, and other co-morbidities are present. These are the [**fixed effects.**]{.underline}
- Random variability may exist between individual observations which are nested in each subject. These represent the [**random effects.**]{.underline} In the [**initial model**]{.underline}, `Subject_ID` was treated as the sole *random effect*.
## The Initial Model {.scrollable}
### One Random Effect
![](images/clipboard-4283912119.png){width="1638"}
## The Initial Model
### One Random Effect
![Equation 2. The initial LMM.](images/initial_model.png){fig-align="center"}
## Implementation
```{r}
#| eval: false
#| echo: true
#| code-line-numbers: "1-9|5|6|8|11-16|14"
# lme()
# Fit models using a tidy and clear approach
model_lme <- lme(
  fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg,
  random = list(Subject_ID = pdIdent(~1)),  # random intercept for each subject
  data = x_clean,
  method = "REML"  # restricted maximum likelihood
)
# lmer()
model_lmer <- lmer(
  formula = R5Hz_PP + R20Hz_PP + X5Hz_PP + Fres_PP ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + (1 | Subject_ID),  # random intercept for each subject
  data = x_clean
)
```
## Evaluation {.smaller .scrollable}
- **Akaike Information Criterion (AIC)** - indicator of model fit without unnecessary complexity.
- AIC for lme = 1898.95 **(selected as initial model)**
- AIC for lmer = 2517.37
- Assumptions Check - **normality**.
![](images/clipboard-1796225568.png){width="452"}
![](images/clipboard-87669187.png){width="447"}
![](images/clipboard-1829089058.png){width="450"}
**Finding:** the residuals [**were not**]{.underline} normally distributed, so this model does not satisfy the assumptions of LMMs.
## The Imputed Model
### Satisfying Assumptions
- Upon further inspection, **outliers were present** in most variables.
- To improve model performance, these **outliers were imputed using the threshold values *(i.e., winsorization).***
- Confirmation of outlier removal was completed using **boxplots**.
- All metrics were then **reevaluated**.
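A minimal sketch of winsorization is shown below. The slides do not state which threshold values were used, so this example assumes the boxplot fences (1.5 times the interquartile range beyond the quartiles); the function name `winsorize_iqr` and the sample vector are hypothetical.

```r
# Cap values outside the boxplot fences (Q1 - k*IQR, Q3 + k*IQR).
# The k = 1.5 rule is an assumption; the slides do not state the threshold.
winsorize_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  lower <- q[1] - fence
  upper <- q[2] + fence
  pmin(pmax(x, lower), upper)  # NA values pass through unchanged
}

x <- c(1, 2, 3, 4, 5, 100)  # hypothetical values with one outlier
winsorize_iqr(x)
```

Unlike deleting outliers, this keeps every observation in the dataset, so no repeated measure is lost from a subject's series.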
## Evaluation {.smaller .scrollable}
**AIC** for lme = 1790.91 **(better!)**
![](images/clipboard-1896923212.png){width="446"}
![](images/clipboard-1410964575.png){width="445"}
![](images/clipboard-3092760633.png){width="443"}
**Finding:** the residuals [were]{.underline} normally distributed, so this **model satisfies the assumptions of LMMs.**
## The Final Model {.smaller}
### Two Random Effects and Final Fixed Effect
This was a **longitudinal study** involving multiple observations for each subject over time, and subjects were grouped into **two categories** (children with [sickle cell disease]{.underline} and African-American children with [asthma]{.underline}).
Thus, in this final model:
- we modeled **`Group`** as a *fixed effect* since we were interested in the effect of the group itself on the outcome.
- **`Subject_ID`** should be a *random effect* to account for the repeated measures within subjects.
- **`Observation_number`** was included as a *random slope* within **`Subject_ID`** (i.e., nested within Subject_ID).
- The **same visualizations and tests** were completed to assess the LMM assumptions.
## The Final Model
![Equation 3. The final LMM.](images/final_model.png){fig-align="center"}
## Implementation
```{r}
#| eval: false
#| echo: true
#| code-line-numbers: "|1|3"
model_lme_imputed_final <- lme(fixed = cbind(R5Hz_PP, R20Hz_PP, X5Hz_PP, Fres_PP) ~ BMI + Asthma + ICS + LABA + Gender + Age_months + Height_cm + Weight_Kg + Group,
                               data = x_clean_imputed,
                               random = list(Subject_ID = pdIdent(~1 + Observation_number)),  # random intercept and slope within subject
                               method = "REML")
```
## Evaluation {.smaller .scrollable}
- **AIC** for lme = 1801.60 (better than the initial model, but slightly worse than the imputed model)
![](images/clipboard-3999468873.png){width="432"}
![](images/clipboard-2247813014.png){width="430"}
![](images/clipboard-4002798640.png){width="431"}
![](images/clipboard-3825756155.png){width="429"}
**Findings:**
- The residuals [were]{.underline} normally distributed, so this **model satisfies the assumptions of LMMs.**
- The AIC penalizes model complexity to avoid overfitting, suggesting that the added effects of Group and Observation_number **may not improve model fit enough to justify the added complexity.**
- However, these effects may still be relevant given the research goal of the project despite the slight increase in AIC, **and thus will be left in the final model.**
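For reference, the standard definition of AIC makes the complexity penalty explicit. Here $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood; lower values are preferred.

$$
\mathrm{AIC} = 2k - 2\ln\hat{L}
$$

With the values reported on these slides, the final model's AIC exceeds the imputed model's by $1801.60 - 1790.91 = 10.69$.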
# Conclusion {background-color="#40666e"}
## Overview of Model Evaluations {.smaller .scrollable}
- In our analysis, we compared three Linear Mixed Models: the **base model**, the **model with imputed values**, and the **final adjusted model**, to [predict airway resistance and reactance effectively.]{.underline}
- We focused on **Mean Squared Error (MSE)** and **Mean Absolute Error (MAE)** to assess [model performance.]{.underline}
![](images/Figure22.png){width="432"}
![](images/Figure23.png){width="432"}
- **Findings:** The **final imputed model** achieved the [lowest MSE and MAE, indicating superior performance over the other models.]{.underline}
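The two error metrics can be sketched in a few lines of R. The helper names `mse` and `mae` and the numbers below are hypothetical and serve only to show the calculation; they are not study results.

```r
# Mean squared error and mean absolute error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical values for illustration only (not study data).
actual    <- c(100, 110, 95, 120)
predicted <- c(102, 108, 99, 117)

mse(actual, predicted)
mae(actual, predicted)
```

MSE squares each error and therefore weights large misses more heavily, while MAE treats all errors in proportion to their size, so reporting both gives a fuller picture of prediction quality.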
## Sample Predictions vs. Actual Data {.smaller}
![](images/Figure24.png){width="432"}
- Figure 24 illustrates a side-by-side comparison of the **predicted versus actual values** for `R5Hz_PP`, a measure of airway resistance, for **10 random subjects.**
- The **close alignment** between predicted and actual values **represents a low residual error,** confirming the **model's high accuracy** in predicting `R5Hz_PP`.
## Conclusion
- Our analysis demonstrates that **linear mixed models are exceptionally versatile and can effectively handle complex datasets with multiple layers of correlation and missing data**, incorporating both [fixed]{.underline} and [random]{.underline} effects seamlessly.
- **Our final model accurately predicts airway resistance and reactance** given demographic and co-morbidity data, which could aid in better understanding and managing respiratory functions in children with conditions such as [Sickle Cell Disease]{.underline} and [asthma]{.underline}.
## Acknowledgements
The authors thank **Dr. Achraf Cohen** for his ongoing mentorship and support.
[Questions are welcome and encouraged!]{.underline}