---
title: "Analysis of BMJ Open badges study"
author: "Adrian Barnett and Anisa Rowhani-Farid"
date: "`r format(Sys.time(), '%d %B %Y')`"
output: word_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning=FALSE, message=FALSE, error=FALSE, comment='', dpi=400)
options(width=1000) # Wide pages
options(scipen=999) # avoid scientific notation
# basic summary functions
Missing = function(x) base::sum(is.na(x))
Mean = function(x) base::mean(x, na.rm=TRUE)
Median = function(x) stats::quantile(x, probs=0.5, na.rm=TRUE)
Q1 = function(x) stats::quantile(x, probs=0.25, na.rm=TRUE)
Q3 = function(x) stats::quantile(x, probs=0.75, na.rm=TRUE)
Min = function(x) base::min(x, na.rm=TRUE)
Max = function(x) base::max(x, na.rm=TRUE)
Sum = function(x) base::sum(x, na.rm=TRUE)
SD = function(x) stats::sd(x, na.rm=TRUE)
N = function(x) base::length(x)
# libraries
library(tm) # for word frequency
library(dplyr)
library(exact2x2) # for CI for Fisher's test
library(tables)
library(broom)
library(tidyr)
library(pander)
library(corpus) # for phrases
panderOptions('table.emphasize.rownames', FALSE)
panderOptions('keep.trailing.zeros', TRUE)
panderOptions('table.split.table', Inf)
panderOptions('table.split.cells', Inf)
panderOptions('big.mark', ',')
library(ggplot2)
g.theme = theme_bw() + theme(panel.grid.minor = element_blank())
cbPalette = c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# function to round with trailing zeros
roundz = function(x, digits=0){formatC( round( x, digits ), format='f', digits=digits)}
# get the data
load('data/AnalysisReady.RData') # from 0_read_data.R
# logical to scramble the treatment group (used later)
scramble = FALSE
```
# Scrambled treatment group
This document contains the statistical analyses for the _BMJ Open_ badges study.
The initial document was produced using a scrambled intervention group, created by randomly assigning participants to the intervention or control group.
This allowed us to finalise the statistical analysis plan and ensure that all investigators understood the analyses before the real treatment allocation was used.
The descriptive statistics at baseline are presented using the real treatment groups as this does not influence the choice of the final analysis.
We do not use statistical tests to compare the two randomised groups at baseline as such tests are hard to interpret and are not recommended by the CONSORT guidelines.
This report and the statistical analyses were produced using R Markdown with R version 3.6.0 (R Core Team 2019).
# Participant recruitment
### CONSORT flow chart
```{r CONSORT, fig.width=7.5, fig.height=7, dpi=400}
source('1_consort_plot.R')
par(mai=c(0,0.04,0.04,0.04))
plotmat(M, pos = pos, name = labels, lwd = 1, shadow.size=0, curve=0,
box.lwd = 2, cex.txt = 1, box.size = frame$box.size, box.col=frame$box.col,
box.type = frame$box.type, box.prop = frame$box.prop, txt.col = tcol)
# add left-aligned text; -0.185 controls the horizontal indent
for (i in to.blank){
text(x=pos[i,1] - 0.185, y=pos[i,2], adj=c(0,0.5), labels=labels[i]) # minus controls text position
}
# extra arrow to excluded
arrows(x0=0.5, x1=0.55, y0=0.82, length=0.12)
```
### Plot of cumulative numbers over time
The plot below is based on those papers that were not initially excluded because they were meta-analyses or systematic reviews.
```{r recruit.dates}
post.exclusion = filter(data, is.na(random.factor) == FALSE & study_type.factor != 'Meta-analysis and Systematic Review') # remove initial exclusions, but keep those that were eventually rejected
tplot = ggplot(data=post.exclusion, aes(x=qut_recruitment_date, col=random.factor))+
stat_ecdf(size=1.1)+
theme_bw()+
xlab('')+
ylab('Cumulative distribution')+
scale_color_manual('Group', values=cbPalette[2:3])+
theme(legend.position = c(0.8,0.2))
tplot
# for text below
recruit = summarise(post.exclusion, min = min(qut_recruitment_date),
max = max(qut_recruitment_date))
```
There was a halt in recruitment over the Christmas period. The range of recruitment dates was from `r recruit$min` to `r recruit$max`, which is `r as.numeric(recruit$max - recruit$min + 1)` days.
# Compare the control and intervention groups at baseline
The table below is based on those papers that were not excluded because they were meta-analyses or systematic reviews.
### Table of categorical variables
```{r categorical.variables}
cat.tab = tabular( (Heading('Type of Study')*study_type.factor +
Heading('Has the article been accepted for publication at BMJ Open?')*article_accepted.factor +
Heading('Journal article publication status')*publication_status.factor +
Heading('Has the participant withdrawn from the Study?')*withdrawn.factor
)~(Heading(' ')*random.factor)*((n=1) + Heading('%')*Percent('col')*Format(digits=0)), data=post.exclusion)
pander(cat.tab)
```
# Primary outcome
```{r scramble, include=FALSE}
# scramble from now on
if(scramble == TRUE){
set.seed(123456)
post.exclusion$random = sample(post.exclusion$random, size=nrow(post.exclusion), replace=F)
post.exclusion$random.factor = factor(post.exclusion$random,levels=c("1","2"))
levels(post.exclusion$random.factor)=c("control","intervention")
}
# data for analysis, not excluded, paper published, and not withdrawn
for.analysis = filter(post.exclusion, excluded.factor=='No', publication_status.factor == 'Published', withdrawn.factor == "No")
```
### Did the authors receive a badge?
```{r badge}
# badge.factor
tab = tabular( (Heading('Awarded a badge')*badge.factor
) + 1~(Heading('')*random.factor)*((n=1) + Heading('%')*Percent('col')*Format(digits=0)), data=for.analysis)
pander(tab)
# test and confidence interval
tab = with(for.analysis, table(badge.factor, random.factor))
fisher = exact2x2(tab, tsmethod="minlike")
```
We used Fisher's exact test because of the small cell sizes.
The estimate and confidence limits from `exact2x2` are inverted below so that the odds ratio is expressed for the intervention group relative to the control group.
The odds ratio for awarding badges in the intervention group relative to the control group is `r roundz(1/fisher$estimate, 1)` with a 95% confidence interval from `r roundz(1/fisher$conf.int[2], 1)` to `r roundz(1/fisher$conf.int[1], 1)`.
The p-value from Fisher's exact test is `r format.pval(fisher$p.val, eps=0.001, digits=3)`.
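As a minimal sketch of the inline calculation above (assuming the `fisher` object returned by `exact2x2` in the chunk above; not evaluated in this report), the inversion simply takes the reciprocal of the estimate and swaps the confidence limits:
```{r fisher.inversion.sketch, eval=FALSE}
# Sketch only (not evaluated): invert the odds ratio so the comparison is
# intervention relative to control, swapping the confidence limits accordingly
or = 1 / fisher$estimate
or.lower = 1 / fisher$conf.int[2]
or.upper = 1 / fisher$conf.int[1]
cat('Odds ratio =', roundz(or, 1),
    ', 95% CI from', roundz(or.lower, 1), 'to', roundz(or.upper, 1), '\n')
```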
### Data sharing statement
```{r data.statement}
# badge.factor
tab = tabular( (Heading('Final Data Sharing Statement')*data_sharing_statement.factor
) + 1 ~(Heading('')*random.factor)*((n=1) + Heading('%')*Percent('col')*Format(digits=0)), data=for.analysis)
pander(tab)
```
The data sharing statement has more categories than the binary (yes or no) badge variable.
# Secondary outcomes
## Words frequently used in the final data sharing statements (secondary outcome)
Here we examine the words commonly used in the final data sharing statements to see whether the intervention had an effect on the language used.
```{r words.used, include=FALSE}
# "We will collect the statements in every “Data sharing statement” and use a word frequency table to compare the two study arms. We will first remove common words such as: at, the, there, their, etc. "
# process words into corpus
control = filter(for.analysis, random.factor=='control')$data_sharing_statement_verbatim_postqut
intervention = filter(for.analysis, random.factor=='intervention')$data_sharing_statement_verbatim_postqut
word.freq = function(inwords){
docs = Corpus(VectorSource(inwords))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Combine data/dataset/datasets
words <- 'datasets|dataset|data'
tran <- 'data/dataset'
docs <- tm_map(docs, content_transformer(function(x) gsub(x, pattern = words, replacement = tran)))
#
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v), freq=v)
rownames(d) = NULL
return(d)
} # end of function
# function to prepare data for phrases
word.phrases = function(inwords){
docs = Corpus(VectorSource(inwords))
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
return(docs)
} # end of function
```
#### Top 20 words in control group
```{r top20control}
control.freq = word.freq(control)
pander(head(control.freq, 20), style='simple')
```
#### Top 20 words in intervention group
```{r top20intervention}
intervention.freq = word.freq(intervention)
pander(head(intervention.freq, 20), style='simple')
```
#### Differences in top 20 words
The plot below shows the difference in ranking in the top 20 words between the two groups.
```{r top.20.plot, fig.width=7, fig.height=7}
control.freq = mutate(control.freq,
x=1,
group = 'Control',
rank= 1:n())
intervention.freq = mutate(intervention.freq,
x=3,
group = 'Intervention',
rank = 1:n())
to.plot = bind_rows(control.freq, intervention.freq) %>%
filter(rank <= 20) # top 20
# make data for arrows
from = filter(to.plot, x==1) %>%
rename(y = rank) %>%
mutate(x = x + 0.4) # move to the right
to = filter(to.plot, x==3) %>%
mutate(x = x - 0.4) %>% # move to the left
rename(xend=x, yend=rank) %>%
select(-freq, -group)
arrows = full_join(from, to, by='word') %>%
mutate(x = ifelse(is.na(x), 1.5 - 0.1, x),
xend = ifelse(is.na(xend), 2.5 +0.1, xend),
y = ifelse(is.na(y), 21, y), # move to outside 20 top
yend = ifelse(is.na(yend), 21, yend))
# plot
tile.plot = ggplot(to.plot, aes(x=x, y=rank, width=0.8, fill=freq)) +
geom_tile( )+
geom_text(aes(label=word)) +
scale_y_reverse(breaks=c(20,15,10,5,1))+ # high to low rank
scale_x_continuous(breaks=c(1,3), labels=c('Control','Intervention'))+
xlab('')+
ylab('Rank')+
scale_fill_gradient('Word\nfrequency', low = "white", high = "red")+
geom_segment(data=arrows, aes(x=x, xend=xend, y=y, yend=yend))+
g.theme+
geom_text(aes(x=1, y=21, label='Outside top 20'), col=grey(0.5))+
geom_text(aes(x=3, y=21, label='Outside top 20'), col=grey(0.5))
tile.plot
jpeg('figures/top20words.jpg', width=6, height=7, units='in', res=300, quality = 100)
print(tile.plot)
invisible(dev.off())
```
## Look at three-word common phrases
### Control group
```{r control.phrases}
control.phrases = word.phrases(control)
phrases = term_stats(control.phrases, ngrams = 3)
pander(head(phrases, 10), style='simple')
```
### Intervention group
```{r intervention.phrases}
intervention.phrases = word.phrases(intervention)
phrases = term_stats(intervention.phrases, ngrams = 3)
pander(head(phrases, 10), style='simple')
```
## Number of words used in the final data sharing statements (secondary outcome)
Here we compare the number of words used to examine whether those in the intervention group tended to write longer data sharing statements.
#### Boxplot by group
```{r word.count.boxplot}
# "We will also compare the average number of words per data sharing statement and calculate the difference and 95% confidence interval of the difference.""
bplot = ggplot(data=for.analysis, aes(x=random.factor, y=n.words))+
geom_boxplot()+
theme_bw()+
xlab('')+
ylab('Number of words')
bplot
```
#### Poisson regression model of word counts
```{r word.count.model}
model = glm(n.words ~ random.factor, data=for.analysis, family=quasipoisson())
s = summary(model)
parms = tidy(model, conf.int=TRUE, conf.level=0.95) %>%
mutate(term = gsub('random.factor|\\(|\\)', '', term),
p.value = format.pval(p.value, eps=0.001, digits=3)) %>%
rename('mean' = 'estimate', lower=conf.low, upper=conf.high) %>%
select('term','mean','lower','upper','p.value')
pander(parms, style='simple', digits=2)
# for text below
rr = roundz(as.numeric(exp(model$coefficients[2])), digits=2)
ci = confint(model)
rr.lower = roundz(exp(ci[2,1]), digits=2)
rr.upper = roundz(exp(ci[2,2]), digits=2)
```
We used a Poisson regression model with over-dispersion to examine the association between the intervention and the number of words in the data sharing statement.
The table shows the parameter estimates on the log-scale.
There was no association between the intervention and the number of words. The rate ratio for the intervention group relative to the control group was `r rr` with a 95% confidence interval from `r rr.lower` to `r rr.upper`.
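The rate ratio in the text is the exponentiated model coefficient with exponentiated profile confidence limits. A minimal sketch of an equivalent calculation using `broom` (using the `model` object fitted above; not evaluated here) is:
```{r rate.ratio.sketch, eval=FALSE}
# Sketch only: exponentiate the log-scale estimates to give rate ratios
rr.table = tidy(model, conf.int=TRUE, conf.level=0.95, exponentiate=TRUE) %>%
  mutate(term = gsub('random.factor|\\(|\\)', '', term)) %>%
  select(term, estimate, conf.low, conf.high)
pander(rr.table, style='simple', digits=2)
```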
## Time needed, table of summary statistics
The table below summarises the time needed (in minutes) by the QUT study team to contact authors and verify the datasets.
We present the summary statistics for both groups combined.
```{r time}
to.tab = filter(for.analysis, is.na(time_check)==FALSE) # just non-missing
tab = tabular(time_check ~ (N + Mean*Format(digits=1) + Min + Max), data=to.tab)
pander(tab)
```
# Planned interactions
We will test whether there is an interaction between the main effect of Open Data Badges (intervention or control arm) and:

* Study type: Clinical trials, observational studies, longitudinal studies, surveys, other

Depending on the numbers recruited we may combine small groups.
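As a rough sketch of how this interaction could be tested (this is an assumption, not pre-specified code: it uses a logistic regression of the binary badge outcome on the randomised group, study type, and their interaction, compared with a likelihood ratio test):
```{r interaction.sketch, eval=FALSE}
# Sketch only (not evaluated): compare models with and without the
# group-by-study-type interaction using a likelihood ratio test
main.model = glm(badge.factor ~ random.factor + study_type.factor,
                 data=for.analysis, family=binomial())
int.model = glm(badge.factor ~ random.factor * study_type.factor,
                data=for.analysis, family=binomial())
anova(main.model, int.model, test='Chisq')
```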
# Acknowledgements
Adrian Barnett is supported by a National Health and Medical Research Council Senior Research Fellowship (APP1117784).
# References
* R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.