# Cause vs (Age & Sex)
```{r, echo=FALSE}
library(ggplot2)
library(rlist)
library(anchors)
library(plyr)
library(dplyr)
library(plotly)
library(ggmosaic)
```
<p>
To understand how mortality varies across age groups, we downloaded datasets from http://ghdx.healthdata.org/gbd-results-tool and plotted the number of deaths against age group and sex. \
We kept the age groups exactly as they appear in the source dataset; the unequal age intervals are not a choice we made. \
The plots and statements below correspond to data collected from 1990 to 2016. Also, wherever an average is mentioned, it is an average taken per annum from 1990 to 2017.
</p>
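As a rough illustration of how a per-group average of this kind can be formed, the sketch below averages `val` over years within each age and sex group using base R's `aggregate`. The `toy` data frame, its values, and the `year` column are assumptions made for the example; they are not taken from the GBD files.

```r
# Toy data: made-up death counts, NOT taken from the GBD dataset.
# The 'year' column is an assumption about how yearly rows are labelled.
toy <- data.frame(
  age  = rep(c("<1 year", "70+ years"), each = 4),
  sex  = rep(c("Female", "Male"), times = 4),
  year = rep(c(1990, 1990, 1991, 1991), times = 2),
  val  = c(10, 14, 12, 16, 100, 90, 110, 94)
)

# Mean of val across the years within each (age, sex) group.
toy_mean <- aggregate(val ~ age + sex, data = toy, FUN = mean)
print(toy_mean)
```

Each row of `toy_mean` is one age and sex combination, which is the shape the bar charts in this chapter are drawn from.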
```{r, echo=FALSE}
###CODE TO READ DATA FROM THE FILES
#reading csv files into data frames and
#preprocessing, forming proper datasets for plotting---------------------
csv <- read.csv(file = 'total_all_causes.csv', header = TRUE)
data_all <- as.data.frame(csv)
#preprocessing and forming proper datasets for plotting------------------
#--------------------------
#over all data
data_all_2 <- data_all[which(data_all$metric == 'Number'), ]
data_all_2$age <- plyr::revalue(data_all_2$age, c("1 to 4"="1-4 years"))
death_numbers_2 <- data_all_2 %>%
group_by(age, sex) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
death_numbers_2$age <- factor(death_numbers_2$age, levels = c("<1 year", "1-4 years", "5-14 years", "15-49 years", "50-69 years", "70+ years"))
#--------------------------
#injuries
csv3 <- read.csv(file = 'injuries.csv', header = TRUE)
data_injuries <- as.data.frame(csv3)
data_injuries2 <- data_injuries[which(data_injuries$metric == 'Number'), ]
data_injuries2$age <- plyr::revalue(data_injuries2$age, c("1 to 4"="1-4 years"))
data_injuries3 <- data_injuries2 %>%
group_by(age, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
data_injuries3$age <- factor(data_injuries3$age, levels = c("<1 year", "1-4 years", "5-14 years", "15-49 years", "50-69 years", "70+ years"))
#---------------------------
#Deaths due to communicable diseases
csv4 <- read.csv(file = 'communicable-1.csv', header = TRUE)
csv5 <- read.csv(file = 'communicable-2.csv', header = TRUE)
df4 <- as.data.frame(csv4)
df5 <- as.data.frame(csv5)
df4 <- df4[which(df4$metric == 'Number'), ]
df5 <- df5[which(df5$metric == 'Number'), ]
df_comm <- rbind(df4, df5)
df_comm$age <- plyr::revalue(df_comm$age, c("1 to 4"="1-4 years"))
data_comm_a <- df_comm %>%
group_by(age, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
data_comm_a$age <- factor(data_comm_a$age, levels = c("<1 year", "1-4 years", "5-14 years", "15-49 years", "50-69 years", "70+ years"))
#---------------------------
#Deaths due to non-communicable diseases
csv6 <- read.csv(file = 'non_communicable-1.csv', header = TRUE)
csv7 <- read.csv(file = 'non_communicable-2.csv', header = TRUE)
csv8 <- read.csv(file = 'non_communicable-3.csv', header = TRUE)
df6 <- as.data.frame(csv6)
df6 <- df6[which(df6$metric == 'Number'), ]
df7 <- as.data.frame(csv7)
df7 <- df7[which(df7$metric == 'Number'), ]
df8 <- as.data.frame(csv8)
df8 <- df8[which(df8$metric == 'Number'), ]
df_noncomm <- rbind(df6, df7, df8)
df_noncomm$age <- plyr::revalue(df_noncomm$age, c("1 to 4"="1-4 years"))
data_noncomm_a <- df_noncomm %>%
group_by(age, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
data_noncomm_a$age <- factor(data_noncomm_a$age, levels = c("<1 year", "1-4 years", "5-14 years", "15-49 years", "50-69 years", "70+ years"))
```
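The chunk above repeats the same steps for every category: keep the `Number` rows, rename the `1 to 4` age label, summarise `val`, and order the age factor. A minimal sketch of how those steps could be wrapped in one function is shown here. The helper `summarise_deaths()` is hypothetical and not part of the original analysis, it uses base R's `aggregate` in place of the dplyr pipeline, and the input values are made up.

```r
# Hypothetical helper (not part of the original analysis): applies the
# filter / relabel / summarise / re-level steps to one data frame.
summarise_deaths <- function(df, by = "cause") {
  # Keep only the rows that hold absolute numbers of deaths.
  df <- df[df$metric == "Number", ]
  # Rename the "1 to 4" label so that it matches the other age labels.
  df$age <- as.character(df$age)
  df$age[df$age == "1 to 4"] <- "1-4 years"
  # Sum and mean of val for each (age, by) combination.
  f <- as.formula(paste("val ~ age +", by))
  out <- aggregate(f, data = df, FUN = sum)
  names(out)[names(out) == "val"] <- "val_sum"
  out$val_mean <- aggregate(f, data = df, FUN = mean)$val
  # Order the age groups from youngest to oldest for plotting.
  age_levels <- c("<1 year", "1-4 years", "5-14 years",
                  "15-49 years", "50-69 years", "70+ years")
  out$age <- factor(out$age, levels = age_levels)
  out
}

# Made-up input, NOT GBD data.
toy <- data.frame(
  metric = c("Number", "Number", "Rate", "Number"),
  age    = c("1 to 4", "1 to 4", "1 to 4", "70+ years"),
  cause  = rep("Transport injuries", 4),
  val    = c(10, 20, 0.5, 300)
)
res <- summarise_deaths(toy, by = "cause")
print(res)
```

Passing `by = "sex"` would give the age-by-sex summary in the same way, provided the data frame has a `sex` column.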
<h3>Let's start plotting!</h3>
```{r, echo=FALSE}
#plot --- avg deaths vs age (all causes)
death_numbers_2a = aggregate(val_mean ~ age, data=death_numbers_2, FUN=sum)
plot_ly(x = death_numbers_2a$age, y = death_numbers_2a$val_mean, type = "bar") %>%
layout(title = "Avg. no. of deaths vs Age group",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
From the plot, we see that the average number of deaths is highest in the 70+ years age group and lowest in the 5-14 years age group. It is also interesting that the averages for the 1-4 years and 5-14 years age groups are lower than those for the <1 year and 15-49 years age groups. Overall, the average number of deaths decreases until the 5-14 years age group and then increases. \
Why do the values look like this? What are the major causes of death? How do deaths from the various causes vary with age? These are the questions we wish to answer by further exploring and visualizing the dataset.
</p>
<p>
Let us take a look at the causes of deaths (mentioned in the dataset).
</p>
```{r, echo=FALSE}
#Plot showing various causes of death
df_big <- rbind(data_injuries3, data_comm_a, data_noncomm_a)
df_big = aggregate(val_mean ~ cause, data=df_big, FUN=sum)
#df_big$cause <- factor(data$cause, levels = unique(data$cause)[order(data$val_mean, decreasing = TRUE)])
df_big <- df_big[order(df_big$val_mean),]
plot_ly(df_big, x = ~val_mean, y = ~cause, type = "bar") %>%
layout(title="Causes of death vs Average number of deaths",
xaxis = list(title = "Average number of deaths",
zeroline = FALSE, family = "times new roman"),
yaxis = list(title = "Causes of death",
zeroline = FALSE, family = "times new roman",
categoryorder = "array", categoryarray = df_big$cause),
font=list(family="Times New Roman", size=12))
```
<p>
This plot shows the causes of death against the average number of deaths. As we see, mental disorders cause the fewest deaths while cardiovascular diseases cause the most. \
The causes of death can be divided into three categories: 1. Injuries, 2. Communicable, maternal, neonatal, and nutritional diseases, and 3. Non-communicable diseases. Let us compare the number of deaths caused by each of these categories across the age groups. \
Let us first look at the deaths due to injuries in each age group.
</p>
<h3>Injuries</h3>
```{r, echo=FALSE}
injuries_death_numbers = aggregate(val_mean ~ age, data=data_injuries3, FUN=sum)
plot_ly(x = injuries_death_numbers$age, y = injuries_death_numbers$val_mean, type = "bar") %>%
layout(title = "Avg. no. of deaths due to injuries vs Age group",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths due to injuries",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
From this graph, we can see that the largest number of deaths due to injuries occurs in the 15-49 years age group. This is partly to be expected, because this age group spans a wide range of ages. Now, let us dive into the specific causes within the injuries category.
</p>
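One way to account for the unequal widths of the age groups is to divide each average by the number of single years of age the group covers. The sketch below does this with made-up values (the `injuries_toy` data frame is an assumption, not GBD data). The open-ended 70+ years group is left out because it has no fixed width.

```r
# Made-up averages, NOT GBD values. 'width' is the number of single
# years of age covered by each closed age group.
injuries_toy <- data.frame(
  age      = c("<1 year", "1-4 years", "5-14 years",
               "15-49 years", "50-69 years"),
  val_mean = c(50, 120, 100, 700, 300),
  width    = c(1, 4, 10, 35, 20)
)

# Average deaths per single year of age within each group.
injuries_toy$per_year_of_age <- injuries_toy$val_mean / injuries_toy$width
print(injuries_toy)
```

This only adjusts for the width of the interval. It does not adjust for how many people are alive in each age group, so it is not a mortality rate.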
```{r, echo=FALSE}
plot_ly(data_injuries3, x = ~age, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths due to injuries vs Age",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths due to injuries",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
This plot reveals a lot of information. Look at the first three age groups: within the injuries category, people in these age groups mostly died of unintentional injuries. Common unintentional injuries include drowning, falls, poisoning, etc. As young children do not know how to handle such situations, they are more likely to lose their lives to unintentional injuries, as the plot shows. Also, deaths due to self-harm and interpersonal violence are very few in these three age groups. I think one of the most important reasons is that people are usually carefree at this age and rarely have real-life worries. \
In contrast, the number of deaths due to self-harm and interpersonal violence is highest in the 15-49 years age group (the middle age group). Stress levels are quite high in this age group, for reasons ranging from financial problems to heartbreak. \
In every age group except 15-49 years, the average number of deaths due to self-harm and interpersonal violence is lower than that due to the other two causes. \
The following mosaic plot presents the same data in relative terms. We cannot read the actual values from the plot alone, so bar graphs made with plotly seem to be a better option than mosaic plots.
</p>
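To make "relative" concrete: within each age group, a mosaic plot of this kind shows each cause's share of that group's total, not the counts themselves. The sketch below computes those shares with `prop.table` for a small made-up table (the `counts` matrix is an assumption, not GBD data).

```r
# Made-up counts, NOT GBD values: rows are age groups, columns are causes.
counts <- matrix(
  c(30, 10, 5,
    200, 150, 250),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    age   = c("5-14 years", "15-49 years"),
    cause = c("Unintentional injuries", "Transport injuries",
              "Self-harm and interpersonal violence")
  )
)

# Share of each cause within its age group (every row sums to 1).
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Two age groups with very different totals can have similar shares, which is why the actual values cannot be recovered from the shares alone.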
```{r, echo=FALSE, fig.width=12, fig.height=8}
library(dplyr)
data_injuries4 <- data_injuries3 %>% dplyr::select(age, cause, val_mean)
data_injuries4$val_mean <- as.integer(data_injuries4$val_mean)
colnames(data_injuries4) <- c("age", "cause", "Freq")
ggplot(data_injuries4) +
geom_mosaic(
aes(x=product(cause, age), # cut from right to left
weight=Freq,
fill=cause
),
divider=c("vspine" , "hspine") # equivalent to divider=ddecker()
) +
ggtitle("Mosaic plot") +
xlab("Age groups") + ylab("Causes (in injuries category)")
```
<h3>Communicable, maternal, neonatal, and nutritional diseases</h3>
```{r, echo=FALSE}
communicable_death_numbers = aggregate(val_mean ~ age, data=data_comm_a, FUN=sum)
plot_ly(x = communicable_death_numbers$age, y = communicable_death_numbers$val_mean, type = "bar") %>%
layout(title = "Avg. no. of deaths due to communicable diseases vs Age group",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths due to communicable diseases",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
From this graph, we can see that the maximum average number of deaths in this category is in the <1 year age group and the minimum is in the 5-14 years age group. Let us explore what this category of causes includes and how much each cause contributes to the deaths.
</p>
```{r, echo=FALSE}
plot_ly(data_comm_a, x = ~age, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths due to communicable diseases vs Age",
xaxis = list(title = "Age Groups",
zeroline = FALSE, family = "times new roman"),
yaxis = list(title = "Avg. no. of deaths due to communicable diseases",
zeroline = FALSE, family = "times new roman"),
font=list(family="Times New Roman", size=12))
```
<p>
In the <1 year age group, the cause responsible for the maximum number of deaths is 'Maternal and neonatal disorders'. This cause is prominent in only two age groups, <1 year and 15-49 years; in all other age groups, the average number of deaths due to it is negligible. The 15-49 years age group has the maximum average number of deaths due to 'HIV/AIDS and sexually transmitted infections'. In all the age groups, 'Respiratory infections and tuberculosis' seems to be one of the main causes of death in this category. \
Again, the following mosaic plot shows the data in a relative sense. As with the previous mosaic plot, we cannot read the true values from it. Also, because the boxes representing the same cause do not start at the same point, comparisons are difficult, and they get harder as the number of causes increases. So, we will no longer use mosaic plots when the number of causes is high.
</p>
```{r, fig.width=12, fig.height=8, echo=FALSE}
data_comm_b <- data_comm_a %>% dplyr::select(age, cause, val_mean)
data_comm_b$val_mean <- as.integer(data_comm_b$val_mean)
colnames(data_comm_b) <- c("age", "cause", "Freq")
ggplot(data_comm_b) +
geom_mosaic(
aes(x=product(cause, age), # cut from right to left
weight=Freq,
fill=cause
),
divider=c("vspine" , "hspine") # equivalent to divider=ddecker()
) +
ggtitle("Mosaic plot") +
xlab("Age groups") + ylab("Causes \n (in Communicable, maternal, neonatal, and nutritional diseases category)")
```
<h3>Non-communicable diseases</h3>
```{r, echo=FALSE}
non_communicable_death_numbers = aggregate(val_mean ~ age, data=data_noncomm_a, FUN=sum)
plot_ly(x = non_communicable_death_numbers$age, y = non_communicable_death_numbers$val_mean, type = "bar") %>%
layout(title = "Avg. no. of deaths due to non-communicable diseases vs Age group",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths due to non-communicable diseases",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
The trend in this plot is similar to the trend in the first plot: the average number of deaths decreases until the 5-14 years age group and then increases. The increase is very steep. In this category, the average number of deaths in the 70+ years age group is more than the sum of the averages of all the other age groups. Let us explore the main causes in this category.
</p>
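A comparison like "70+ years exceeds all other age groups combined" can be checked directly from the aggregated values. The sketch below shows the form of that check on made-up numbers (the `noncomm_toy` vector is an assumption, not GBD data).

```r
# Made-up averages per age group, NOT GBD values.
noncomm_toy <- c("<1 year" = 20, "1-4 years" = 10, "5-14 years" = 8,
                 "15-49 years" = 150, "50-69 years" = 500,
                 "70+ years" = 900)

# Average for the oldest group versus the sum over all other groups.
oldest <- noncomm_toy[["70+ years"]]
others <- sum(noncomm_toy[names(noncomm_toy) != "70+ years"])
print(c(oldest = oldest, others = others))
print(oldest > others)
```

With the real data, the same comparison could be run on the `val_mean` column produced by the `aggregate` call above.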
```{r, echo=FALSE}
plot_ly(data_noncomm_a, x = ~age, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths due to non-communicable diseases vs Age",
xaxis = list(title = "Age Groups",
zeroline = FALSE, family = "times new roman"),
yaxis = list(title = "Avg. no. of deaths due to non-communicable diseases",
zeroline = FALSE, family = "times new roman"),
font=list(family="Times New Roman", size=12))
```
<p>
The average number of deaths due to non-communicable diseases is low in the three age groups <1 year, 1-4 years, and 5-14 years. Cardiovascular diseases and neoplasms are the two non-communicable diseases with the highest recorded average number of deaths in this category.
</p>
<h3>Let us now check the average number of deaths and causes based on sex.</h3>
```{r, echo=FALSE}
plot_ly(death_numbers_2, x = ~age, y = ~val_mean, color = ~sex, type = "bar") %>%
layout(title="Avg. no. of deaths vs Age (by sex)",
xaxis = list(title = "Age Groups",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
Comparing the two graphs above, we see that, irrespective of sex, the same pattern is observed: the average number of deaths decreases until the 5-14 years age group and then increases up to the 70+ years age group. \
The graph also shows that the average number of deaths in males is greater than that in females in every age group considered except the '70+ years' group. Let us explore how the average number of deaths is divided by sex.
</p>
```{r, echo=FALSE}
death_numbers_2 <- death_numbers_2 %>% dplyr::select(age, sex, val_mean)
death_numbers_2$val_mean <- as.integer(death_numbers_2$val_mean)
colnames(death_numbers_2) <- c("age", "sex", "Freq")
ggplot(death_numbers_2) +
geom_mosaic(
aes(x=product(sex, age), # cut from right to left
weight=Freq,
fill=sex
),
divider=c("vspine" , "hspine") # equivalent to divider=ddecker()
) +
ggtitle("Mosaic plot") +
xlab("Age groups") + ylab("Sex")
```
<p>
Here we come back to mosaic plots because, with only two categories, they make comparison easy. By looking at the graph, we can readily compare the relative average number of deaths in males and females. But again, the exact values cannot be read from a mosaic plot.
</p>
```{r, echo=FALSE}
#DATA
data_injuries_sex <- data_injuries2 %>%
group_by(sex, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
data_comm_sex <- df_comm %>%
group_by(sex, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
data_noncomm_sex <- df_noncomm %>%
group_by(sex, cause) %>%
dplyr::summarise(val_sum=sum(val),
val_mean=(mean(val)))
```
```{r, echo=FALSE}
#PLOT injuries
plot_ly(data_injuries_sex, x = ~sex, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths vs Sex (by Injuries)",
xaxis = list(title = "Sex",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
This graph shows the average number of deaths of females and males by cause of death within the injuries category. Overall, the majority of the deaths were due to unintentional injuries. For all three causes in the graph, the average number of deaths of females is lower than that of males. Among females, transport injuries cause the fewest deaths of the three.
</p>
```{r, echo=FALSE}
#PLOT comm...
plot_ly(data_comm_sex, x = ~sex, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths vs Sex \n (by communicable, maternal, neonatal, and nutritional diseases)",
xaxis = list(title = "Sex",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
From this graph, we observe that the average number of deaths due to enteric infections, HIV/AIDS and sexually transmitted infections, neglected tropical diseases and malaria, nutritional deficiencies, and other infectious diseases is almost equal in females and males. \
A large difference is observed in the average number of deaths due to maternal and neonatal disorders and due to respiratory infections and tuberculosis. In both cases, the average is higher in males than in females. In females, respiratory infections and tuberculosis cause the maximum average number of deaths in this category, while in males, maternal and neonatal disorders cause the maximum.
</p>
```{r, echo=FALSE}
#PLOT non-communicable diseases
plot_ly(data_noncomm_sex, x = ~sex, y = ~val_mean, color = ~cause, type = "bar") %>%
layout(title="Avg. no. of deaths vs Sex (by non-communicable diseases)",
xaxis = list(title = "Sex",
zeroline = FALSE),
yaxis = list(title = "Avg. no. of deaths",
zeroline = FALSE),
font=list(family="Times New Roman", size=12))
```
<p>
In the category of non-communicable diseases, the average number of deaths due to diabetes is approximately equal in females and males. Cardiovascular diseases and neoplasms (cancer) are the top two causes of death in both males and females in this category, and both have a higher count in males than in females. The only disease in this category that caused significantly more deaths in females than in males is neurological disorders. \
\
NOTE: We tried using Cleveland dot plots / scatter plots, but felt that bar charts are the better choice because values are easier to compare in them.
</p>
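For reference, a Cleveland dot plot of cause-level averages could be drawn as in the sketch below, which uses base R's `dotchart`. It is an illustration only, not the version we tried: the values in `cause_toy` are made up, and only the cause names come from the plots above.

```r
# Made-up averages, NOT GBD values; cause names as used in this chapter.
cause_toy <- c("Mental disorders" = 1, "Neurological disorders" = 40,
               "Neoplasms" = 150, "Cardiovascular diseases" = 300)

# Sort so that the largest value is drawn at the top of the chart.
cause_sorted <- sort(cause_toy)

# One dot per cause, positioned along a common horizontal axis.
dotchart(cause_sorted,
         xlab = "Avg. no. of deaths (made-up values)",
         main = "Cleveland dot plot sketch")
```

Because every dot sits on a shared axis, ranking the causes is straightforward, although reading values for several groups at once (for example, by sex) needs extra symbols or panels.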