% This is the Reed College LaTeX thesis template. Most of the work
% for the document class was done by Sam Noble (SN), as well as this
% template. Later comments etc. by Ben Salzberg (BTS). Additional
% restructuring and APA support by Jess Youngberg (JY).
% Your comments and suggestions are more than welcome; please email
% them to [email protected]
%
% See http://web.reed.edu/cis/help/latex.html for help. There are a
% great bunch of help pages there, with notes on
% getting started, bibtex, etc. Go there and read it if you're not
% already familiar with LaTeX.
%
% Any line that starts with a percent symbol is a comment.
% They won't show up in the document, and are useful for notes
% to yourself and explaining commands.
% Commenting also removes a line from the document;
% very handy for troubleshooting problems. -BTS
% As far as I know, this follows the requirements laid out in
% the 2002-2003 Senior Handbook. Ask a librarian to check the
% document before binding. -SN
%%
%% Preamble
%%
% \documentclass{<something>} must begin each LaTeX document
\documentclass[11pt,oneside,a4paper]{reedthesis}
% Packages are extensions to the basic LaTeX functions. Whatever you
% want to typeset, there is probably a package out there for it.
% Chemistry (chemtex), screenplays, you name it.
% Check out CTAN to see: http://www.ctan.org/
%%
\usepackage{graphicx,latexsym}
\usepackage{amsmath}
\usepackage{amssymb,amsthm}
\usepackage{longtable,booktabs,setspace}
\usepackage{chemarr} %% Useful for one reaction arrow, useless if you're not a chem major
\usepackage[hyphens]{url}
% Added by CII
\usepackage{hyperref}
\usepackage{lmodern}
\usepackage{float}
\floatplacement{figure}{H}
% End of CII addition
\usepackage{rotating}
% Next line commented out by CII
%%% \usepackage{natbib}
% Comment out the natbib line above and uncomment the following two lines to use the new
% biblatex-chicago style, for Chicago A. Also make some changes at the end where the
% bibliography is included.
%\usepackage{biblatex-chicago}
%\bibliography{thesis}
% Added by CII (Thanks, Hadley!)
% Use ref for internal links
\renewcommand{\hyperref}[2][???]{\autoref{#1}}
\def\chapterautorefname{Chapter}
\def\sectionautorefname{Section}
\def\subsectionautorefname{Subsection}
% End of CII addition
% Added by CII
\usepackage{caption}
\captionsetup{width=5in}
% End of CII addition
% \usepackage{times} % other fonts are available like times, bookman, charter, palatino
% Syntax highlighting #22
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.77,0.63,0.00}{#1}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
% To pass between YAML and LaTeX the dollar signs are added by CII
\title{Chasing The Trajectory of Terrorism: A Machine Learning Based Approach
to Achieve Open Source Intelligence}
\author{Pranav Pandya}
\immatriculation{Immatriculation Number: 552590}
% The month and year that you submit your FINAL draft TO THE LIBRARY (May or December)
\date{24th July 2018}
\division{Business \& Economics}
\advisor{Prof.~Dr.~Markus Loecher}
\institution{Berlin School of Economics and Law}
\degree{Master of Science (M.Sc.)}
%If you have two advisors for some reason, you can use the following
% Uncommented out by CII
\altadvisor{Prof.~Dr.~Markus Schaal}
% End of CII addition
%%% Remember to use the correct department!
\department{Business Intelligence \& Process Management}
% if you're writing a thesis in an interdisciplinary major,
% uncomment the line below and change the text as appropriate.
% check the Senior Handbook if unsure.
%\thedivisionof{The Established Interdisciplinary Committee for}
% if you want the approval page to say "Approved for the Committee",
% uncomment the next line
%\approvedforthe{Committee}
% Added by CII
%%% Copied from knitr
%% maxwidth is the original width if it's less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
\ifdim\Gin@nat@width>\linewidth
\linewidth
\else
\Gin@nat@width
\fi
}
\makeatother
\renewcommand{\contentsname}{Table of Contents}
% End of CII addition
\setlength{\parskip}{0pt}
% Added by CII
%\setlength{\parskip}{\baselineskip}
\usepackage[parfill]{parskip}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\Acknowledgements{
I want to express my deep sense of gratitude to my supervisors
Prof.~Dr.~Markus Loecher and Prof.~Dr.~Markus Schaal (Berlin School of
Economics \& Law). Words are inadequate in offering my thanks to them
for their encouragement and cooperation in carrying out this research
project. Their able guidance and useful suggestions helped me complete
the project work on time. \par \par
Finally, yet importantly, I would like to express my heartfelt thanks to
my beloved mother for her blessings, encouragement, and wishes for the
successful completion of this research project.
}
\Dedication{
I dedicate this thesis to two people who mean a lot to me. First and
foremost, to my mother, Anjana P. Pandya, who has been a constant source
of inspiration for me. I am thankful to you for your constant support
and blessings, which help me achieve the goals I have set in life.
Secondly, to my late maternal grandfather, Shri Upendrabhai M. Joshi, who
always believed in my ability. You made a garden of heart and planted
all the good things which gave my life a start. You encouraged me to
dream by fostering and nurturing the seeds of self-esteem. You taught me
the difference between right and wrong and made a pathway which will last
a lifetime long. You have gone away forever from this world, but your
memories are and will always be in my heart.
}
\Declaration{
I, Pranav Pandya, hereby formally declare that I have written the
submitted Master's thesis entirely by myself without anyone else's
assistance. Where I have drawn on literature or other sources, either in
direct quotes, or in paraphrasing such material, I have referenced the
original author or authors and the source in which it appeared.
I am aware that the use of quotations, or of close paraphrasing, from
books, magazines, newspapers, the internet or other sources, which are
not marked as such, will be considered as an attempt at deception, and
that the thesis will be graded as a fail. In the event that I have
submitted the dissertation - either in whole or in part - for
examination within the framework of another examination, I have informed
the examiners and the board of examiners of this fact.
\hfill\break
\hfill\break
\hfill\break
\hfill\break
\rule{0.3\textwidth}{0.4pt} \hfill\break
\begin{flushleft}
Pranav Pandya\\
Berlin, July 2018\end{flushleft}
}
\Preface{
}
\Abstract{
In recent years, terrorism has taken on a whole new dimension and become
a global issue because of widespread attacks and a comparatively high
number of fatalities. Understanding the attack characteristics of the most
active groups, followed by statistical analysis, is therefore an
important aspect of counterterrorism support in the present
situation. In this thesis, we use a variety of data mining techniques
and descriptive analysis to determine, examine and characterize the threat
level from the ten most active and violent terrorist groups and then use
machine learning algorithms to derive intelligence for
counterterrorism support. We use historical data on terrorist attacks
that took place around the world between 1970 and 2016 from the
open-source \href{https://www.start.umd.edu/gtd/about/}{Global Terrorism
Database}; the primary objective is to translate terror incident
related information into actionable intelligence. In other words, we
chase the trajectory of terrorism in the present context with
statistical methods and derive insights that can be useful. \par
A major part of this thesis is based on supervised and unsupervised
machine learning techniques. We use the Apriori algorithm to discover
patterns in various groups. One of the
interesting patterns we find is that ISIL is more likely to attack other
terrorists (non-state militia) with bombings/explosions resulting in
6 to 10 fatalities, whereas Boko Haram is more likely
to target civilians with explosives, without a suicide attack, and
with more than 50 resulting fatalities. Within the supervised machine
learning context, we extend previous research in time-series
forecasting and make use of TBATS, ETS, Auto Arima and Neural Network
models. We predict the future number of attacks in Afghanistan and the SAHEL
region, and the number of fatalities in Iraq, at a monthly frequency.
From time-series forecasting, we show two things: the model that works
best on one time series may not be the best on another,
and the use of an ensemble significantly improves forecasting
accuracy over the base models. Similarly, in the classification modeling
part, previous research lacks the use of recently developed
algorithms. We extend previous research on the binary classification
problem and use the cutting-edge LightGBM algorithm to predict the
probability of a suicide attack. Our model achieves an AUC of 96\%
and correctly classifies 86.5\% of the ``Yes'' instances of suicide
attacks.
}
% End of CII addition
%%
%% End Preamble
%%
%
\begin{document}
% Everything below added by CII
\maketitle
\frontmatter % this stuff will be roman-numbered
\pagestyle{empty} % this removes page numbers from the frontmatter
\begin{declaration}
I, Pranav Pandya, hereby formally declare that I have written the
submitted Master's thesis entirely by myself without anyone else's
assistance. Where I have drawn on literature or other sources, either in
direct quotes, or in paraphrasing such material, I have referenced the
original author or authors and the source in which it appeared.
I am aware that the use of quotations, or of close paraphrasing, from
books, magazines, newspapers, the internet or other sources, which are
not marked as such, will be considered as an attempt at deception, and
that the thesis will be graded as a fail. In the event that I have
submitted the dissertation - either in whole or in part - for
examination within the framework of another examination, I have informed
the examiners and the board of examiners of this fact.
\hfill\break
\hfill\break
\hfill\break
\hfill\break
\rule{0.3\textwidth}{0.4pt} \hfill\break
\begin{flushleft}
Pranav Pandya\\
Berlin, July 2018\end{flushleft}
\end{declaration}
\begin{acknowledgements}
I want to express my deep sense of gratitude to my supervisors
Prof.~Dr.~Markus Loecher and Prof.~Dr.~Markus Schaal (Berlin School of
Economics \& Law). Words are inadequate in offering my thanks to them
for their encouragement and cooperation in carrying out this research
project. Their able guidance and useful suggestions helped me complete
the project work on time. \par \par
Finally, yet importantly, I would like to express my heartfelt thanks to
my beloved mother for her blessings, encouragement, and wishes for the
successful completion of this research project.
\end{acknowledgements}
\hypersetup{linkcolor=black}
\setcounter{tocdepth}{2}
\tableofcontents
\listoftables
\listoffigures
\begin{abstract}
In recent years, terrorism has taken on a whole new dimension and become
a global issue because of widespread attacks and a comparatively high
number of fatalities. Understanding the attack characteristics of the most
active groups, followed by statistical analysis, is therefore an
important aspect of counterterrorism support in the present
situation. In this thesis, we use a variety of data mining techniques
and descriptive analysis to determine, examine and characterize the threat
level from the ten most active and violent terrorist groups and then use
machine learning algorithms to derive intelligence for
counterterrorism support. We use historical data on terrorist attacks
that took place around the world between 1970 and 2016 from the
open-source \href{https://www.start.umd.edu/gtd/about/}{Global Terrorism
Database}; the primary objective is to translate terror incident
related information into actionable intelligence. In other words, we
chase the trajectory of terrorism in the present context with
statistical methods and derive insights that can be useful. \par
A major part of this thesis is based on supervised and unsupervised
machine learning techniques. We use the Apriori algorithm to discover
patterns in various groups. One of the
interesting patterns we find is that ISIL is more likely to attack other
terrorists (non-state militia) with bombings/explosions resulting in
6 to 10 fatalities, whereas Boko Haram is more likely
to target civilians with explosives, without a suicide attack, and
with more than 50 resulting fatalities. Within the supervised machine
learning context, we extend previous research in time-series
forecasting and make use of TBATS, ETS, Auto Arima and Neural Network
models. We predict the future number of attacks in Afghanistan and the SAHEL
region, and the number of fatalities in Iraq, at a monthly frequency.
From time-series forecasting, we show two things: the model that works
best on one time series may not be the best on another,
and the use of an ensemble significantly improves forecasting
accuracy over the base models. Similarly, in the classification modeling
part, previous research lacks the use of recently developed
algorithms. We extend previous research on the binary classification
problem and use the cutting-edge LightGBM algorithm to predict the
probability of a suicide attack. Our model achieves an AUC of 96\%
and correctly classifies 86.5\% of the ``Yes'' instances of suicide
attacks.
\end{abstract}
\begin{dedication}
I dedicate this thesis to two people who mean a lot to me. First and
foremost, to my mother, Anjana P. Pandya, who has been a constant source
of inspiration for me. I am thankful to you for your constant support
and blessings, which help me achieve the goals I have set in life.
Secondly, to my late maternal grandfather, Shri Upendrabhai M. Joshi, who
always believed in my ability. You made a garden of heart and planted
all the good things which gave my life a start. You encouraged me to
dream by fostering and nurturing the seeds of self-esteem. You taught me
the difference between right and wrong and made a pathway which will last
a lifetime long. You have gone away forever from this world, but your
memories are and will always be in my heart.
\end{dedication}
\mainmatter % here the regular arabic numbering starts
\pagestyle{fancyplain} % turns page numbering back on
\fontsize{11}{12}\selectfont
\chapter*{Introduction}\label{introduction}
\addcontentsline{toc}{chapter}{Introduction}
Today, we live in a world where terrorism is becoming a primary
concern because of the growing number of terrorist incidents involving
civilian fatalities and infrastructure damage. The ideology and
intentions behind such attacks are indeed a matter of worry. Living under
the constant threat of terrorist attacks in any place is no better than
living in a jungle and worrying about which animal will attack you and
when. An increase in the number of radicalized attacks around the world is
a clear indication that terrorism is transitioning from a place to an
idea. However, the existence of specific terror groups and their attack
characteristics over time can be vital to fighting terrorism
and to engaging peacekeeping missions effectively. Although the number
of terrorist incidents is growing, the availability of open-source
data containing information on such incidents, recent developments in
machine learning algorithms and the technical infrastructure to handle
large amounts of data open up a variety of ways to turn information into
actionable intelligence.
\section*{Definition of terrorism}\label{definition-of-terrorism}
\addcontentsline{toc}{section}{Definition of terrorism}
Terrorism in a broader sense includes state-sponsored and non-state
sponsored terrorist activities. The scope of this research is limited to
\textbf{non-state sponsored} terrorist activities only. Non-state actors,
in simple words, are entities that are not affiliated with, directed by or
funded by the government and that exercise significant economic,
political or social power and influence at a national and international
level to a certain extent (NIC, 2007). Examples of non-state actors
include NGOs, religious organizations, multinational companies, armed
groups or even an online (Internet) community. ISIL is the prime example
of a non-state actor that falls under the armed groups segment.
\begin{quote}
The Global Terrorism Database (National Consortium for the Study of
Terrorism and Responses to Terrorism (START), 2016) defines a terrorist
attack as the threatened or actual use of illegal force and violence by a
non-state actor to attain a political, economic, religious or social
goal through fear, coercion or intimidation.
\end{quote}
This implies that the following three attributes are always present
in each event of our chosen dataset:
\begin{itemize}
\tightlist
\item
The incident must be intentional -- the result of a conscious
calculation on the part of a perpetrator.
\item
The incident must entail some level of violence or immediate threat of
violence including property violence, as well as violence against
people.
\item
The perpetrators of the incidents must be sub-national actors.
\end{itemize}
\section*{Problem statement}\label{problem-statement}
\addcontentsline{toc}{section}{Problem statement}
Nowadays, data is considered the most valuable resource, and machine
learning makes it possible to interpret complex data. However, most use
cases are seen in a business context, such as music recommendation,
predicting customer churn or estimating the probability of having cancer.
With recent developments in machine learning algorithms and access to
open-source data and software, there are plenty of opportunities to
correctly understand historical terrorist attacks and prevent future
conflicts. In the last decade, terrorist attacks have increased
significantly (data source: GTD), as shown in the plot below:
\begin{figure}
\includegraphics[width=1\linewidth]{thesis_files/figure-latex/unnamed-chunk-1-1} \caption{Terrorist attacks around the world between 1970-2016}\label{fig:unnamed-chunk-1}
\end{figure}
After the September 2001 attacks, the USA and other powerful nations have
carried out major operations to neutralize the power and spread of the
most violent known terrorist groups within targeted regions such as
Afghanistan, Iraq and, most recently, Syria. It is also worth mentioning
that the United Nations has long had ongoing peacekeeping missions in
conflict regions around the world. However, the number of
terror attacks continues to rise and has in fact been close to its peak in
the last 5 years. This raises the question of why terrorism is becoming
unstoppable despite these continued efforts. Understanding and
interpreting the attack characteristics of relevant groups in line with
their motivations can reveal the bigger picture. Extensive
research by Heger (2010) supports this argument and suggests that a
group's political intentions are revealed when we examine who or what it
chooses to attack.
\hypertarget{research-design-and-data}{\section*{Research design and
data}\label{research-design-and-data}}
\addcontentsline{toc}{section}{Research design and data}
This research employs a mix of qualitative and quantitative research
methodology to achieve the set objective. In total, we evaluate
over 170,000 terrorist attacks. We start with exploratory data analysis
to assess the impact on a global scale and then use a variety of data
mining techniques to determine the most active and violent terrorist
groups. This way, we ensure that the analysis reflects the situation in
recent years. We use descriptive statistics to understand the
characteristics of each group over time and locate the
major and minor epicenters (most vulnerable regions) based on threat
level. To examine whether or not the chosen groups have a common link with
the number of fatalities, we perform a statistical hypothesis test using
ANOVA followed by a post hoc test.
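
For reference, the statistic behind such a comparison can be sketched as
follows, assuming a one-way layout in which the $k$ chosen groups are
compared on the number of fatalities per attack. The symbols $k$, $n_j$,
$N$, $y_{ij}$, $\bar{y}_j$ and $\bar{y}$ are notation introduced here for
illustration and are not taken from the analysis itself.
\begin{equation*}
F = \frac{\displaystyle\sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2 \Big/ (k-1)}
         {\displaystyle\sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2 \Big/ (N-k)}
\end{equation*}
Here $y_{ij}$ is the number of fatalities in attack $i$ of group $j$,
$n_j$ is the number of attacks by group $j$, $N = \sum_{j} n_j$,
$\bar{y}_j$ is the mean for group $j$ and $\bar{y}$ is the overall mean.
Under the null hypothesis that all group means are equal (and the usual
assumptions of normality and equal variances), $F$ follows an $F$
distribution with $k-1$ and $N-k$ degrees of freedom. A large value of
$F$ suggests that at least one group mean differs, and a post hoc test
then indicates which pairs of groups differ.
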
The research then makes use of a variety of machine learning algorithms,
using both supervised and unsupervised techniques.
\begin{quote}
Samuel (1959), a well-known researcher in the field of
artificial intelligence who coined the term ``machine learning'',
defines it as a ``field of study that gives computers the
ability to learn without being explicitly programmed''. It is a subset
of artificial intelligence which enables computers to learn from
experience in order to infer a possible outcome, which is
later used to make a decision.
\end{quote}
With the Apriori algorithm, we discover interesting patterns through
association rules for individual groups. This way, we can pinpoint the
habits of specific groups. Next, we perform a time-series analysis to
examine seasonal patterns and correlations. To address the broad
question of ``when and where'', we use four time-series forecasting models,
namely Auto Arima, Neural Network, TBATS and ETS, to predict the future
number of attacks and fatalities. We evaluate and compare the
performance of each model on a hold-out set and use an ensemble approach to
further improve the accuracy of predictions. As illustrated in the
\protect\hyperlink{literature-review}{Literature review} section, most
research in time-series forecasting addresses country and year level
predictions. We extend the previous research in this field with a
seasonality component and make forecasts at a monthly frequency.
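
The two techniques named above can be made concrete with a short sketch.
The notation is introduced here for illustration and is an assumption,
not taken from the analysis: $X \Rightarrow Y$ is an association rule
between two disjoint sets of incident attributes, $N$ is the number of
incidents, $t$ ranges over incidents, and $\operatorname{supp}(X)$ is the
fraction of incidents that contain every attribute in $X$. The standard
measures used to rank such rules are
\begin{align*}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\left|\{t : X \cup Y \subseteq t\}\right|}{N}, \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}.
\end{align*}
A lift above 1 indicates that $X$ and $Y$ occur together more often than
would be expected if they were independent. For forecasting, a minimal
sketch of an ensemble, assuming equal weights over the $M = 4$ base
models (the weighting actually used is not specified in this section),
averages the base forecasts $\hat{y}^{(m)}_t$ and can be scored on a
hold-out set of $h$ months, here with the root mean squared error as an
illustrative choice of metric:
\begin{equation*}
\hat{y}^{\mathrm{ens}}_t = \frac{1}{M} \sum_{m=1}^{M} \hat{y}^{(m)}_t,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{h} \sum_{t=1}^{h} \left(y_t - \hat{y}^{\mathrm{ens}}_t\right)^2}.
\end{equation*}
Computing the same error for each base model on the same hold-out months
gives a like-for-like comparison with the ensemble.
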
Similarly, in the classification modeling part, previous research lacks
the use of recently developed algorithms that, in practice,
outperform traditional algorithms such as logistic regression and random
forests. We extend the previous research in the binary classification
context and use the cutting-edge LightGBM algorithm to predict the
class probability of an attack involving a suicide attempt. We
illustrate the importance of feature engineering and hyperparameter
optimization for the modeling process and describe the reasons why standard
validation techniques such as cross-validation would be a bad choice for
this data. We propose an alternative validation strategy and use the AUC
metric as well as a confusion matrix to evaluate model performance on
unseen data. From the trained model, we extract the most important
features and use an explainer object to further investigate the
decision-making process behind our model. The scope of analysis can be
further extended with a Shiny app, which is also an integral part of making
this research handy and interactive.
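
For reference, the two evaluation tools named above can be written down
as follows. The notation is an assumption introduced here for
illustration: $\mathrm{TP}$, $\mathrm{FP}$, $\mathrm{TN}$ and
$\mathrm{FN}$ are the counts of true positives, false positives, true
negatives and false negatives in the confusion matrix, with a suicide
attack as the positive class, and $s(x)$ is the predicted probability
that the model assigns to incident $x$.
\begin{align*}
\text{sensitivity} &= \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, &
\text{specificity} &= \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \\
\mathrm{AUC} &= \Pr\left( s(x^{+}) > s(x^{-}) \right),
\end{align*}
where $x^{+}$ is a randomly drawn positive incident and $x^{-}$ a
randomly drawn negative incident (ties are ignored in this sketch).
Sensitivity and specificity depend on the probability threshold used to
turn $s(x)$ into a class label, whereas AUC summarizes the ranking
quality over all thresholds, which makes it a useful complement when the
two classes are imbalanced.
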
\textbf{Data}

This research project uses historical data on terrorist attacks that
took place around the world between 1970 and 2016 from the open-source
\href{https://www.start.umd.edu/gtd/about/}{Global Terrorism Database
(GTD)} as its main source of data. It is currently the most comprehensive
unclassified database on terrorist events in the world and contains
information on over 170,000 terrorist attacks, including
the date and location of each incident, the weapons used, the
nature of the target, the number of casualties and, if identifiable, the
group or individual responsible. The data contains
more than 120 variables. One of the main reasons for choosing this
database is that 4,000,000 news articles and 25,000 news sources were
reviewed to prepare the data from 1998 to 2016 alone (National
Consortium for the Study of Terrorism and Responses to Terrorism
(START), 2016).
The main data is further enriched with country- and year-wise
socio-economic conditions, arms import/export details and migration
details from World Bank Open Data to get a multi-dimensional view for
some specific analyses. This additional data falls under the category of
early warning indicators (short term and long term) and is potentially
linked to the likelihood of violent conflicts, as suggested by
Walton (2011) and the Stockholm International Peace Research
Institute (2017).
An important aspect of this research is the use of open-source data and
open-source software, i.e.~R. A media-based data source is
chosen as the primary source of data because journalists are usually the
first to report and document such incidents, and in this regard,
first-hand information plays a significant role in the quantitative
analysis. Since the data comes from publicly available sources,
the term ``intelligence'' refers to the open-source intelligence (OSINT)
category. Intelligence categories are further explained in the next
chapter.
\section*{Policy and practice
implications}\label{policy-and-practice-implications}
\addcontentsline{toc}{section}{Policy and practice implications}
This research project is an endeavor to achieve actionable intelligence
using a machine learning approach and to contribute positively to
counterterrorism policy. The outcome of this research provides
descriptive findings on the most lethal groups, corresponding pattern
discovery through the Apriori algorithm and predictive analysis through
time-series forecasting and a classification algorithm. The research findings
and insights can help policy makers and authorities take the
necessary steps in time to prevent future terrorist incidents.
\section*{Deliverables}\label{deliverables}
\addcontentsline{toc}{section}{Deliverables}
\begin{itemize}
\tightlist
\item
a report in pdf version
\item
a report in gitbook version
\item
Shiny app
\item
R scripts
\end{itemize}
To ensure that the research claims are (easily) reproducible, this
thesis uses the rmarkdown and bookdown packages, which allow code execution
in line with the written report. The \textbf{gitbook version} of this report
is highly recommended over the pdf version because it allows interactivity
for some specific findings, such as the network graph in the pattern discovery
chapter. In addition, a shiny app in R is developed to make the
practical aspects of this research handy, interactive and easily
accessible. The app also allows the scope of analysis to be extended
further. All the scripts will be publicly accessible on my GitHub
profile\footnote{\url{https://github.com/pranavpandya84}} after
submission.
\hypertarget{essentials-counter}{\chapter{Essentials of
Counterterrorism}\label{essentials-counter}}
Terrorism research in a broad context suggests that intelligence for
counterterrorism support comes in many forms. The primary objective of
this research is to achieve actionable intelligence, so it is important
to identify the type of intelligence. In this chapter, we distinguish
between intelligence disciplines and then justify the reliability and
relevance of the chosen data.
\section{Intelligence disciplines}\label{intelligence-disciplines}
Extensive research by (Tanner, 2014) suggests that establishing
methodologies for collecting intelligence is important for authorities/
policy makers to combat terrorism. The Intelligence Officer's Bookshelf
from the CIA\footnote{\url{https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/csi-studies/studies/vol-60-no-1/pdfs/Peake-IO-Bookshelf-March-2016.pdf}}
recognizes Human Intelligence (HUMINT), Signals Intelligence (SIGINT),
Geospatial Intelligence (GEOINT), Measurement and Signature Intelligence
(MASINT) and Open Source Intelligence (OSINT) as the five main disciplines
of intelligence collection (Lowenthal \& Clark, 2015).
\textbf{Human Intelligence (HUMINT)}
As the name suggests, HUMINT comes from human sources and remains
synonymous with espionage and clandestine activities. This is one of the
oldest intelligence techniques and uses covert as well as overt
individuals to gather information. Examples of such individuals are
diplomats, special agents, field operatives or captured prisoners (The
Interagency OPSEC Support Staff, 1996). According to (CIA, 2013), human
intelligence plays a vital role in developing and implementing U.S.
national security policy and foreign policy to protect U.S. interests.
\textbf{Signals Intelligence (SIGINT)}
SIGINT is derived from electronic transmissions, for example by intercepting
communications between two channels/ parties. In the US, the National
Security Agency (NSA) is primarily responsible for signals intelligence
(Groce, 2018). An example of SIGINT is the NSA's mass surveillance program
PRISM, which is widely criticized due to the dangers associated with it in
terms of misuse.
\begin{quote}
Edward Snowden, a former NSA contractor and source of the Guardian's
investigation on systematic data trawling by the US government, suggests
that, ``The reality is this: if an NSA, FBI, CIA, DIA {[}Defence
Intelligence Agency{]}, etc analyst has access to query raw SIGINT
{[}signals intelligence{]} databases, they can enter and get results for
anything they want. Phone number, email, user id, cell phone handset id
(IMEI), and so on -- it's all the same. The restrictions against this
are policy based, not technically based, and can change at any time.''
(Siddique, 2013)
\end{quote}
\textbf{Geospatial Intelligence (GEOINT)}
GEOINT makes use of geo-spatial analysis and visual representation of
activities on the earth to examine suspicious activities. This is
usually carried out by observation flights, UAVs, drones and satellites
(Brennan, 2016).
\textbf{Measurement and Signature Intelligence (MASINT)}
MASINT is a comparatively less known methodology; however, it is becoming
extremely important as concerns about WMDs (Weapons of Mass
Destruction) increase. This approach performs analysis of data from
specific sensors for the purpose of identifying any distinctive features
associated with the source emitter or sender. This analysis serves as
scientific and technical intelligence information. An example of MASINT
is the FBI's extensive forensic work that helps detect traces of nuclear
materials, chemical and biological weapons (Groce, 2018).
\textbf{Open Source Intelligence (OSINT)}
OSINT is a relatively new approach that focuses on publicly available
information and sources such as newspaper articles, academic records and
open-source data made available to the public by governments or
researchers. The key advantage of open source intelligence is
accessibility, which makes it possible for individual researchers to
contribute toward counter terrorism support as part of a community. It
is important to note that the reliability of a data source can be complicated
to establish and thus requires review in order to be of use to policy makers (Groce,
2018; Tanner, 2014).
The focus and scope of work for this research are limited to Open Source
Intelligence only.
\section{OSINT and data relevance}\label{osint-and-data-relevance}
Despite its huge (and technically limitless) potential for counter
terrorism support, open source intelligence is often reviewed and
analysed before it can be used by policy makers. The reason lies in
complications related to the authenticity of the data source and the
methodology a researcher uses to compile data for hypothesis testing.
In simple terms, it is extremely important for policy makers to ensure
that a researcher has not relied on selection bias or cherry-picking to
claim the success of a particular theory or result (Brennan, 2016). A
research paper by (Geddes, 1990), namely
``\emph{How the Cases You Choose Affect the Answers You Get: Selection
Bias in Comparative Politics}'', explains the danger of biased
conclusions when only the cases that have achieved the outcome of
interest are studied. This clearly establishes the need for reproducible
research, which allows authorities to set a standard/ mechanism to
safeguard against selection bias. This is particularly important in
terrorism research. The issue can be addressed by sharing code/ scripts
through git repositories. Nowadays, tools such as rmarkdown and bookdown
for delivering reproducible research (Bauer, 2018; Xie, 2016) make it
even easier to identify selection bias.
\subsection{Open-source databases on
terrorism}\label{open-source-databases-on-terrorism}
In the context of terrorism research, there are many databases available
for academic research. Such databases extract and compile information
from a variety of sources (mainly open-source/ publicly available sources
such as news articles) at regular intervals and make it easy to use for
research. Some of the well-known databases that are open-source and
widely used in academic research for counter terrorism support are listed
below:
\textbf{1. Global Terrorism Database (GTD)}\footnote{\url{http://www.start.umd.edu/gtd/about/}}
\begin{itemize}
\tightlist
\item
Currently the most comprehensive unclassified database on terrorist
events in the world
\item
maintained by researchers at the National Consortium for the Study of
Terrorism and Responses to Terrorism (START), headquartered at the
University of Maryland in the USA
\end{itemize}
\textbf{2. Armed Conflict Location and Event Data Project
(ACLED)}\footnote{\url{https://www.acleddata.com/data/}}
\begin{itemize}
\tightlist
\item
  provides real-time data on all reported political violence and protest
  events, but is limited to developing regions, i.e.~Africa, South
  Asia, South East Asia and the Middle East
\end{itemize}
\textbf{3. UCDP/PRIO Armed Conflict Database}\footnote{\url{https://www.prio.org/Data/Armed-Conflict/UCDP-PRIO/}}
\begin{itemize}
\tightlist
\item
a joint project between the UCDP and PRIO that records armed conflicts
from 1946--2016
\item
maintained by Uppsala University in Sweden
\end{itemize}
\textbf{4. SIPRI Databases}\footnote{\url{https://www.sipri.org/databases}}
\begin{itemize}
\tightlist
\item
provides databases on military expenditures, arms transfers, arms
embargoes and peacekeeping operations
\item
maintained by Stockholm International Peace Research Institute
\end{itemize}
In order to address the research objective, I find the Global Terrorism
Database most relevant, and it is the main source of data for this
research. As mentioned in the
\protect\hyperlink{research-design-and-data}{Research design and data}
section, the main data is further enriched with world development indicators
for each country by year from World Bank Open Data.\footnote{\url{https://data.worldbank.org/}}
\section{What's important in terrorism
research?}\label{whats-important-in-terrorism-research}
The aim of any research can be seen as an effort toward creating new
knowledge, insights or a perspective. In this regard, careful selection
of the data source and corresponding statistical analysis based on the research
objective is extremely important. An equally important aspect is sharing
the data and code so that research claims or findings can be
reproduced. This also forms the basis for the trustworthiness and
usefulness of the research outcome.
\subsection{Primary vs secondary
sources}\label{primary-vs-secondary-sources}
The term ``sources'' refers to data or material used in research and
has two distinct categories. Primary sources provide first-hand
information about an incident. Secondary sources are normally based on
primary sources and provide interpretive information about an incident
(Indiana University Libraries, 2007). For example, a propaganda video/
speech released by ISIL or any other terrorist group is a primary
source, whereas a newspaper article that publishes a journalist's
interpretation of that speech is a secondary source. Researcher
(Schuurman, 2018) suggests that, in such scenarios, the difference is
not always clear-cut because it depends on the type of question
being asked. Contrary to popular belief, newspaper or media articles are
considered a secondary source of information about terrorism and
terrorists. However, news or media articles can be considered a primary
source of information when the research focuses on how media reports on
terrorism (Schuurman, 2018). In our case, the main source of data is
news and media articles about reported terrorist incidents, which
fit the category of primary source of data based on the research objective.
\subsection{Use of statistical
analysis}\label{use-of-statistical-analysis}
In most areas of scientific analysis, statistics is considered
an important and accepted way to ensure that claims made by researchers
meet defined quality standards (Ranstorp, 2006). To be specific,
descriptive statistics helps describe variables within data and is often
used to perform initial data analysis in most research. On the other
hand, inferential statistics helps draw conclusions/ decisions based
on observed patterns (Patel, 2009).
A prominent researcher (Andrew Silke, 2004), in his book
``\emph{Research on Terrorism: Trends, Achievements and Failures}'',
explains why inferential statistics is particularly important in the
terrorism research context. The author suggests that inferential
statistics is useful for introducing an element of control into research. In
experimental research, control is usually obtained by random
assignment of research subjects to experimental and control groups;
however, this is difficult to achieve in real-world research. As a result, the lack
of a control element raises doubt about any relations between variables that
the research claims to find. As a solution, inferential statistics can
help introduce a recognized control element within research so that
less doubt and more confidence can be achieved over the veracity of the
research outcome.
\hypertarget{literature-review}{\chapter{Literature
Review}\label{literature-review}}
I use a structured approach to narrow down recent and relevant
literature. In this chapter, we take a glimpse at prior research in this
field and review the relevant literature in line with the factors identified
in the \protect\hyperlink{essentials-counter}{Essentials of
Counterterrorism} chapter. In the last part, we examine the literature
gap and its relevance to our research topic.
\section{Overview of prior research}\label{overview-of-prior-research}
Scientific research in the field of terrorism is heavily affected by a
lack of research continuity. According to (Gordon, 2007), there is indeed
a growing amount of literature in the terrorism field, but the majority of
contributors are one-timers who visit and study this field, contribute
a few articles, and then move to another field. Researcher (Schuurman,
2018) points out another aspect and suggests that terrorism research has
been criticized for a long time for being unable to overcome
methodological issues such as high dependency on secondary sources,
corresponding literature review methods and relatively insufficient
statistical analyses. This argument is further supported by a number of
prominent researchers in this field. Compared to other similar fields
such as criminology, terrorism research suffers considerably from
complications in data availability, reliability and the corresponding
analysis needed to make the research useful to policymakers (Brennan, 2016).
\subsection{Harsh realities}\label{harsh-realities}
One of the harsh realities in terrorism research is that the use of
statistical analysis is fairly uncommon. In the late 1980s, (Jongman, 1988) in
his book ``\emph{Political Terrorism: A New Guide To Actors, Authors,
Concepts, Data Bases, Theories, And Literature}'' identified serious
concerns in terrorism research related to the methodologies used by
researchers to prepare data and the corresponding level of analysis. (A.
Silke, 2001) reviewed the articles in terrorism research between 1995
and 2000 and suggests that the key issues raised by (Jongman, 1988) remained
unchanged in that period as well. The findings indicate that
only 3\% of research papers in the major terrorism journals involved the
use of inferential analysis. Similar research was carried out by (Lum,
Kennedy, \& Sherley, 2006) on the quality of research articles in terrorism
research. Their findings suggest that much was written on
terrorism between 1971 and 2003, with 14,006 articles published;
however, the share of research that can help/support counterterrorism strategy was
extremely low. This study also suggests that only 3\% of the articles
were based on some form of empirical analysis, 1\% of articles were
identified as case studies and the rest of the articles (96\%) were just
thought pieces.
Very recently, researcher (Schuurman, 2018) also conducted extensive
research to review all the articles (3442) published from 2007 to 2016
in nine academic journals on terrorism, providing insight into
whether or not the trend mentioned above continues in terrorism research.
The research outcome suggests an upward trend in the use of
statistical analysis; however, the major proportion is related to descriptive
analysis only. Of these, 2552 articles were selected for analysis, and the
findings suggest that:
\begin{itemize}
\tightlist
\item
  only \textbf{1.3\%} of articles made use of inferential statistics
\item
  5.8\% of articles used a mix of descriptive and inferential statistics
\item
  14.7\% of articles used descriptive statistics and
\item
  78.1\% of articles did not use any kind of statistical analysis
\end{itemize}
\begin{figure}
\includegraphics[width=1\linewidth]{figure/research_stats} \caption{Use of statistics in terrorism research from 2007 to 2016}\label{fig:stats1}
\end{figure}
(Schuurman, 2018)
\subsection{Review of relevant
literature}\label{review-of-relevant-literature}
In this section, we take a look at previous research aimed at
counterterrorism support while making sure that each chosen
research article/ piece of literature contains at least some form of statistical
modeling.
Simple linear regression was one of the approaches used for prediction models
in the early days, but it was soon realized that such models are weak at
capturing complex interactions. The emergence of machine learning
algorithms and advancements in deep learning made it possible to develop
fairly complex models; however, country-level analysis at yearly
resolution still makes up the majority of research work in conflict prediction
(Cederman \& Weidmann, 2017).
(Beck, King, \& Zeng, 2000) carried out research to stress the
importance of the causes of conflict. The researchers claim that empirical
findings in the literature on global conflict are often unsatisfying,
and accurate forecasts are unrealistic despite the availability of immense data
collections, notable journals, and complex analyses. Their approach uses
a version of a neural network model, and they argue that their forecasts are
significantly better than previous efforts.
In a study investigating the factors that explain when terrorist groups
are most or least likely to target civilians, researcher (Heger, 2010)
examines why terrorist groups need community support and introduces new
data on terrorist groups. The research then uses logit analysis to test
the relationship between independent variables and civilian attacks
between 1960 and 2000.
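As a minimal sketch of what a logit specification of this kind looks like (the notation here is illustrative and is an assumption, not the exact model reported by Heger, 2010), let $y_i = 1$ if an observation $i$ involves an attack on civilians and $y_i = 0$ otherwise, and let $x_{i1}, \ldots, x_{ik}$ denote hypothetical independent variables:
\[
\Pr(y_i = 1 \mid x_{i1}, \ldots, x_{ik}) = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right)}
\]
Equivalently, writing $p_i$ for this probability, the log-odds are linear in the variables:
\[
\log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}
\]
Under this form, a positive coefficient $\beta_j$ means that higher values of $x_{ij}$ are associated with higher odds of a civilian attack, holding the other variables fixed.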
In a unique and interesting approach, a researcher from ETH Zürich
(Chadefaux, 2014) examines a comprehensive dataset of historical
newspaper articles and introduces a weekly risk index. This new variable
is then applied to a dataset of all wars reported since 1990. The
outcome of this study suggests that the number of conflict-related news
items increases dramatically prior to the onset of conflict. The researcher
claims that the onset of a war within the next few months could be
predicted with up to 85\% confidence using only information available at
the time. Researchers (Cederman \& Weidmann, 2017) support this
hypothesis and suggest that news reports are capable of capturing
political tension at a much higher temporal resolution, so that such
variables have much stronger predictive power for war onset compared to
traditional structural variables.
One notable (and publicly known) study in this field
predicted the military coup in Thailand 1 month before its actual
occurrence on 7 May 2014. In a report commissioned by the CIA-funded
Political Instability Task Force, researchers (Ward Lab, 2014)
forecasted irregular regime changes, covering coups, successful protest
campaigns, and armed rebellions, for 168 countries around the world for
the 6-month period from April to September 2014. The researchers claim that
Thailand was number 4 on their forecast list. They used an ensemble
model that combines seven different split-population duration models.
Researchers (Fujita, Shinomoto, \& Rocha, 2016) use high temporal
resolution data across multiple cities in Syria and a time-series
forecasting method to predict future deaths in the Syrian armed
conflict. Their approach uses day-level data on death tolls from the
Violations Documentation Centre (VDC) in Syria. Using Auto-regression
(AR) and Vector Auto-regression (VAR) models, their study identifies
strong positive auto-correlations in Syrian cities and non-trivial
cross-correlations across some of them. The researchers suggest that strong
positive auto-correlations possibly reflect a sequence of attacks
within short periods triggered by a single attack, while the
significant cross-correlations in some of the Syrian cities imply that
deaths in one city were accompanied by deaths in another city.
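For reference, the generic textbook forms of these two models can be written as follows. The symbols are assumptions chosen for illustration and are not necessarily the exact specification used by Fujita et al. Let $y_t$ be the death count in one city on day $t$ and let $\mathbf{y}_t$ be the vector of death counts across $n$ cities. An AR model of order $p$ is
\[
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t
\]
and a VAR model of order $p$ is
\[
\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} A_i \, \mathbf{y}_{t-i} + \mathbf{u}_t
\]
where $\varepsilon_t$ and $\mathbf{u}_t$ are noise terms and each $A_i$ is an $n \times n$ coefficient matrix. In this form, the diagonal entries of $A_i$ play the role of the AR coefficients $\phi_i$ for each city, while the off-diagonal entries describe how past deaths in one city relate to current deaths in another, which is the kind of cross-city dependence discussed above.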
Within a pattern recognition context, researchers (Klausen, Marks, \&
Zaman, 2016) from MIT Sloan developed a behavioural model to predict
which Twitter users are likely to belong to the Islamic state group.
Using data on approximately 5,000 Twitter users who were linked with
Islamic state group members, they created a dataset of 1.3 million users
by associating friends and followers of the target users. At the same time,
they monitored Twitter over a few months to identify which profiles were
getting suspended. The researchers claim that they were able to train a
machine learning model that matched suspended accounts with the
specifics of the profile, creating a framework to identify likely
members of ISIL.
A similar study by (Ceron, Curini, \& Iacus, 2018) examines over 25
million tweets in the Arabic language from when Islamic State was at its peak
strength (between Jan 2014 and Jan 2015) and was expanding the regions under
its control. The researchers assessed the share of support from the online
Arab community toward ISIS and investigated the time-granularity of
tweets while linking the tweet opinions with daily events and the
geolocation of tweets. The outcome of their research finds a
relationship between foreign fighters joining ISIS and online opinions
across the regions.
One study evaluates the targeting patterns and preferences
of 480 terrorist groups that were operational between 1980 and 2011 in
order to examine the longevity of terrorist groups based on
their lethality. Based on group-specific case studies on the Afghan and
Pakistani Taliban and the Harmony Database from the Combating Terrorism Center,
researcher (Nawaz, 2017) uses a Bivariate Probit Model to assess the
endogenous relationship and finds a significant correlation between
negative group reputation and group mortality. The researcher also uses a
Cox Proportional Hazard Model to estimate the longevity of a group.
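As a brief illustration of the general Cox proportional hazards form (standard textbook notation, with covariates that are hypothetical here rather than those used by Nawaz, 2017), the hazard of a group ceasing to exist at time $t$, given covariates $x_1, \ldots, x_k$, is modeled as
\[
h(t \mid x_1, \ldots, x_k) = h_0(t) \, \exp\left(\beta_1 x_1 + \cdots + \beta_k x_k\right)
\]
where $h_0(t)$ is an unspecified baseline hazard shared by all groups. The quantity $\exp(\beta_j)$ is the hazard ratio for a one-unit increase in $x_j$: a value above 1 indicates a higher risk of group termination at any given time (shorter expected longevity), and a value below 1 indicates a lower risk.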
(Colaresi \& Mahmood, 2017) carried out research to identify and avoid
the problem of overfitting sample data. The researchers used models of
civil war onset data and came up with a tool (R package: ModelCriticism)
to illustrate how a machine learning based research design can improve
out-of-fold forecasting performance. Their study recommends making use of a
validation split along with the train and test splits to benefit from
iterative model criticism.
Researchers (Muchlinski, Siroky, He, \& Kocher, 2016) use the Civil
War Data (1945-2000) and compare the performance of a Random Forests
model with three different versions of logistic regression. The outcome
of their study suggests that the random forest model provides significantly
more accurate predictions of the occurrence of rare events in out-of-sample
data compared to logistic regression models on the chosen dataset.