<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>How to be good at Operations</title>
<meta name="description" content="How to be Good at Operations">
<meta name="author" content="Adam Jacob">
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<link rel="stylesheet" href="css/reveal.min.css">
<link rel="stylesheet" href="css/theme/default.css" id="theme">
<link rel="stylesheet" href="css/logo.css">
<!-- For syntax highlighting -->
<link rel="stylesheet" href="lib/css/zenburn.css">
<!-- If the query includes 'print-pdf', include the PDF print sheet -->
<script>
if( window.location.search.match( /print-pdf/gi ) ) {
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = 'css/print/pdf.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
}
</script>
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
<style>
span.green {
color: #b3e2cd;
}
span.orange {
color: #fbcdac;
}
</style>
</head>
<body>
<div class="reveal">
<div class="slides">
<section>
<h1>How to be good at Operations</h1>
<h3>in 40 minutes</h3>
<p>
Created by <a href="mailto:[email protected]">Adam Jacob</a>
/
<a href="http://twitter.com/adamhjk">@adamhjk</a>
</p>
<a href="http://getchef.com">
<img alt="chef logo" src="images/chef-logo.svg" style="background: none; border: none; vertical-align:middle; outline: none;"/>
</a>
<p>
<small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
<br/>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
<p><a href="http://github.com/adamhjk/good-at-ops">Fork This Presentation on Github</a></p>
</small>
</section>
<section>
<h1>sur·vey</h1>
<p>
noun,
sərˈvā/
<blockquote cite="http://google.com/search?define:survey">
1. a general view, examination, or description of someone or something.
"the author provides a survey of the relevant literature"
</blockquote>
<img src="images/survey.jpg" alt="survey manhole" width="400"/>
<aside class="notes">
<ul>
<li>Each topic worthy of 40 minutes itself</li>
<li>Goal is to teach the <b>reason</b> for things</li>
<li>So you make better implementation choices</li>
</ul>
</aside>
</section>
<section data-background="images/operations.jpg">
<div style="background-color:#000000; opacity: 0.9; padding:20px">
<h1>What is Operations?</h1>
<h3>By which we mean <i>technical</i> operations</h3>
<blockquote>
The work of building and maintaining computer systems, networks, and applications.
</blockquote>
<small>
<a href="https://flic.kr/p/7WWtVV">Original Image</a>
</small>
</div>
<aside class="notes">
<ul>
<li>The definition covers everyone</li>
<li>This is why "devops" is obvious and the new normal</li>
</ul>
</aside>
</section>
<section>
<h1>How to be good at Operations</h1>
<p>Design to improve the <span style="color:#a6cee3">safety</span>, <span style="color:#fb9a99">contentment</span>, <span style="color:#ffff99">knowledge</span> and <span style="color:#cab2d6">freedom</span> of your colleagues and users.</p>
<p>Focus on improving <span style="color:#7fc97f">availability</span> through reducing MTTD and MTTR.</p>
<p>Improve the organization's <span style="color:#fdc086">efficiency</span> through improvements in People, Process, and Technology.</p>
</section>
<section>
<section>
<h2>Done well, Operations enhances the <span style="color:#7fc97f">safety</span>, <span style="color:#beaed4">contentment</span>, <span style="color:#fdc086">knowledge</span> and <span style="color:#ffff99">freedom</span> of both the authors and users of the system.</h2>
<aside class="notes">
<ul>
<li>Design is fundamental</li>
<li>Each choice you make needs to make life better for the humans involved</li>
<li>That also leads to better business outcomes, as we'll learn later</li>
<li>Ultimately, the most scalable, fastest systems are also the ones that are best for the humans involved, most of the time</li>
</ul>
</aside>
</section>
<section data-background="images/safety.jpg">
<div style="background-color:#000000; opacity:0.90; padding:20px;">
<h1>Safety</h1>
<ul>
<li>Human safety</li>
<li>Information safety</li>
<li>Availability of the system as a possible link to both</li>
<li>The ability for individuals to act without fear of unintended consequences</li>
</ul>
<small><p>Safety is a slider – different systems have different thresholds</p><p><a href="https://flic.kr/p/nBZ9Mx">Original Photo</a></small>
</div>
<aside class="notes">
<ul>
<li>Imagine you were early days at twitter</li>
<li>The system wasn't human safety critical, in your mind</li>
<li>Until it became a source of human safety and communication during countless revolutions</li>
</ul>
</aside>
</section>
<section data-background="images/contentment.jpg">
<div style="padding:20px; background-color:#000000; opacity:0.90;">
<h1>Contentment</h1>
<p>Contentment is about being satisfied with what you have.</p>
<p>The state of our systems is often a source of deep discontent :)</p>
<p>It may not make you happier – but it won’t hurt</p>
<blockquote>
Happiness is not a goal – it’s a by-product of a life well lived
<br/>- Eleanor Roosevelt
</blockquote>
<small>
<a href="https://flic.kr/p/72AXzX">Original Image</a>
</small>
</div>
<aside class="notes">
<ul>
<li>Happiness is fleeting</li>
<li>If you are in trouble, contentment helps you make better decisions</li>
<li>Think about a brutal on call week - if the systems that support you are good, you survive</li>
</ul>
</aside>
</section>
<section data-background="images/knowledge.jpg">
<div style="padding:20px; background-color:#000000; opacity:0.90;">
<h1>Knowledge</h1>
<p>Access to knowledge is a leading indicator of social progress.</p>
<p>We should be making it easier to understand what the system is for, why we need it, and what good outcomes are.</p>
<p>The goal isn’t to minimize needed knowledge – it’s to provide access to the wealth of it, when we need it.</p>
<small>
<a href="https://flic.kr/p/51wXrK">Original Image</a>
</small>
</div>
<aside class="notes">
<ul>
<li>The right knowledge, at the right time</li>
<li>Think about PaaS - it's awesome you just git push</li>
<li>Until you are Rap Genius and Heroku changes the router and everything sucks and you don't know why</li>
<li>Which doesn't make PaaS awful - at a different level of criticality, who cares?</li>
</ul>
</aside>
</section>
<section data-background="images/freedom.jpg">
<div style="padding:20px; background-color:#000000; opacity:0.90;">
<blockquote>
The power or right to act, speak, or think as one wants without hindrance or restraint.<br/> – The Internet
</blockquote>
<p>We should be empowering ourselves and others to act, speak, and think as they need to with less hindrance.</p>
<small><a href="https://flic.kr/p/5anoq">Original Image</a></small>
</div>
<aside class="notes">
<ul>
<li>The Big Web got this right</li>
<li>Empower individuals to work as they see fit</li>
<li>Trust them to do the right things</li>
<li>Build systems that increase the trust needed to allow more freedom</li>
</ul>
</aside>
</section>
<section>
<h1>Safety</h1>
<hr/>
<h1>Contentment</h1>
<hr/>
<h1>Knowledge</h1>
<hr/>
<h1>Freedom</h1>
<aside class="notes">
<ul>
<li>We will come back to these throughout</li>
</ul>
</aside>
</section>
</section>
<section>
<section data-background="/images/operations.png">
<h1>Being good at Operations</h1>
Means being good at two things
</section>
<section>
<h1>Availability</h1>
<hr/>
<h1>Efficiency</h1>
<aside class="notes">
<ul>
<li>Availability: Is the system down? Bring it back up.</li>
<li>Efficiency: Make the effort required to do work <span style="color:#00FF00">easier</span>.</li>
<li>The work here is building and maintaining computers, networks, and applications</li>
<li>So efficiently doing that covers damn near everything</li>
</ul>
</aside>
</section>
<section>
<h1>Focus on Availability</h1>
<hr/>
<h1>Efficiency Follows</h1>
<aside class="notes">
<ul>
<li>Availability shows where you need to be most efficient <b>now</b></li>
<li>It's a virtuous cycle</li>
</ul>
</aside>
</section>
</section>
<section>
<section data-background="images/availability.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Availability</h1>
$$Availability = \frac{Uptime}{(Uptime + Downtime)}$$
<small>
Many thanks to <a href="http://twitter.com/postwait">Theo Schlossnagle</a>, <a href="http://twitter.com/allspaw">John Allspaw</a>, <a href="http://twitter.com/patrickdebois">Patrick Debois</a>, and others for informing
much of this section. Mistakes are mine.
</small>
</div>
</section>
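<section>
<h3>Availability: a worked example</h3>
<p>As a rough sketch with made-up numbers, assume a 30-day month (720 hours) in which the system was down for 3 hours:</p>
$$Availability = \frac{717}{(717 + 3)} = \frac{717}{720} \approx 99.58\%$$
<p><small>The 720-hour month and the 3 hours of downtime are illustrative assumptions, not measurements.</small></p>
</section>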
<section>
<h1>Availability is everybody's problem</h1>
<aside class="notes">
<ul>
<li>There is no team that owns availability - other than the company itself</li>
<li>The problems are too big</li>
</ul>
</aside>
</section>
<section data-background="images/nines.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>The 9's</h1>
<table>
<thead>
<tr>
<th>Availability</th>
<th>Downtime per month</th>
</tr>
</thead>
<tbody>
<tr>
<td>90% (one nine)</td>
<td>72 hours</td>
</tr>
<tr>
<td>99% (two nines)</td>
<td>7.2 hours</td>
</tr>
<tr>
<td>99.9% (three nines)</td>
<td>43.2 minutes</td>
</tr>
<tr>
<td>99.99% (four nines)</td>
<td>4.32 minutes</td>
</tr>
<tr>
<td>99.999% (five nines)</td>
<td>25.9 seconds</td>
</tr>
</tbody>
</table>
<small><a href="https://flic.kr/p/9qX72r">Original Image</a></small>
</div>
<aside class="notes">
<ul>
<li>The difference in magnitude matters - days, hours, half hours, minutes, seconds</li>
<li>To achieve higher levels, everything has to get more precise</li>
<li>Know your target, and communicate it</li>
<li>It probably isn't five nines</li>
</ul>
</aside>
</section>
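<section>
<h3>Where the downtime numbers come from</h3>
<p>A sketch of the arithmetic, assuming a 30-day month of 720 hours (a different month length shifts the numbers slightly):</p>
$$Downtime = (1 - Availability) \times 720\ \text{hours}$$
$$99.99\%: \quad 0.0001 \times 720\ \text{hours} = 0.072\ \text{hours} = 4.32\ \text{minutes}$$
$$99.999\%: \quad 0.00001 \times 720\ \text{hours} = 0.0072\ \text{hours} \approx 25.9\ \text{seconds}$$
</section>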
<section>
<h1>The M's</h1>
<ul>
<li><span class="fragment highlight-current-green">Mean Time To Failure (MTTF) ↑</span>
<br>The average time the system behaves correctly</br></li>
<li><span class="fragment highlight-current-green">Mean Time To Diagnose (MTTD) ↓</span>
<br>The average time it takes to diagnose the problem</br>
</li>
<li><span class="fragment highlight-current-green">Mean Time To Repair (MTTR) ↓</span>
<br>The average time it takes to fix a problem</br>
</li>
<li><span class="fragment highlight-current-green">Mean Time Between Failures (MTBF) ↑</span>
<br>The average time between failures</br></li>
</ul>
<img src="images/the_m_s.svg" alt="The M*s"/>
<aside class="notes">
<ul>
<li>We want to decrease MTTD and MTTR</li>
<li>And increase MTTF and MTBF</li>
</ul>
</aside>
</section>
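<section>
<h3>How the M's add up</h3>
<p>One common way to relate the M's, offered here as an assumption and not necessarily the exact definition behind the diagram, is to treat each failure cycle as time working, plus time diagnosing, plus time repairing:</p>
$$MTBF = MTTF + MTTD + MTTR$$
$$Availability = \frac{MTTF}{MTBF}$$
<p>With illustrative numbers (MTTF of 718 hours, MTTD of 0.5 hours, MTTR of 1.5 hours), and then with diagnosis and repair time cut in half:</p>
$$\frac{718}{718 + 0.5 + 1.5} = \frac{718}{720} \approx 99.72\% \qquad \frac{718}{718 + 0.25 + 0.75} = \frac{718}{719} \approx 99.86\%$$
<p><small>Under these assumptions, failures happen just as often, yet availability improves because each one is shorter.</small></p>
</section>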
<section data-background="images/focus.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Focus your efforts</h1>
<p>
On reducing <span style="color:#fdae6b">Mean Time to Diagnose</span> and <span style="color:#fdae6b">Mean Time to Repair</span>.
</p>
<p>
Failure is <span style="color:#f03b20">inevitable</span> - it's how you detect and react that matter most to availability.
</p>
<small><a href="https://flic.kr/p/6P7aT7">Original Image</a></small>
</div>
<aside class="notes">
<ul>
<li>All systems fail</li>
<li>Fear of failure is the greatest killer of availability</li>
</ul>
</aside>
</section>
<section>
<h1>Slow and ponderous</h1>
<hr/>
<h1>Fast and nimble</h1>
<aside class="notes">
<ul>
<li>Online banking is a huge thing for consumer banks</li>
<li>I met with one that has 5 9's of availability</li>
<li>They achieved this through changing the website once every 6 months</li>
<li>After a torture chamber of hate and pain</li>
<li>They were not better at diagnosis and repair - they were good at MTBF, and lucky</li>
<li>Contrast that with a more nimble org, who might have more frequent outages (say scheduled maintenance once a week)</li>
<li>But the system improves week over week</li>
<li>Raise your hand which one you want!</li>
<li>It's safer, increases human contentment, is easier to reason about, and frees people up</li>
</ul>
</aside>
</section>
<section data-background="images/metrics.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Diagnose</h1>
<h3>Metrics Collection</h3>
<p>Collect metrics from the <span style="color:#b3e2cd">operating system</span>,
<span style="color:#fdcdac">network</span>, and <span style="color:#cbd5e8">applications</span>.</p>
<p>High <span style="color:#fdae6b">resolution matters</span>!</p>
<p>As few systems as possible.</p>
<small><a href="https://flic.kr/p/48Lpmu">Original Image</a></small>
</div>
<aside class="notes">
<ul>
<li>You can't fix what you can't see</li>
<li>Metrics resolution has direct impact on MTTD</li>
</ul>
</aside>
</section>
<section data-background="images/money.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Diagnose</h1>
<h3>Two Critical Metrics</h3>
<ol>
<li> <span style="color:#fbcdac">Is it up</span> - from a user's perspective </li>
<li> <span style="color:#b3e2cd">Is it making money</span> </li>
</ol>
<p><small><br/><a href="https://flic.kr/p/82xTtv">Original Image</a></small></p>
</div>
<aside class="notes">
<ul>
<li>One binary metric - can your users use your stuff</li>
<li>Money is often a trailing indicator of deeper systemic problems that are hard to see</li>
<li>I helped run an ad network back in the day, and the hour-by-hour money graph was the fastest way to see if we were letting people run over cap</li>
<li>Money graph also helps you justify other activity!</li>
</ul>
</aside>
</section>
<section>
<h1>Diagnose</h1>
<h3>Graphing, Trends and Analysis</h3>
<p>Use graphs to understand normal behavior.</p>
<p><img src="images/boringtrend.png" alt="boring trend"></p>
<p>
<small>
<a href="http://omniti.com/seeds/dissecting-todays-internet-traffic-spikes">Graphs taken from Theo Schlossnagle and OmniTI</a>
</small>
</p>
<aside class="notes">
<ul>
<li>Let's say this is puppy.com - the prime source for puppy news</li>
<li>Nice, easy content day - 70% utilization, smooth peaks and valleys</li>
</ul>
</aside>
</section>
<section>
<img src="images/doge.jpg" width="200px"/>
<img src="images/taco-bell.jpeg" width="270px"/>
<aside class="notes">
<ul>
<li>The Doge dog beats up the taco bell chihuahua outside the most posh dog park in LA</li>
<li>Puppy.com has the exclusive video</li>
</ul>
</aside>
</section>
<section>
<h1>Diagnose</h1>
<h3>Graphing, Trends and Analysis</h3>
<p>Use graphs to understand abnormal behavior.</p>
<p><img src="images/spikesdissected.png" alt="spikes dissected"></p>
<p><small>
<a href="http://omniti.com/seeds/dissecting-todays-internet-traffic-spikes">Graphs taken from Theo Schlossnagle and OmniTI</a>
</small>
</p>
<aside class="notes">
<ul>
<li>The New York Times picks it up, and adds long exposure traffic</li>
<li>Digg shows up, and it goes to 11</li>
<li>Happens in 60 seconds!</li>
</ul>
</aside>
</section>
<section>
<h1>Auto-Scaling Will Not Save You</h1>
<aside class="notes">
<ul>
<li>Either you design for this load, or you fail to meet the expectations</li>
<li>The right answer here is serve puppy.com from behind fast.ly :)</li>
</ul>
</aside>
</section>
<section data-background="images/capacity.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Capacity Planning</h1>
<ol>
<li>Identify key metrics</li>
<li>Put them on a graph</li>
<li>Set a limit</li>
<li>Plot a trend line</li>
<li>Expand your time horizon</li>
</ol>
<p>
<small><br/><a href="https://flic.kr/p/hD7JTZ">Original Image</a></small></p>
</div>
</section>
<section>
<h1>Capacity Planning</h1>
<img src="images/linear-regression.png" alt="linear regression"/>
<aside class="notes">
<ul>
<li>Do this on a regular cadence - monthly, etc.</li>
<li>Show your R-squared - think of it as a confidence number</li>
<li>This could be any metric that matters for your system</li>
<li>This is the number one source of trivially preventable outages</li>
</ul>
</aside>
</section>
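<section>
<h3>Capacity Planning: the trend line</h3>
<p>The trend line could be fitted in many ways. A minimal sketch, assuming ordinary least squares on a metric <i>y</i> sampled at times <i>t</i>, with a hard limit <i>L</i> (all symbols here are illustrative, not taken from the graph):</p>
$$\hat{y}(t) = a + b\,t, \qquad b = \frac{\sum_i (t_i - \bar{t})(y_i - \bar{y})}{\sum_i (t_i - \bar{t})^2}, \qquad a = \bar{y} - b\,\bar{t}$$
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}(t_i))^2}{\sum_i (y_i - \bar{y})^2}$$
$$t_{limit} = \frac{L - a}{b}$$
<p><small>An R-squared near 1 means the line explains most of the variation. The projected time of hitting the limit only makes sense when the slope is positive and the growth is roughly linear.</small></p>
</section>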
<section data-background="images/alert.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Diagnose</h1>
<h3>Alerts</h3>
<p>Get the attention of the right humans.</p>
<ul>
<li>As <span style="color:#b3e2cd;" class="green">few alerts as possible</span></li>
<li>Routed to the <span style="color:#fbcdac;" class="orange">people who can take action</span></li>
<li>Start with the <b>is it up</b> alert</li>
<li>Never create an alert that isn't actionable!</li>
</ul>
</div>
<aside class="notes">
<ul>
<li>There is nothing more disrespectful than waking someone up for shit they can't fix</li>
<li>It's happening... it's happening... again</li>
</ul>
</aside>
</section>
<section data-background="images/f15.jpg">
<h1>Repair</h1>
<h3>Incident Response</h3>
<p>
<img src="images/ooda.svg" alt="ooda"/>
</p>
<small><a href="https://flic.kr/p/84f5uq">Original Image</a></small>
<aside class="notes">
<ul>
<li>Observe: what's going on</li>
<li>Orient: put what's going on in context of what you know about the system, people, and dynamics</li>
<li>Decide: what to do next</li>
<li>Act: take action</li>
<li>Originally for fighter pilots to get inside the heads of the enemy</li>
<li>A faster loop means success in combat</li>
<li>This is the same pattern for responding to operations availability issues</li>
</ul>
</aside>
</section>
<section data-background="images/orient.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Repair</h1>
<h3>Orient</h3>
<p>
<span style="color:#b3e2cd;">Orient</span> is the step we often fail at.
</p>
<p>
<b>Thinking</b> is the best tool we have in incident response.
</p>
<p>
Understanding more about the system, and how each piece behaves, is what separates the good from the great.
</p>
<p>
<a href="http://www.informit.com/articles/article.aspx?p=1941206">What Rob Pike learned from Ken Thompson</a>
</p>
</div>
<aside class="notes">
<ul>
<li>In fighter jets, knowing typical behavior, jets, and culture was crucial</li>
<li>Rob Pike and Ken Thompson working on a visual language</li>
<li>Rob typed faster, so he was at the keyboard</li>
<li>Rob attacked bugs, Ken thought about it</li>
<li>Ken was <b>orienting</b> better</li>
<li>Unlike a fighter jet, he had time :)</li>
</ul>
</aside>
</section>
<section>
<h1>Repair</h1>
<h3>Incident Command</h3>
<p>
The First Responder is the default <span style="color:#b3e2cd;">Incident Commander</span>
</p>
<ol>
<li> Decides what to do next </li>
<li> Coordinates resources </li>
<li> Can hand off command </li>
<li> Communicates status </li>
<li> Not about <i><b>rank</b></i></li>
</ol>
<p>
There is only <span style="color:#fbcdac;">ONE</span> Incident Commander.
</p>
<p>
<small>
This isn't always true in real Incident Command, but go with it.
</small>
</p>
<aside class="notes">
<ul>
<li>When it gets bigger than one person can handle, we flip to this</li>
<li>Knowing we have a process and a command structure makes it easier to OODA</li>
<li>And faster loops mean faster resolution</li>
</ul>
</aside>
</section>
<section data-background="images/post-mortem.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Learn</h1>
<h3>Post Mortem</h3>
<p>
Incident Commander schedules a post mortem within 24 hours of incident resolution.
</p>
<p>
Purpose is to <span style="color:#b3e2cd;">learn from the incident</span>,
and <span style="color:#fbcdac;">identify the work</span> needed to:
</p>
<ul>
<li> Prevent recurrence (if necessary)</li>
<li> Improve Mean Time To Diagnose </li>
<li> Improve Mean Time To Repair </li>
</ul>
<p>
<small><br/><a href="https://flic.kr/p/w4dcU">Original Image</a></small>
</p>
</div>
<aside class="notes">
<ul>
<li>This should be the IC at the end of the incident</li>
</ul>
</aside>
</section>
<section>
<blockquote>
Progress on safety coincides with learning from failure. <span style="color:#b3e2cd;">This makes punishment
and learning two mutually exclusive activities: Organizations can either learn
from an accident or punish the individuals involved in it, but hardly do both
at the same time.</span> The reason is that punishment of individuals can protect
false beliefs about basically safe systems, where humans are the least reliable
components. Learning challenges and potentially changes the belief about what
creates safety. Moreover, <span style="color:#fbcdac;">punishment emphasizes that failures are deviant, that
they do not naturally belong in the organization</span>...</blockquote>
<small>Sidney W.A. Dekker, Ten Questions about Human Error: A New View of Human Factors and System Safety (Human Factors in Transportation)</small>
</section>
<section>
<h1>Learn</h1>
<h3>How to run a Post Mortem</h3>
<ol>
<li> Invoke the space: we are here to learn, not to blame </li>
<li> Describe the incident </li>
<li> Establish the timeline </li>
<li> Identify contributing factors </li>
<li> Describe customer impact </li>
<li> Describe remediation tasks for the root cause</li>
<li> Describe improvement tasks for response process</li>
</ol>
<aside class="notes">
<ul>
<li>We hold post mortems to learn and improve, not to blame and punish</li>
<li>Puppy.com went down when Digg linked to the Doge/Chihuahua story</li>
<li>Story gets posted at 8am PST, NYT picks it up at 8:15am PST, Digg posts at 8:30am PST</li>
<li>Site goes down at 8:30am, alert at 8:31am, diagnosed at 8:50am, more capacity launched on EC2 at 8:55am, online and resolved at 9:00am PST</li>
<li>The traffic load overwhelmed the Apache MPM worker configuration, and exhausted capacity</li>
<li>People could not watch the doge dog crush the chihuahua, and click ads</li>
<li>Launched more capacity. Long term remediation is to move static content to a CDN</li>
<li>We investigated a denial of service and backend database issues before we looked at traffic graphs. Add passive alert on traffic.</li>
</ul>
</aside>
</section>
<section>
<h1>Prioritize the outcomes</h1>
<aside class="notes">
<ul>
<li>The process works because you prioritize the outcomes</li>
<li>Our remediation steps are the <b>efficiency improvements you want</b></li>
<li>If you fail to act, or do other stuff, you're wasting the opportunity</li>
</ul>
</aside>
</section>
<section data-background="images/roundup.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Availability Roundup</h1>
<ul>
<li>Understand your Availability Targets</li>
<li>Track and understand your M*'s</li>
<li>Reduce time to detect and repair</li>
<li>Use capacity planning to avoid obvious incidents</li>
<li>Have an incident response and command process</li>
<li>Perform and publish post-mortems for every incident</li>
<li>Prioritize the outcomes</li>
</ul>
</div>
</section>
</section>
<section>
<section data-background="images/efficiency.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Efficiency</h1>
$$Efficiency = \frac{Output}{Effort}$$
<p>
Make the effort required to do work <span style="color:#00FF00">easier</span>.
</p>
<p>
<small><a href="https://flic.kr/p/hBiPJZ">Original Image</a></small>
</p>
</div>
</section>
<section>
<h1>People</h1>
<hr/>
<h1>Process</h1>
<hr/>
<h1>Technology</h1>
<aside class="notes">
<ul>
<li>3 areas for efficiency, in order of most potential for gains</li>
<li>Think about Puppy.com - if we didn't have the right people, if we didn't have a process for incidents, if we didn't have post mortems, the technology fixes wouldn't make a dent long term</li>
</ul>
</aside>
</section>
<section data-background="images/purpose.jpg">
<aside class="notes">
<ul>
<li> What is the mission? </li>
<li> How does your organization intend to fulfill it? </li>
<li> How do you contribute? </li>
<li> What are the stakes? </li>
<li> Knowing your purpose enables you to put decisions in context </li>
<li> The more context you have, the better your decision will be </li>
<li> Like a very long OODA loop </li>
</ul>
</aside>
</section>
<section data-background="images/people.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Know the people</h1>
<ul>
<li> Software Developers </li>
<li> Business Decision Makers </li>
<li> Systems and Network Administrators </li>
<li> Marketing and PR </li>
<li> Sales </li>
<li> Legal </li>
</ul>
<p>
<small><br/><a href="https://flic.kr/p/6wjwMP">Original Image</a></small></p>
</div>
<aside class="notes">
<ul>
<li> Trust is crucial to effective operations </li>
<li> Knowing people is crucial to trusting them </li>
<li> Set up lunch dates </li>
<li> Talk about your lives </li>
<li> <b>THIS IS WHERE DEVOPS COMES FROM</b></li>
<li> John Allspaw and Paul Hammond are <b>friends</b></li>
</ul>
</aside>
</section>
<section>
<img alt="Thich Nhat Hanh" src="images/Thich-Nhat-Hanh.jpg" width="200px"/>
<blockquote>
When they create electronic devices, they can reflect on
whether that new product will take people away from themselves,
their family and nature. Instead they can create the kind of
devices and software that can help them go back to themselves, to
take care of their feelings. By doing that, they will feel good
because they’re doing something good for society.
<br/>
- Thich Nhat Hanh at Google
</blockquote>
<aside class="notes">
<ul>
<li> The way we do our work informs our lives </li>
<li> Having good lives improves the quality of our work in every dimension </li>
<li> We are blessed to be the architects of our environment </li>
<li> Let's back Thay up with data </li>
</ul>
</aside>
</section>
<section>
<h1>People</h1>
<h2>Engaged Workers Rule</h2>
<img src="images/engagement-outcomes.png" alt="Engaged Workers"/>
<small>
Stats in this section come from asking 25 million employees the same 12 questions in
<a href="http://www.gallup.com/strategicconsulting/163007/state-american-workplace.aspx">Gallup's state of the American Workplace</a>
with causality evidence from <a href="http://pps.sagepub.com/content/5/4/378">Causal Impact of Employee Work Perceptions on the Bottom Line of Organizations</a>.
</small>
<aside class="notes">
<ul>
<li>Gallup has been running this study since the 90s</li>
<li>They have proven the impact engaged workers have is causal</li>
<li>What other single thing could you possibly do that has a 22% impact on profitability?</li>
<li>21% impact on productivity!</li>
<li>65% less turnover!</li>
<li>Or a 41% impact on defects! Happy people care about their work more</li>
<li>It's the most critical operations efficiency task</li>
</ul>
</aside>
</section>
<section data-background="images/luxury.jpg">
<div style="background-color:#000000; opacity:0.80; padding:20px;">
<h1>Sources of Engagement</h1>
<ol>
<li>Clear expectations</li>
<li>Opportunity to shine</li>
<li>Praise</li>
<li>Having people care about you</li>
<li>Having your opinions count</li>
<li>A mission that makes you feel important</li>
<li>Commitment to quality</li>
</ol>
<p><small><br/><a href="https://flic.kr/p/bSr87x">Original Image</a></small></p>
</div>
<aside class="notes">
<ul>
<li>Repetition, Repetition, Repetition</li>
<li>Training people is like training cats - you gotta be on that</li>
</ul>
</aside>
</section>
<section data-background="images/chronic.jpg">
<div style="background-color:#000000; opacity:0.99; padding:20px;">
<h1>Assholes</h1>
<h3>Know your Asshole</h3>
<ol>
<li>After encountering them, people feel oppressed, humiliated, or otherwise worse about themselves </li>
<li>They target people less powerful than them </li>
</ol>
Chronic assholes are the problem.
<small>
Sections on Assholes taken from <a href="http://smile.amazon.com/Asshole-Rule-Civilized-Workplace-Surviving-ebook">The No Asshole Rule</a>.
</small>
<aside class="notes">
<ul>
<li> Not talking about a bad day - these people are out to undo all the good engaged people do</li>
</ul>
</aside>
</div>
</section>
<section data-background="images/inefficient.jpg">
<div style="background-color:#000000; opacity:0.80; padding:20px;">
<h1>Assholes are inefficient</h1>
<p>Positive interactions must outnumber negative ones 5:1</p>
<p>Bad interactions have stronger, more pervasive, and longer lasting effects</p>
<small>
Findings from <a href="http://liberalorder.typepad.com/the_liberal_order/files/bad_apples_rob.pdf">How, when, and why bad apples spoil the barrel:
Negative group members and dysfunctional groups.</a>
</small>
</div>
<aside class="notes">
<ul>
<li>Pick someone out, insult them gently, then compliment them</li>
<li>Point out this is what they will remember from this talk, forever</li>
</ul>
</aside>
</section>
<section>
<h1>What you can do</h1>
<ul>
<li> Don't be an Asshole, and fire or shun those who are</li>
<li> Set clear expectations for others </li>
<li> Praise people </li>
<li> Make friends with, and care about, your co-workers </li>
<li> Listen to each other </li>
<li> Take pride in your work </li>
</ul>
</section>
<section data-background="images/process.jpg">
<h1>Process</h1>
<h3>The way we work is critical to our outcomes</h3>
<p><small><br/><a href="https://flic.kr/p/o2sqPt">Original Image</a></small></p>
</section>
<section data-background="images/kaizen.jpg">
<div style="background-color:#000000; opacity:0.90; padding:20px;">
<h1>Kaizen</h1>
<h2>改善</h2>
<blockquote>Change for the better</blockquote>
<p>Continuous Improvement</p>
<small>
A few lean/improvement resources: <a href="http://www.amazon.com/Lean-Thinking-Banish-Create-Corporation-ebook/">Lean Thinking</a>, <a href="http://www.amazon.com/The-Goal-Process-Ongoing-Improvement-ebook">The Goal</a> - there are so many more.
</small>
</div>
</section>
<section>
<h1>Kaizen</h1>
<h2>Small improvements</h2>
<p>Evaluate a process, make it better.</p>
<p>Try using the scientific method:</p>
<ol>
<li>Ask a question</li>
<li>Do research</li>
<li>Construct a hypothesis</li>
<li>Test your hypothesis</li>
<li>Analyze data and draw a conclusion</li>
<li>Communicate your results</li>
</ol>
</section>
<section>
<h1>Kaizen</h1>
<h2>Anyone can do it</h2>
</section>
<section data-background="images/radical.jpg">
<div style="background-color:#000000; opacity:0.90; padding:20px;">
<h1>Kaikaku</h1>
<h2>Radical Change</h2>
<p>Recognize when desired results are beyond incremental improvement.</p>
<p>Start fresh, incorporate a new process, then do Kaizen</p>
</div>
<aside class="notes">
<ul>
<li>Continuous Delivery is a good example</li>
<li>If you are a big, waterfall org with manual testing</li>
<li>Incrementally moving to CD is going to fail</li>
<li>You need to blow up the way you work, learn how that feels, and kaizen your way to happiness</li>
<li>A house built on sand and all that</li>
</ul>
<p><small><br/><a href="https://flic.kr/p/ajkw47">Original Image</a></small></p>
</aside>
</section>
<section>
<h1>Technology</h1>
<h2>Systems Design</h2>
<h3>Understand the requirements</h3>
<p>
<img src="images/org_sucks.png" height="300" alt="This Org Sucks"/>
</p>
Do not mistake existing implementations for hard requirements
<aside class="notes">
<ul>
<li>Big retailer's web division, wanted to automate, I wanted to sell software</li>
<li>Asked how they felt about CD, said they weren't CD people</li>
<li>I was like: Me neither! ;)</li>
<li>They told me their design, said "then we come together and make it work"</li>
<li>We rebuilt it in that room, much better - not real requirements</li>
</ul>
</aside>
</section>
<section data-background="images/scalable.jpg">
<div style="background-color:#000000; opacity:0.90; padding:20px;">
<h1>Scalable Systems Design</h1>
<p>Identify autonomous actors, and have them keep their promises</p>
</div>
</section>
<section>
<h1>Rolling Upgrade</h1>
<img src="images/lb-example.svg" alt="Load Balanced Web Service"/>
<aside class="notes">
<ul>
<li>Traditional web servers behind a load balancer</li>
<li>Upgrade servers one at a time</li>
</ul>
</aside>
</section>
<section>
<img src="images/lb-example.svg" alt="Load Balanced Web Service"/>
<h2>Naive way</h2>
<ol>
<li> Take App1 out of the Load Balancer Pool</li>
<li> Update Software on App1</li>
<li> Verify update worked</li>
<li> Put App1 back into Load Balancer Pool</li>
</ol>
<small>
What happens if a server is down? What happens to traffic in transit? What if we die in the middle?
</small>
<aside class="notes">
<ul>
<li>This is what you would do if you wrote the steps down!</li>
<li>And it's what's going to happen in any case</li>
<li>But linearly implementing these as a script - whoa doggies</li>
<li>600 configuration changes to the load balancer!</li>
</ul>
</aside>
</section>
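<!-- Assumes the reveal.js Markdown plugin is loaded for this deck -->
<section data-markdown>
<script type="text/template">
## Naive way, as a script

A minimal sketch of the four steps written as a linear script. The `LoadBalancer` class, the `update` and `verify` callbacks, and the 300-server fleet are assumptions for illustration; the fleet size is picked so the count lines up with the 600 load balancer changes in the notes.

```python
class LoadBalancer:
    """In-memory stand-in for a load balancer pool (hypothetical)."""

    def __init__(self, servers):
        self.pool = set(servers)
        self.config_changes = 0

    def remove(self, server):
        self.pool.discard(server)
        self.config_changes += 1

    def add(self, server):
        self.pool.add(server)
        self.config_changes += 1


def naive_rolling_upgrade(lb, servers, update, verify):
    """Run the four steps from the slide, one server at a time."""
    for server in servers:
        lb.remove(server)        # 1. take the server out of the pool
        update(server)           # 2. update the software
        if not verify(server):   # 3. verify the update worked
            # The script dies here and the server stays out of the pool.
            raise RuntimeError(f"update failed on {server}")
        lb.add(server)           # 4. put the server back into the pool


servers = [f"app{i}" for i in range(1, 301)]  # assumed fleet size
lb = LoadBalancer(servers)
naive_rolling_upgrade(lb, servers, update=lambda s: None, verify=lambda s: True)
print(lb.config_changes)  # 600
```

Every server costs two load balancer changes, and a failure at any step leaves the pool in whatever state the script reached.
</script>
</section>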
<section>
<h1>Autonomous Actors</h1>
Each component responsible for itself
<hr/>
<h1>Promises</h1>
<p>Each Autonomous Actor <i>promises</i> to behave a certain way.</p>
<p>Other Actors can <i>verify</i> those promises.</p>
</section>
<section>
<img src="images/lb-example.svg" alt="Load Balanced Web Service"/>
<h2>Identify Autonomous Actors</h2>
<h3>Load Balancers</h3>
<p>Promise to route traffic to working app servers</p>
<h3>Application Servers</h3>
<p>Promise to serve application traffic and publish status</p>
</section>
<section>
<img src="images/lb-example-better.svg" alt="Load Balanced Web Service"/>
<h2>Better way</h2>
<ol>
<li>Update software on App1</li>
</ol>
<aside class="notes">
<ul>
<li>Add a service that is smart about the app's status to each server</li>
<li>Monitor that service with the load balancer</li>
<li>Upgrade process manages that service's response</li>
<li>Load balancer just blindly routes traffic</li>
<li>All the questions from the naive implementation can be answered by improvements to the status endpoint</li>
</ul>
</aside>
</section>
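<!-- Assumes the reveal.js Markdown plugin is loaded for this deck -->
<section data-markdown>
<script type="text/template">
## Better way, as a sketch

A rough sketch of the promise-based version, under stated assumptions: `AppServer`, `LoadBalancer`, `status()`, and the `install` callback are hypothetical names, and the status check is an in-process call standing in for whatever health endpoint the load balancer would actually poll.

```python
class AppServer:
    """Autonomous actor: serves traffic and publishes its own status."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.available = True

    def status(self):
        """Status endpoint: the promise other actors can verify."""
        return "available" if self.available else "unavailable"

    def upgrade(self, new_version, install=lambda: None):
        """The single operator step: update software on this server."""
        self.available = False   # stop promising before touching software
        install()                # if this raises, the server stays unavailable
        self.version = new_version
        self.available = True


class LoadBalancer:
    """Autonomous actor: routes only to servers whose status checks out."""

    def __init__(self, servers):
        self.servers = list(servers)

    def routable(self):
        return [s.name for s in self.servers if s.status() == "available"]


servers = [AppServer("app1", "1.0"), AppServer("app2", "1.0")]
lb = LoadBalancer(servers)

seen = []  # what the load balancer would route to mid-upgrade
servers[0].upgrade("2.0", install=lambda: seen.append(lb.routable()))
print(seen)           # [['app2']]
print(lb.routable())  # ['app1', 'app2']
```

The load balancer configuration never changes; a server that dies mid-upgrade simply keeps reporting unavailable and is routed around.
</script>
</section>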
<section>
<p>The better solution has fewer <span style="color:#7fc97f">interactions</span>.</p>
<p>But it has more <span style="color:#beaed4">pieces</span>.</p>
<aside class="notes">
<ul>
<li>We reduced the degree of difficulty in the process</li>
<li>Increased the number of moving parts</li>
<li>Safety: Resilient against many more failure modes </li>
<li>Knowledge: Far easier to reason about during Orient in the OODA loop </li>
<li>Freedom: Pattern adapts to different values of "available" based on service needs </li>