biostats_r_code/Week9_Notes.Rpres at master · alexfrieden/biostats_r_code · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
Week9_Notes
========================================================
author: Alexander Frieden
date: 4/4/2016
autosize: true

Motivating Question
========================================================
Suppose I have a set of variants found in a sick population $S$.  I also have another population of the variants found in the general population $G$.  How can I use $S$ and $G$ to infer which variants are deleterious (disease causing), and which are not.

New Tools
========================================================
To do this, we will need some new tools to do this


Fisher's Exact Test
========================================================
* The test is useful for categorical data that result from classifying objects in two different ways

* It is used to examine the significance of the association (contingency) between the two kinds of classification.

Fisher's Exact Test (part 2)
========================================================
* Chi Square Test is based on approximation that works best when $n$ is large

* We can conduct a hypothesis comparing two proprotions using Fisher's Exact Test.

* This test lets us compute the exact probability of the occurrence of the observed frequencies in the contingency table

* Technique is very useful in that we can do this for small $n$.

Basic FET Example (part 1)
========================================================
Set up data frame.
```{r}
challenge.df = matrix(c(1,4,7,4), nrow = 2)
```

Basic FET Example (part 2)
========================================================
Run Fisher's Exact Test
```{r}
fisher.test(challenge.df)
```

Basic FET Example (part 3)
========================================================
The p-value calculated for the test does not provide any evidence against the assumption of independence. In this example this means that we cannot confidently claim any difference in performance for the two challengers.


Drosophilia Example
=========================================================
McDonald and Kreitman (1991) sequenced the alcohol dehydrogenase gene in several individuals of three species of Drosophila. Varying sites were classified as synonymous (the nucleotide variation does not change an amino acid) or amino acid replacements, and they were also classified as polymorphic (varying within a species) or fixed differences between species.

The two nominal variables are thus synonymicity ("synonymous" or "replacement") and fixity ("polymorphic" or "fixed"). In the absence of natural selection, the ratio of synonymous to replacement sites should be the same for polymorphisms and fixed differences.

There were 43 synonymous polymorphisms, 2 replacement polymorphisms, 17 synonymous fixed differences, and 7 replacement fixed differences.

Drosophilia Example (part 2)
=========================================================
|              |  synonymous|  replacement|
|-------------:|------------:|-------------:|
| polymorphisms|          43 |             2|
|         fixed|          17 |             7|

The result is $P=0.0067$, indicating that the null hypothesis can be rejected; there is a significant difference in synonymous/replacement ratio between polymorphisms and fixed differences.

Note
=========================================================

A number of the following are taken from Andrea S. Foulkes "Applied Statistical Genetics with R"
Back to Hardy Weinberg
=========================================================
* Tests of Hardy Weinberg include Pearson's chi squared test.  The $\chi^2$ test is computationally advantageous.

* This is because it relies an asymptotic assumptions, hence why it likes large $n$.

* For example, when more than 20% of expected counts are less than five, it is better to use Fisher's Exact Test.

* For the following example, a homologous gene (or homolog) is a gene inherited in two species by a common ancestor

Back to Hardy Weinberg (part 2)
=========================================================
Consider the following:

![](pictures/homolog_table.png)

Back to Hardy Weinberg (part 3)
=========================================================

* This is a genotype table about a single locus.

* Here $n_{11}$ and $n_{22}$ be the number of inidividuals with AA and aa respectively.

* Note that Aa and aA are indistinguishable in population based investigations.

Back to Hardy Weinberg (part 4)
=========================================================

We can only get the observed sum

$$
n^*_{12} = n_{12} + n_{21}
$$

Back to Hardy Weinberg (part 5)
=========================================================
The expected counts correspond to these three observed counts

Let
$$
E_{11} = Np_A^2 \\
E_{12} = 2Np_A(1-p_A) \\
E_{22} = N(1-p_A)^2
$$

Where

$$
p_A = \frac{2n_{11} + n_{12}^*}{2N}
$$

Back to Hardy Weinberg (part 6)
=========================================================

We can then construct the chi-square statistic

$$
\tag{9.1}
\chi^2 = \sum_{(i,j)\in C} \frac{(O_{ij}-E_{ij})^2}{E_{ij}}
$$

Where $C$ is the set of our three alleles from the previous table.

Back to Hardy Weinberg (part 7)
=========================================================

* As with any chi-square test, we have a number of degrees of freedom.

* In this case we have a single degree of freedom because knowledge of the counts in one of the cells fully determines the counts in the rest of them, conditional on the marginal totals.

Back to Hardy Weinberg (part 8)
=========================================================
For example, suppose we know $n_{12}^*$.

Then we can get $n_{11}$ and $n_{22}$ from earlier summation.

Note that a statistically significant test result of HWE suggests that the SNP we are investigating is in Hardy-Weinberg *disequilibrium*. HWD is often called *non-random mating*

R comparison example (part 1)
=========================================================
* Suppose we are interested in testing for HWE for the SNP labeled **AKT1.C0756A** in HGDP (Human Genome Diversity Project)

* To do this, we need to calculate observed and expected genotype counts

R comparison example (part 2)
=========================================================

```{r}
hgdp <- read.delim("http://stat-gen.org/book.e1/data/HGDP_AKT1.txt", header=T, sep="\t")
attach(hgdp)
Akt1Snp1 <- AKT1.C0756A
ObsCount <- table(Akt1Snp1)
Nobs <- sum(ObsCount)
ObsCount
```

R comparison example (part 3)
=========================================================
```{r}
FreqC <- (2 * ObsCount[3] + ObsCount[2])/(2*Nobs)
ExpCount <- c(Nobs*(1-FreqC)^2,2*Nobs*FreqC*(1-FreqC),Nobs*FreqC^2)
ExpCount
```

R comparison example (part 4)
=========================================================
* In this example, vectors of observed and expected are for AA,CA, CC.

* The $\chi^2$-statistic is calculated using equation (9.1).

* We now implement it.

R comparison example (part 5)
=========================================================
```{r}
ChiSqStat <- sum((ObsCount - ExpCount)^2/ExpCount)
ChiSqStat
```

R comparison example (part 6)
=========================================================

Lets now compute the $\chi^2$-statistic needed for statistical significance at 5%.  We can compute this by looking at the quantile generated by $1-\alpha$ where $\alpha = 0.05$

```{r}
qchisq(1-0.05,df=1)
```

R comparison example (part 7)
=========================================================
* Since 6.93 > 3.84, based on this we would **reject** the null hypothesis of HWE.

* We would instead conclude that the alleles on two homologous chromosomes are associated with one another.

* Alternatively the **HWE.chisq()** function in the genetics packagr can be used to calculate this statistic.

R comparison example (part 8)
=========================================================
Use **genotype()** to create genotype objects
```{r}
install.packages("genetics", repos = "http://cran.us.r-project.org")
library(genetics)
Akt1Snp1 <- genotype(AKT1.C0756A, sep="")
```

R comparison example (part 9)
=========================================================
```{r}
HWE.chisq(Akt1Snp1)

```

R comparison example (part 10)
=========================================================

* Note that we are returned both a $\chi^2$ statistic and a p-value.  The p-value is still above the 0.05 that we are looking for so we fail to reject null hypothesis.

* The p-value from Fisher's exact testis based on summing the exact probabilities of seeing the observed count data or something more extreme in the direction of the alternative hypothesis.

R comparison example (part 11)
=========================================================

Fisher showed the exact probability from a contingency table (froom the one presented earlier) is given by:

$$
p_A = \frac{{n_{1.} \choose n_{11}}{n_{2.} \choose n_{21}}}{{N \choose n_{.1}}} = \frac{n_{1.}!n_{2.}!n_{.1}!n_{.2}!}{N!n_{11}!n_{12}!n_{21}!n_{22}!}
$$

Where
$$
n_{.1} = n_{11} + n_{21} \\
n_{1.} = n_{11} + n_{12} \\
n_{.2} = n_{12} + n_{22} \\
n_{2.} = n_{21} + n_{22} \\
n = n_{11} + n_{12} + n_{21} + n_{22}
$$

R comparison example (part 12)
=========================================================

It was showen by Emigh (1980) for the genetics setting that if we let $n_1 = 2 * n_{11} + n_{12}^*$, we can write the probability as:

$$
\tag{9.2}
p_A = \frac{n \choose n_{11},n_{12}^*,n_{22}}{2n \choose n_1}2^{n_{12}^*}
$$

If you are unfamiliar with permutation and combination notation, please see:
https://en.wikipedia.org/wiki/Permutation

R comparison example (part 13)
=========================================================
Now lets compute exact probability and exact p-value of HWE.

Suppose we are interested in testing for a departure from HWE for the same SNP from previous example but only within the Maya population.  This is a group of 25 individuals.

R comparison example (part 14)
=========================================================
```{r}
attach(hgdp)
Akt1Snp1Maya <- AKT1.C0756A[Population == "Maya"]
ObsCount <- table(Akt1Snp1Maya)
ObsCount
```

R comparison example (part 15)
=========================================================
```{r}
Nobs<-sum(ObsCount)
FreqC <- (2 * ObsCount[3] + ObsCount[2])/(2*Nobs)
ExpCount <- c(Nobs*(1-FreqC)^2, 2*Nobs*FreqC*(1-FreqC), Nobs*FreqC^2)
ExpCount
```

R comparison example (part 16)
=========================================================
Since the expected count for the first cell is less than 5, using Fisher's Exact Test to test for HWE is most appropriate.

An exact probability of eseing the observed counts, as described in (9.2) is given by **FisherP1**.

R comparison example (part 17)
=========================================================
First set our values
```{r}
n11 <- ObsCount[3]
n12 <- ObsCount[2]
n22 <- ObsCount[1]
n1 <- 2*n11+n12
```

R comparison example (part 18)
=========================================================
calculate numerator and denominator of (9.2)then compute $p_A$
```{r}
Num <- 2^n12 * factorial(Nobs)/prod(factorial(ObsCount))
Denom <- factorial(2*Nobs) / (factorial(n1)*factorial(2*Nobs-n1))
FisherP1 <- Num/Denom
FisherP1
```

R comparison example (part 19)
=========================================================

Fisher's Exact p-value is given by summing over all probabilities of seeing something as extreme as or *more* extreme than the observed data.

Thus to arrive at this p-value, we need to perform the calculation above for the more extreme situations as well, given by $n_{11} = 19$,$n_{11} = 20$, and $n_{11} = 21$.  Adjust $n_{12}$ and $n_{22}$ accordingly.

We can do this by using the **HWE.exact()** method in the genetics package.

R comparison example (part 20)
=========================================================

```{r}
library(genetics)
Akt1Snp1Maya <- genotype(AKT1.C0756A[Population == "Maya"],sep = "")
HWE.exact(Akt1Snp1Maya)
```

R comparison example (part 21)
=========================================================

Based on this output, we see that the exact p-value is 0.4843.

Thus, we are unable to reject the null hypothesis that there is a departure from HWE in this population.

Bootstrapping
=========================================================
For the next couple of tools, we will need to use a tool that I am going to skim over called bootstrapping.

The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample).

Bootstrapping Example part 1
=========================================================
As an example, assume we are interested in the average (or mean) height of people worldwide.

We cannot measure all the people in the global population, so instead we sample only a tiny part of it, and measure that.

Assume the sample is of size N; that is, we measure the heights of N individuals. From that single sample, only one estimate of the mean can be obtained.

In order to reason about the population, we need some sense of the variability of the mean that we have computed.


Bootstrapping Example part 2
=========================================================
The simplest bootstrap method involves taking the original data set of N heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N.

The bootstrap sample is taken from the original using sampling with replacement so, assuming N is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original "real" sample.

This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples we compute its mean (or some other statistic that we want)

Bagging part 1
=========================================================

In the example provided we were able to compute mean of a sample or approximate it for something we can't know directly.

We are now going to use bootstrap to do something completely different: improve statistical learning methods such as decision trees.

Bagging part 2
=========================================================

Two weeks ago we discussed decision trees.  However, those trees in general suffer from high variance.
This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.

Again, this goes back to a lot of tree methods are unfortunately not very robust.  In contrast, if we had a low variance method we would get results have some robust way of computing a model.

Bagging part 3
=========================================================

Bootstrap aggregation or "bagging" is a general purpose procedure for reducing variance of a statistical learning method.

Bagging part 4
=========================================================

Recall that given a set of $n$ independent observations $Z_1,...,Z_n$ each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$

In other words, when we average a set of observations, this reduces variance.

Bagging part 5
=========================================================
In this way, we could take many training sets, make many predictions, then average the models in some way.

So suppose we calculate predictors $\hat{f^1}(x),\hat{f^2}(x),...,\hat{f^B}(x)$ using $B$ seperate training sets and average them in order to obtain a single low-variance statistical learning model:

$$
\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f^b}(x)
$$

Bagging part 6
=========================================================

Because we do not have access to many training sets, we can generate those sets using bootstrapping from the single training data set.

Doing this we generate $B$ different bootstrapped training data sets.  We then train our method on $b^{th}$ bootstrapped training set in order to get $\hat{f^{*b}}(x)$ to arrive at our average.

$$
\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f^{*b}}(x)
$$


Random Forests
=========================================================

Stepping back to trees, we are going to look at some popular tree-based methods.

Random forests start by building a number of decision trees.  At each split in a tree that is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors.

The split is allowed to use only one of those $m$ predictors .  A fresh sample of $m$ predicators is taken at each split.

Typically we choose $m \approx \sqrt{p}$

Random Forests (part 2): Important Notes
=========================================================

As a result of this, the algorithm is not even allowed to consider a majority of available predictors.
This may sound crazy, but it has a rationale.  Suppose there is a very strong predictor in $p$, among other moderately strong predictors.  In other methods, the method would have opted for this strong one.

As a consequence of this, we would get a number of trees (in this case I am talking about the bagging algorithm) that are similar.

Random Forests (part 3)
=========================================================
* These many trees will be highly correlated.

* Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction of variance as averaging many uncorrelate quantities.

Random Forests (part 4)
=========================================================

In random forests, this is done by using the predictor subset $m$.

When we set $m=\sqrt{p}$, this results in a reduction of test error over setting $m = p$

Another advantage we get is if we have highly correlated predictors, choosing an even smaller $m$ can allow us to get trees with more unsimiliarity.

Boosting (part 1)
=========================================================

* Boosting is an approach to improving the resulting predictions from a decision tree.

* Like Like bagging, boosting is a general approach that can be applied to many different statistical learning methods for learning or classification.


Boosting (part 2)
=========================================================

* In bagging, we create multiple copies of the original training data set sing the bootstrap, fitting a seperate decision tree to each copy, and then combining all the trees in order to create a single predictive model.

* Note that each tree is built using data that is independent from other trees.

Boosting (part 3)
=========================================================

* In Boosting, we do the same thing except that the trees are grown **sequentially**.

* This means each tree is grown using information from other previously grown trees.

* Boosting does not involve boosttrap sampling.  Instead each tree is fit on a modified version of the original data set.

Boosting algorithm
=========================================================

1.  St $\hat{f}(x) = 0$ and $r_i  = y_i$ for all $i$ in the training set.

2.  For $b = 1, 2,...,B$, repeat:
    + Fit a tree $\,\hat{f^b}$ with $d$ splits ($d+1$ terminal nodes) to the training data $(X,r)$.
    + Update $\hat{f}$ by adding in a shrunken version of the new tree.

$$
\tag{9.3}
\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\hat{f^{b}}
$$

 Update the residuals

$$
r_i \leftarrow r_i - \lambda\hat{f^{b}}(x_i)
$$

Boosting algorithm (part 2)
=========================================================

And finally output the boosted model

$$
\hat{f}(x) = \sum_{b=1}^B\lambda\hat{f^{b}}(x)
$$

Boosting general idea
=========================================================

Unlike fitting a single large decision tree to the data, boosting approach instead learns slowly.

We fit a tree to the current residuals rather than the outcome of the tree.

We then add this new decision tree into the fitted function in order to update the residuals.

Each of these trees can be small, as the number of terminal nodes can be made small by making $d$ small

Boosting general idea (part 2)
=========================================================

Having a small value for $\lambda$, the shrinkage parameter, slows the process of learning down and allows for more and different shaped trees.

These trees attack the residuals that are tough to learn.

In general approaches that learn slowly tend to perform well.

Decision Trees using Tree-Based Methods (part 1)
=========================================================

We are going to use the **Boston** data using the **randomForest** package in R.

Note that bagging and random forest are the same when $m=p$ so we are going to use the **randomForest()** method for both.

Decision Trees using Tree-Based Methods (part 2)
=========================================================

```{r}
install.packages("randomForest", repos = "http://cran.us.r-project.org")
library(randomForest)
set.seed(1)
```

Decision Trees using Tree-Based Methods (part 3)
=========================================================
Lets take a look at what the data looks like.  These are housing values in suburbs of Boston.
```{r}
head(Boston)
```

Decision Trees using Tree-Based Methods (part 4)
=========================================================
Here we are using the median home value in our random forest.
```{r}
train = sample(1:nrow(Boston), nrow(Boston)/2)
bag.boston=randomForest(medv~.,data = Boston, subset = train, mtry = 13, importance = TRUE)
bag.boston
```

Decision Trees using Tree-Based Methods (part 5)
=========================================================

The argument **mtry=13** indicates that all 13 predictors should be considered for each split of the tree

in other words, the bagging should be done. Lets see how it does

Decision Trees using Tree-Based Methods (part 6)
=========================================================
```{r, echo=FALSE}
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
boston.test=Boston[-train,"medv"]
plot(yhat.bag, boston.test)
abline(0,1)
```

Decision Trees using Tree-Based Methods (part 7)
=========================================================
```{r}
mean((yhat.bag-boston.test)^2)
```

Decision Trees using Tree-Based Methods (part 8)
=========================================================

The test set MSE associated with the bagged regression tree is 13.16.

This is about half the value obtained using an optimally pruned single tree (25.05).  Going through this was skipped, I can write up something if folks are interested in it.

We could change the number of trees grown using **ntree** argument.

Decision Trees using Tree-Based Methods (part 9)
=========================================================

```{r}
bag.boston=randomForest(medv~., data=Boston, subset=train,mtry=13, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)
```

Decision Trees using Tree-Based Methods (part 10)
=========================================================

So far we have been doing bagging.  In order to do random forest, we need to change the mtry argument.

By default, **randomForest()** will use $p/3$ for regression trees and $\sqrt{p}$ variables when building classification trees.

In the following we will use $mtry = 6$

Decision Trees using Tree-Based Methods (part 11)
=========================================================

```{r}
set.seed(1)
rf.boston = randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)
```

Decision Trees using Tree-Based Methods (part 12)
=========================================================

Test set MSE is 11.48.  This indicates that random forests yielded an improvement over bagging.

This is no surprise, random forests is a popular method because it performs very well in a lot of cases.
If you are interested in these topics, Max Kuhn has a great book on these and more:

http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485

Decision Trees using Tree-Based Methods (part 13)
=========================================================

To go more in depth to gain more insight into our variables, we use the **importance()** function to view importance of each variable.

Decision Trees using Tree-Based Methods (part 14)
=========================================================

```{r}
importance(rf.boston)
```

Decision Trees using Tree-Based Methods (part 15)
=========================================================
To find plots of importance measures we can do:
```{r}
varImpPlot(rf.boston)
```

Decision Trees using Tree-Based Methods (part 16)
=========================================================
We see that across all trees considered, two variables stand out: **lstat** and **rm**

Thus the wealth of the community (lstat) and house size (rm) are the two most important variables.

Decision Trees using Tree-Based Methods (part 17)
=========================================================

For boosting, we will use the **gbm** package.

Since this is a regression problem, we are going to run **gbm()** with **distribution="gaussian"**.

If this was a binary classification problem we would use **distribution="bernoulli"**


Decision Trees using Tree-Based Methods (part 18)
=========================================================
```{r}
install.packages("gbm", repos = "http://cran.us.r-project.org")
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,], distribution="gaussian",n.trees=5000,interaction.depth=4)
```

Decision Trees using Tree-Based Methods (part 19)
=========================================================
Getting the summary produces a relative influence plot and also outputs the relative influence statistics

Decision Trees using Tree-Based Methods (part 20)
=========================================================
```{r, echo=FALSE, fig.height=4, fig.width=4}
summary(boost.boston)
```

Decision Trees using Tree-Based Methods (part 21)
=========================================================

Again we see lstat and rm are by far the most important variables.  We can also produce partial dependence plots for these two variables.

These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables.

In this case we might expect median house prices are incraseing with **rm** and decreasing with **lstat**

Decision Trees using Tree-Based Methods (part 22)
=========================================================
```{r}
par(mfrow=c(1,2))
plot(boost.boston,i="rm")
plot(boost.boston,i="lstat")
```

Decision Trees using Tree-Based Methods (part 23)
=========================================================
We now use the boosted model to predict **medv**
```{r}
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
```

Decision Trees using Tree-Based Methods (part 24)
=========================================================
The test MSE obtained is 11.84, similar to the test MSE for random forest and superior to that of bagging.

We can augment the shrinking parameter $\lambda$.  The default is 0.001, however lets take $\lambda=0.2$

Decision Trees using Tree-Based Methods (part 25)
=========================================================
And we get an improvement!
```{r}
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",
                 n.trees=5000,interaction.depth=4,shrinkage=0.2)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)
```