author: Alexander Frieden date: 4/4/2016 autosize: true
Suppose I have a set of variants found in a sick population
To do this, we will need some new tools to do this
-
The test is useful for categorical data that result from classifying objects in two different ways
-
It is used to examine the significance of the association (contingency) between the two kinds of classification.
-
Chi Square Test is based on approximation that works best when
$n$ is large -
We can conduct a hypothesis comparing two proprotions using Fisher's Exact Test.
-
This test lets us compute the exact probability of the occurrence of the observed frequencies in the contingency table
-
Technique is very useful in that we can do this for small
$n$ .
Set up data frame.
challenge.df = matrix(c(1,4,7,4), nrow = 2)Run Fisher's Exact Test
fisher.test(challenge.df)
Fisher's Exact Test for Count Data
data: challenge.df
p-value = 0.2821
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.002553456 2.416009239
sample estimates:
odds ratio
0.1624254
The p-value calculated for the test does not provide any evidence against the assumption of independence. In this example this means that we cannot confidently claim any difference in performance for the two challengers.
McDonald and Kreitman (1991) sequenced the alcohol dehydrogenase gene in several individuals of three species of Drosophila. Varying sites were classified as synonymous (the nucleotide variation does not change an amino acid) or amino acid replacements, and they were also classified as polymorphic (varying within a species) or fixed differences between species.
The two nominal variables are thus synonymicity ("synonymous" or "replacement") and fixity ("polymorphic" or "fixed"). In the absence of natural selection, the ratio of synonymous to replacement sites should be the same for polymorphisms and fixed differences.
There were 43 synonymous polymorphisms, 2 replacement polymorphisms, 17 synonymous fixed differences, and 7 replacement fixed differences.
| synonymous | replacement | |
|---|---|---|
| polymorphisms | 43 | 2 |
| fixed | 17 | 7 |
The result is
A number of the following are taken from Andrea S. Foulkes "Applied Statistical Genetics with R" Back to Hardy Weinberg
-
Tests of Hardy Weinberg include Pearson's chi squared test. The
$\chi^2$ test is computationally advantageous. -
This is because it relies an asymptotic assumptions, hence why it likes large
$n$ . -
For example, when more than 20% of expected counts are less than five, it is better to use Fisher's Exact Test.
-
For the following example, a homologous gene (or homolog) is a gene inherited in two species by a common ancestor
Consider the following:
-
This is a genotype table about a single locus.
-
Here
$n_{11}$ and$n_{22}$ be the number of inidividuals with AA and aa respectively. -
Note that Aa and aA are indistinguishable in population based investigations.
We can only get the observed sum
The expected counts correspond to these three observed counts
Let $$ E_{11} = Np_A^2 \ E_{12} = 2Np_A(1-p_A) \ E_{22} = N(1-p_A)^2 $$
Where
We can then construct the chi-square statistic
Where
-
As with any chi-square test, we have a number of degrees of freedom.
-
In this case we have a single degree of freedom because knowledge of the counts in one of the cells fully determines the counts in the rest of them, conditional on the marginal totals.
For example, suppose we know
Then we can get
Note that a statistically significant test result of HWE suggests that the SNP we are investigating is in Hardy-Weinberg disequilibrium. HWD is often called non-random mating
-
Suppose we are interested in testing for HWE for the SNP labeled AKT1.C0756A in HGDP (Human Genome Diversity Project)
-
To do this, we need to calculate observed and expected genotype counts
hgdp <- read.delim("http://stat-gen.org/book.e1/data/HGDP_AKT1.txt", header=T, sep="\t")
attach(hgdp)
Akt1Snp1 <- AKT1.C0756A
ObsCount <- table(Akt1Snp1)
Nobs <- sum(ObsCount)
ObsCountAkt1Snp1
AA CA CC
48 291 724
FreqC <- (2 * ObsCount[3] + ObsCount[2])/(2*Nobs)
ExpCount <- c(Nobs*(1-FreqC)^2,2*Nobs*FreqC*(1-FreqC),Nobs*FreqC^2)
ExpCount CC CC CC
35.22319 316.55362 711.22319
-
In this example, vectors of observed and expected are for AA,CA, CC.
-
The
$\chi^2$ -statistic is calculated using equation (9.1). -
We now implement it.
ChiSqStat <- sum((ObsCount - ExpCount)^2/ExpCount)
ChiSqStat[1] 6.926975
Lets now compute the
qchisq(1-0.05,df=1)[1] 3.841459
-
Since 6.93 > 3.84, based on this we would reject the null hypothesis of HWE.
-
We would instead conclude that the alleles on two homologous chromosomes are associated with one another.
-
Alternatively the HWE.chisq() function in the genetics packagr can be used to calculate this statistic.
Use genotype() to create genotype objects
install.packages("genetics", repos = "http://cran.us.r-project.org")
The downloaded binary packages are in
/var/folders/tj/8dxhxfns3fb0fx5kswwdvjbr0000gp/T//RtmpZsAXSm/downloaded_packages
library(genetics)
Akt1Snp1 <- genotype(AKT1.C0756A, sep="")HWE.chisq(Akt1Snp1)
Pearson's Chi-squared test with simulated p-value (based on 10000
replicates)
data: tab
X-squared = 6.927, df = NA, p-value = 0.009899
-
Note that we are returned both a
$\chi^2$ statistic and a p-value. The p-value is still above the 0.05 that we are looking for so we fail to reject null hypothesis. -
The p-value from Fisher's exact testis based on summing the exact probabilities of seeing the observed count data or something more extreme in the direction of the alternative hypothesis.
Fisher showed the exact probability from a contingency table (froom the one presented earlier) is given by:
Where $$ n_{.1} = n_{11} + n_{21} \ n_{1.} = n_{11} + n_{12} \ n_{.2} = n_{12} + n_{22} \ n_{2.} = n_{21} + n_{22} \ n = n_{11} + n_{12} + n_{21} + n_{22} $$
It was showen by Emigh (1980) for the genetics setting that if we let
If you are unfamiliar with permutation and combination notation, please see: https://en.wikipedia.org/wiki/Permutation
Now lets compute exact probability and exact p-value of HWE.
Suppose we are interested in testing for a departure from HWE for the same SNP from previous example but only within the Maya population. This is a group of 25 individuals.
attach(hgdp)
Akt1Snp1Maya <- AKT1.C0756A[Population == "Maya"]
ObsCount <- table(Akt1Snp1Maya)
ObsCountAkt1Snp1Maya
AA CA CC
1 6 18
Nobs<-sum(ObsCount)
FreqC <- (2 * ObsCount[3] + ObsCount[2])/(2*Nobs)
ExpCount <- c(Nobs*(1-FreqC)^2, 2*Nobs*FreqC*(1-FreqC), Nobs*FreqC^2)
ExpCount CC CC CC
0.64 6.72 17.64
Since the expected count for the first cell is less than 5, using Fisher's Exact Test to test for HWE is most appropriate.
An exact probability of eseing the observed counts, as described in (9.2) is given by FisherP1.
First set our values
n11 <- ObsCount[3]
n12 <- ObsCount[2]
n22 <- ObsCount[1]
n1 <- 2*n11+n12calculate numerator and denominator of (9.2)then compute
Num <- 2^n12 * factorial(Nobs)/prod(factorial(ObsCount))
Denom <- factorial(2*Nobs) / (factorial(n1)*factorial(2*Nobs-n1))
FisherP1 <- Num/Denom
FisherP1 CA
0.4011216
Fisher's Exact p-value is given by summing over all probabilities of seeing something as extreme as or more extreme than the observed data.
Thus to arrive at this p-value, we need to perform the calculation above for the more extreme situations as well, given by
We can do this by using the HWE.exact() method in the genetics package.
library(genetics)
Akt1Snp1Maya <- genotype(AKT1.C0756A[Population == "Maya"],sep = "")
HWE.exact(Akt1Snp1Maya)
Exact Test for Hardy-Weinberg Equilibrium
data: Akt1Snp1Maya
N11 = 18, N12 = 6, N22 = 1, N1 = 42, N2 = 8, p-value = 0.4843
Based on this output, we see that the exact p-value is 0.4843.
Thus, we are unable to reject the null hypothesis that there is a departure from HWE in this population.
For the next couple of tools, we will need to use a tool that I am going to skim over called bootstrapping.
The basic idea of bootstrapping is that inference about a population from sample data (sample → population) can be modeled by resampling the sample data and performing inference on (resample → sample).
As an example, assume we are interested in the average (or mean) height of people worldwide.
We cannot measure all the people in the global population, so instead we sample only a tiny part of it, and measure that.
Assume the sample is of size N; that is, we measure the heights of N individuals. From that single sample, only one estimate of the mean can be obtained.
In order to reason about the population, we need some sense of the variability of the mean that we have computed.
The simplest bootstrap method involves taking the original data set of N heights, and, using a computer, sampling from it to form a new sample (called a 'resample' or bootstrap sample) that is also of size N.
The bootstrap sample is taken from the original using sampling with replacement so, assuming N is sufficiently large, for all practical purposes there is virtually zero probability that it will be identical to the original "real" sample.
This process is repeated a large number of times (typically 1,000 or 10,000 times), and for each of these bootstrap samples we compute its mean (or some other statistic that we want)
In the example provided we were able to compute mean of a sample or approximate it for something we can't know directly.
We are now going to use bootstrap to do something completely different: improve statistical learning methods such as decision trees.
Two weeks ago we discussed decision trees. However, those trees in general suffer from high variance.
This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different.
Again, this goes back to a lot of tree methods are unfortunately not very robust. In contrast, if we had a low variance method we would get results have some robust way of computing a model.
Bootstrap aggregation or "bagging" is a general purpose procedure for reducing variance of a statistical learning method.
Recall that given a set of
In other words, when we average a set of observations, this reduces variance.
In this way, we could take many training sets, make many predictions, then average the models in some way.
So suppose we calculate predictors
Because we do not have access to many training sets, we can generate those sets using bootstrapping from the single training data set.
Doing this we generate
Stepping back to trees, we are going to look at some popular tree-based methods.
Random forests start by building a number of decision trees. At each split in a tree that is considered, a random sample of
The split is allowed to use only one of those
Typically we choose
As a result of this, the algorithm is not even allowed to consider a majority of available predictors.
This may sound crazy, but it has a rationale. Suppose there is a very strong predictor in
As a consequence of this, we would get a number of trees (in this case I am talking about the bagging algorithm) that are similar.
-
These many trees will be highly correlated.
-
Unfortunately, averaging many highly correlated quantities does not lead to as large a reduction of variance as averaging many uncorrelate quantities.
In random forests, this is done by using the predictor subset
When we set
Another advantage we get is if we have highly correlated predictors, choosing an even smaller
-
Boosting is an approach to improving the resulting predictions from a decision tree.
-
Like Like bagging, boosting is a general approach that can be applied to many different statistical learning methods for learning or classification.
-
In bagging, we create multiple copies of the original training data set sing the bootstrap, fitting a seperate decision tree to each copy, and then combining all the trees in order to create a single predictive model.
-
Note that each tree is built using data that is independent from other trees.
-
In Boosting, we do the same thing except that the trees are grown sequentially.
-
This means each tree is grown using information from other previously grown trees.
-
Boosting does not involve boosttrap sampling. Instead each tree is fit on a modified version of the original data set.
-
St
$\hat{f}(x) = 0$ and$r_i = y_i$ for all$i$ in the training set. -
For
$b = 1, 2,...,B$ , repeat:- Fit a tree
$,\hat{f^b}$ with$d$ splits ($d+1$ terminal nodes) to the training data$(X,r)$ . - Update
$\hat{f}$ by adding in a shrunken version of the new tree.
- Fit a tree
Update the residuals
And finally output the boosted model
Unlike fitting a single large decision tree to the data, boosting approach instead learns slowly.
We fit a tree to the current residuals rather than the outcome of the tree.
We then add this new decision tree into the fitted function in order to update the residuals.
Each of these trees can be small, as the number of terminal nodes can be made small by making
Having a small value for
These trees attack the residuals that are tough to learn.
In general approaches that learn slowly tend to perform well.
We are going to use the Boston data using the randomForest package in R.
Note that bagging and random forest are the same when
install.packages("randomForest", repos = "http://cran.us.r-project.org")
The downloaded binary packages are in
/var/folders/tj/8dxhxfns3fb0fx5kswwdvjbr0000gp/T//RtmpZsAXSm/downloaded_packages
library(randomForest)
set.seed(1)Lets take a look at what the data looks like. These are housing values in suburbs of Boston.
head(Boston) crim zn indus chas nox rm age dis rad tax ptratio black
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
lstat medv
1 4.98 24.0
2 9.14 21.6
3 4.03 34.7
4 2.94 33.4
5 5.33 36.2
6 5.21 28.7
Here we are using the median home value in our random forest.
train = sample(1:nrow(Boston), nrow(Boston)/2)
bag.boston=randomForest(medv~.,data = Boston, subset = train, mtry = 13, importance = TRUE)
bag.boston
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 13
Mean of squared residuals: 11.08966
% Var explained: 86.57
The argument mtry=13 indicates that all 13 predictors should be considered for each split of the tree
in other words, the bagging should be done. Lets see how it does
mean((yhat.bag-boston.test)^2)[1] 13.33831
The test set MSE associated with the bagged regression tree is 13.16.
This is about half the value obtained using an optimally pruned single tree (25.05). Going through this was skipped, I can write up something if folks are interested in it.
We could change the number of trees grown using ntree argument.
bag.boston=randomForest(medv~., data=Boston, subset=train,mtry=13, ntree=25)
yhat.bag = predict(bag.boston, newdata=Boston[-train,])
mean((yhat.bag-boston.test)^2)[1] 15.96786
So far we have been doing bagging. In order to do random forest, we need to change the mtry argument.
By default, randomForest() will use
In the following we will use
set.seed(1)
rf.boston = randomForest(medv~.,data=Boston,subset=train,mtry=6,importance=TRUE)
yhat.rf = predict(rf.boston,newdata=Boston[-train,])
mean((yhat.rf-boston.test)^2)[1] 11.48022
Test set MSE is 11.48. This indicates that random forests yielded an improvement over bagging.
This is no surprise, random forests is a popular method because it performs very well in a lot of cases.
If you are interested in these topics, Max Kuhn has a great book on these and more:
http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485
To go more in depth to gain more insight into our variables, we use the importance() function to view importance of each variable.
importance(rf.boston) %IncMSE IncNodePurity
crim 12.547772 1094.65382
zn 1.375489 64.40060
indus 9.304258 1086.09103
chas 2.518766 76.36804
nox 12.835614 1008.73703
rm 31.646147 6705.02638
age 9.970243 575.13702
dis 12.774430 1351.01978
rad 3.911852 93.78200
tax 7.624043 453.19472
ptratio 12.008194 919.06760
black 7.376024 358.96935
lstat 27.666896 6927.98475
To find plots of importance measures we can do:
varImpPlot(rf.boston)We see that across all trees considered, two variables stand out: lstat and rm
Thus the wealth of the community (lstat) and house size (rm) are the two most important variables.
For boosting, we will use the gbm package.
Since this is a regression problem, we are going to run gbm() with distribution="gaussian".
If this was a binary classification problem we would use distribution="bernoulli"
install.packages("gbm", repos = "http://cran.us.r-project.org")
The downloaded binary packages are in
/var/folders/tj/8dxhxfns3fb0fx5kswwdvjbr0000gp/T//RtmpZsAXSm/downloaded_packages
library(gbm)
set.seed(1)
boost.boston=gbm(medv~.,data=Boston[train,], distribution="gaussian",n.trees=5000,interaction.depth=4)Getting the summary produces a relative influence plot and also outputs the relative influence statistics
var rel.inf
lstat lstat 45.9627334
rm rm 31.2238187
dis dis 6.8087398
crim crim 4.0743784
nox nox 2.5605001
ptratio ptratio 2.2748652
black black 1.7971159
age age 1.6488532
tax tax 1.3595005
indus indus 1.2705924
chas chas 0.8014323
rad rad 0.2026619
zn zn 0.0148083
Again we see lstat and rm are by far the most important variables. We can also produce partial dependence plots for these two variables.
These plots illustrate the marginal effect of the selected variables on the response after integrating out the other variables.
In this case we might expect median house prices are incraseing with rm and decreasing with lstat
par(mfrow=c(1,2))
plot(boost.boston,i="rm")
plot(boost.boston,i="lstat")We now use the boosted model to predict medv
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)[1] 11.84434
The test MSE obtained is 11.84, similar to the test MSE for random forest and superior to that of bagging.
We can augment the shrinking parameter
And we get an improvement!
boost.boston=gbm(medv~.,data=Boston[train,],distribution="gaussian",
n.trees=5000,interaction.depth=4,shrinkage=0.2)
yhat.boost=predict(boost.boston,newdata=Boston[-train,],n.trees=5000)
mean((yhat.boost-boston.test)^2)[1] 11.51109




