-
Notifications
You must be signed in to change notification settings - Fork 3
Angsd
NB: the example angsd analysis makefile includes variables that are used for submitting to a compute cluster running a SLURM resource manager.
angsd is a program for doing analysis of next generation sequencing data. It's focus is on population genetics. The file Makefile.angsd contains rules for some of the most common operations. In addition, there is a template file Makefile.analysis.angsd for quickly setting up an analysis.
The rules generally obey the pattern %.suffix, where % can be a population identifier. In some cases, analyses are run on specific chromosomes. Then, the pattern % could be constructed as POP_CHROMOSOME. In these cases, one must make sure the names of the requirements correspond to the desired output.
As an example, one of the analyses consists of generating an sfs prior, preferably based on the entire data (all chromosomes). One of the rules for creating an sfs prior is
%.sfs.em2.ml: %.sfs
$(ANGSD_EMOPTIM2) $< $(ANGSD_OPTION_NCHR) $(ANGSD_EMOPTIM2_OPTIONS) > [email protected]
$(AWK) '{for (i=1; i<NF; i++) printf("%f ",exp($$i)); printf("%f", exp($$i))}' [email protected] > [email protected] && mv [email protected] $@
The output file, POP.sfs.em2.ml, can be used as input when generating a thetas file:
%.thetas.gz: %.sfs.em2.ml
$(if $(findstring $(ANGSD_CHR_LABEL),$(notdir $*)),$(eval ANGSD_OPTION_CHR=$(word 2,$(subst _, ,$(notdir $*)))_$(word 3,$(subst _, ,$(notdir $*)))),$(eval ANGSD_OPTION_CHR=$(word 2,$(subst _, ,$(notdir $*)))))
$(ANGSD) $(ANGSD_DOTHETAS_OPTIONS) -pest $< -bam $*.list -out $*.tmp && mv $*.tmp.thetas.gz $@ && mv $*.tmp.arg $*.arg
This rule is setup to automagically work on filenames formatted as POP_CHROMOSOME_NUMBER.thetas.gz. It will look at the pattern %, and if it finds the string $(ANGSD_CHR_LABEL) (a variable representing the chromosome names, typically 'chr' or 'scaffold'), it will split the string based on "_" and set ANGSD_OPTION_CHR to the second match (the first being the population) so as to run thetas on this chromosome only. The pattern in prerequisite %.sfs.em2.ml should be identical (POP_CHROMOSOME_NUMBER). In other words, if one wants to use the population-level sfs.em2.ml file as input when calculating thetas statistics, a trick is to create a link:
ln -s POP.sfs.em2.ml POP_CHROMOSOME_NUMBER.sfs.em2.ml
Then, the following command will calculate thetas statistics for CHROMOSOME_NUMBER:
make POP_CHROMOSOME_NUMBER.thetas.gz
See more examples below.
Copy the example file to an appropriate location and rename it Makefile. Edit the appropriate variables. Most likely, you have to set the following:
- ANGSD_POPULATIONS: population prefixes or labels
- ANGSD_CHROMOSOMES: names of chromosomes
Moreover, you probably have to redefine the rule that generates the files that list input bam files. The default example rule is:
# The following example lists file in an input directory that match
# %[0-9][0-9].bam
#
%.list:
ls -1 /path/to/inputdata/$(firstword $(subst ., ,$@))[0-9][0-9].bam | sort > [email protected] && mv [email protected] $@
# Phony target to generate lists
lists: $(addsuffix .list,$(ANGSD_POPULATIONS))
The lists target creates a list of names POP.list for each POP in ANGSD_POPULATIONS.
Once a Makefile has been setup, it should be fairly straightforward to run analyses. Here follow a couple of examples. I'm assuming the following variables have been set:
ANGSD_POPULATIONS=A B C D
ANGSD_CHR_ENUM_FROM=0
ANGSD_CHR_ENUM_TO=10
ANGSD_CHR_ENUM=$(shell seq $(ANGSD_CHR_ENUM_FROM) $(ANGSD_CHR_ENUM_TO))
ANGSD_CHR_LABEL=scaffold
ANGSD_CHROMOSOMES=$(foreach i,${ANGSD_CHR_ENUM},$(ANGSD_CHR_LABEL)_$(i)) $(ANGSD_CHROMOSOMES_EXTRA)
Consequently, we have populations A,B,C and D, and chromosomes scaffold_0 to scaffold_10.
make A.sfs.em2.ml
For all chromosomes:
make A.thetas.gz
For chromosome scaffold_0:
ln -s A.sfs.em2.ml A_scaffold_0.sfs.em2.ml
make A_scaffold_0.thetas.gz
make pestpg
Included in biomake is a simple R-script for plotting the thetas. Running
make plotthetas
will generate pdfs of the windows-based files.