Skip to content
Per Unneberg edited this page Dec 3, 2013 · 2 revisions

NB: the example angsd analysis makefile includes variables that are used for submitting to a compute cluster running a SLURM resource manager.

angsd is a program for doing analysis of next generation sequencing data. It's focus is on population genetics. The file Makefile.angsd contains rules for some of the most common operations. In addition, there is a template file Makefile.analysis.angsd for quickly setting up an analysis.

A note about rules and patterns

The rules generally obey the pattern %.suffix, where % can be a population identifier. In some cases, analyses are run on specific chromosomes. Then, the pattern % could be constructed as POP_CHROMOSOME. In these cases, one must make sure the names of the requirements correspond to the desired output.

As an example, one of the analyses consists of generating an sfs prior, preferably based on the entire data (all chromosomes). One of the rules for creating an sfs prior is

%.sfs.em2.ml: %.sfs
	$(ANGSD_EMOPTIM2) $< $(ANGSD_OPTION_NCHR) $(ANGSD_EMOPTIM2_OPTIONS) > [email protected]
	$(AWK) '{for (i=1; i<NF; i++) printf("%f ",exp($$i)); printf("%f", exp($$i))}' [email protected] > [email protected] && mv [email protected] $@

The output file, POP.sfs.em2.ml, can be used as input when generating a thetas file:

%.thetas.gz: %.sfs.em2.ml
	$(if $(findstring $(ANGSD_CHR_LABEL),$(notdir $*)),$(eval ANGSD_OPTION_CHR=$(word 2,$(subst _, ,$(notdir $*)))_$(word 3,$(subst _, ,$(notdir $*)))),$(eval ANGSD_OPTION_CHR=$(word 2,$(subst _, ,$(notdir $*)))))
	$(ANGSD) $(ANGSD_DOTHETAS_OPTIONS) -pest $< -bam $*.list -out $*.tmp && mv $*.tmp.thetas.gz $@ && mv $*.tmp.arg $*.arg

This rule is setup to automagically work on filenames formatted as POP_CHROMOSOME_NUMBER.thetas.gz. It will look at the pattern %, and if it finds the string $(ANGSD_CHR_LABEL) (a variable representing the chromosome names, typically 'chr' or 'scaffold'), it will split the string based on "_" and set ANGSD_OPTION_CHR to the second match (the first being the population) so as to run thetas on this chromosome only. The pattern in prerequisite %.sfs.em2.ml should be identical (POP_CHROMOSOME_NUMBER). In other words, if one wants to use the population-level sfs.em2.ml file as input when calculating thetas statistics, a trick is to create a link:

ln -s POP.sfs.em2.ml POP_CHROMOSOME_NUMBER.sfs.em2.ml

Then, the following command will calculate thetas statistics for CHROMOSOME_NUMBER:

make POP_CHROMOSOME_NUMBER.thetas.gz

See more examples below.

Setting up the makefile

Copy the example file to an appropriate location and rename it Makefile. Edit the appropriate variables. Most likely, you have to set the following:

  1. ANGSD_POPULATIONS: population prefixes or labels
  2. ANGSD_CHROMOSOMES: names of chromosomes

Moreover, you probably have to redefine the rule that generates the files that list input bam files. The default example rule is:

# The following example lists file in an input directory that match
# %[0-9][0-9].bam
#
%.list:
 	ls -1 /path/to/inputdata/$(firstword $(subst ., ,$@))[0-9][0-9].bam | sort > [email protected] && mv [email protected] $@

# Phony target to generate lists
lists: $(addsuffix .list,$(ANGSD_POPULATIONS))

The lists target creates a list of names POP.list for each POP in ANGSD_POPULATIONS.

Examples

Once a Makefile has been setup, it should be fairly straightforward to run analyses. Here follow a couple of examples. I'm assuming the following variables have been set:

ANGSD_POPULATIONS=A B C D
ANGSD_CHR_ENUM_FROM=0
ANGSD_CHR_ENUM_TO=10
ANGSD_CHR_ENUM=$(shell seq $(ANGSD_CHR_ENUM_FROM) $(ANGSD_CHR_ENUM_TO))
ANGSD_CHR_LABEL=scaffold
ANGSD_CHROMOSOMES=$(foreach i,${ANGSD_CHR_ENUM},$(ANGSD_CHR_LABEL)_$(i)) $(ANGSD_CHROMOSOMES_EXTRA)

Consequently, we have populations A,B,C and D, and chromosomes scaffold_0 to scaffold_10.

Generating an sfs prior

make A.sfs.em2.ml

Calculate thetas

For all chromosomes:

make A.thetas.gz

For chromosome scaffold_0:

ln -s A.sfs.em2.ml A_scaffold_0.sfs.em2.ml
make A_scaffold_0.thetas.gz

Generate windows-based estimates

make pestpg

Plot thetas

Included in biomake is a simple R-script for plotting the thetas. Running

make plotthetas

will generate pdfs of the windows-based files.