In this step, we will download all the genomes from GenBank (note: they are already in the genomes directory, so you can skip this step!) and extract the protein sequences into a FASTA file. We also gather the information we need about the genomes.
The file spounavirus.tsv is the spreadsheet with all the spounavirus IDs in it. Note that I added an extra column, the first column, with a unique number that I use while processing the sequences.
We preferred the RefSeq ID where possible, but otherwise we used the INSDC ID. To make a file with just the IDs we used this one-line Perl command (this assumes the RefSeq ID is in the third column and the INSDC ID in the fourth):
perl -ne '@a=split /\t/; map {$a[$_] =~ s/^\s+//; $a[$_] =~ s/\s+$//} (2,3);
if ($a[2] eq "-") {print "$a[3]\n"} else {print "$a[2]\n"}'
spounavirus.tsv > spounavirus_ids.txt
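For orientation, here are two hypothetical tab-separated rows showing just the columns the one-liner uses (only these columns matter here; the real spreadsheet may have more). The first row yields the RefSeq ID, the second falls back to the INSDC ID:

1	Example phage A	NC_000001	AB123456
2	Example phage B	-	AB654321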
Next, we get all the genomes in GenBank format. For that, we use the command-line EDirect utilities from NCBI. (Note: these weren't around in 1999!)
for GENOME in $(cat spounavirus_ids.txt);
do echo $GENOME;
esearch -db nucleotide -query $GENOME | efetch -format gb > genomes/$GENOME.gbk;
done
Now we need to extract all the proteins from all the genomes and create a list of names that we will use on the final tree. This list of names also includes an automatically generated abbreviation of each name. These abbreviations are not tested for uniqueness, but hopefully they will be!
Then we create a list of [protein IDs, genome IDs, protein lengths] that we use when we combine everything later.
perl ../genbank_code/genbank2table.pl -t -p proteins.faa -i genome_names.txt genomes/*
perl ../protein_genome_length.pl proteins.faa > protein_genome_length.txt
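Each line of protein_genome_length.txt holds one of those triples, tab-separated, roughly like this (hypothetical IDs and lengths):

WP_000000001.1	NC_000001	253
WP_000000002.1	NC_000001	88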
To test whether the names are unique, you can run this command and check that nothing at the bottom of the list occurs more than once:
cut -f 3 -d$'\t' genome_names.txt | sort | uniq -c | sort -n
In this step, we will use blastp to precluster the proteins. We also rename the proteins with unique integers (as there are an infinite number of integers!). In addition, some of the downstream steps limit the length of protein names, so using integers avoids any renaming issues.
Now we are going to identify our groups of proteins. We start by converting proteins.faa into a BLAST database. The makeblastdb code is part of BLAST+. The split_blast_queries_edwards_blastplus.pl script runs on our cluster, using Sun Grid Engine to submit the BLAST jobs. You probably don't need to use the cluster, in which case you can just run a regular BLAST+ search to end up with the phage_proteins.blastp file; a sketch follows the commands below.
makeblastdb -in proteins.faa -dbtype prot
perl ../split_blast_queries_edwards_blastplus.pl -f proteins.faa -n 100 -p blastp -db proteins.faa -d phage_proteins_blast -e 0.1
# when this is done concatenate everything
cat phage_proteins_blast/*blastp > phage_proteins.blastp
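If you are not using a cluster, a single BLAST+ search produces the same file directly. A minimal sketch, assuming ppt.jar parses the same output format the cluster script would have produced (adjust -outfmt if yours differs, and -num_threads to your machine):

blastp -query proteins.faa -db proteins.faa -evalue 0.1 -num_threads 8 -out phage_proteins.blastp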
We cluster all the proteins into groups based on their individual BLAST results. This is a greedy clustering; in the next step we will align the clusters and refine the results. I run this command on a machine with plenty of RAM!
mkdir fastafiles
java -jar ~/PhageProteomicTree/ppt.jar phage_proteins.blastp 0.1 proteins.faa fastafiles
This program writes the FASTA files out into a directory (fastafiles). Each FASTA file contains one cluster of proteins, with the proteins renamed to integers for the downstream analysis. The code also makes a file called id.map that has the original protein IDs and the new protein IDs (which are just integers).
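id.map is just a lookup table, presumably one protein per line with the original ID and its integer replacement (hypothetical IDs; check which column order ppt.jar actually writes):

WP_000000001.1	1
WP_000000002.1	2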
In this step, we need to create a couple of ID mapping files so we know where things came from. We need to associate proteins with genomes, and genomes with IDs, so we make two map files, called genome_id.map and protein_genome_lengths.txt, that you will need for later steps.
perl ~/PhageProteomicTree/make_id_maps.pl id.map protein_genome_length.txt genome_names.txt
In this step, we will run clustalw and protdist. We use the cluster for this, but you probably don't have to, since it is largely IO-bound anyway (it doesn't matter if you don't know what that means). If you need to download them, get ClustalW and PHYLIP, which contains protdist.
We figure out how many files we have, and we set that as an environment variable. We also set up some directories.
NUMFILES=$(ls fastafiles/ | sed -e 's/fasta.//' | sort -nr | head -n 1); echo "There are $NUMFILES files to process"
for DIR in sge_output aligned dnd protdist; do if [ ! -e $DIR ]; then mkdir $DIR; fi; done
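Equivalently, mkdir -p makes all of them in one line and silently skips any directory that already exists:

mkdir -p sge_output aligned dnd protdist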
Now comes the heavy lifting. We are going to run clustalw on all those sequence files and generate phylip output files. Then we're going to run protdist on all those output files to get our matrices. We use SGE for this, and submit to the cluster. The option -S says use bash as the shell, and the option -V says use your current environment variables, so that if clustalw or protdist are in your PATH they will be available on the nodes.
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 ~/PhageProteomicTree/clustalw.sh
This gives us a job ID, which we use as the -hold_jid value in this next submission so that it is held until the alignments are done:
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 -hold_jid <previous job number> ~/PhageProteomicTree/protdist.sh
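If your SGE installation supports qsub -terse, you can capture the job ID automatically rather than copying it by hand. A sketch (array jobs print the job ID followed by the task range, so we strip everything after the first dot):

JID=$(qsub -terse -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 ~/PhageProteomicTree/clustalw.sh | cut -d. -f1)
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 -hold_jid $JID ~/PhageProteomicTree/protdist.sh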
Next, we need to clean up the output. First, remove any empty protdist files, and then reformat the rest so that they have one result per line:
find protdist -type f -size 0 -delete
perl ../rewrite_protdists.pl protdist protdist.fixed
In this step, we combine all the individual distance matrices from protdist into a single distance matrix.
The matrix composition is almost complete; we just need to combine everything into a single matrix. We need to know how many genomes we expect, so let's set that as a shell variable, and then submit the matrix code to the cluster for calculation. This uses combineprotdists.pl to calculate a single matrix (well, as we'll see, it actually makes two matrices).
NUMGENOMES=$(wc -l genome_names.txt | sed -e 's/\s\+genome_names.txt//')
qsub -cwd -S /bin/bash -V -v NUMGENOMES=$NUMGENOMES -o sge_output -e sge_output ~/PhageProteomicTree/matrix.sh
This step makes two output files, matrix and matrix.nosubreplicates. The first has the distance measure plus the number of calculations used to generate that distance measure; the second has just the distance measure. We will only use the second file.
In the final step, we make a neighbor-joining tree and rename the tree leaves using the abbreviations generated in the Generate the ID files step above.
mkdir neighbor
cp matrix.nosubreplicates neighbor/infile
cd neighbor
neighbor
[For neighbor I usually randomize the input order; see the scripted version after this block.]
cp outtree ../raw.tree
cd ..
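To script the jumble rather than typing at the menu, you can pipe the keystrokes in and run this in the neighbor directory in place of the interactive call above. A minimal sketch, assuming the standard PHYLIP menu (the letters and prompts vary slightly between versions, and the seed must be odd):

# J toggles "randomize input order", 13 is the odd random number seed, Y accepts the menu and runs
printf 'J\n13\nY\n' | neighbor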
perl ../rename_tree_leaves.pl genome_id.map raw.tree > renamed_full.tree
Here is the final tree. To generate this image, I opened the renamed_full.tree file in the awesome FigTree.
We have included most of the important files in this repo; if we forgot some, let Rob know. As you will note, some of the intermediate files are missing from this repository because we are limited on repository size.
Evelien reannotated all the genomes and provided GFF files, but GFF is not a record-based file format (yes, I know that explanation comes from Biopython, but it holds for BioPerl too). Therefore, before we begin, we convert them to GenBank files. Hopefully these files will be useful to others, too.
Before we do any analysis, we are going to change the internal field separator in bash, which is normally whitespace, because we want to process input line by line:
export IFS=$'\n'
We start by converting GFF to GenBank with seqret from EMBOSS, but this converter does not create the header information, because it is not included in the GFF files.
for i in *.gff;
do o=$(echo $i | sed -e 's/\.gff$/.gbk/');
seqret -sequence $i -outseq $o -osformat2 genbank -sformat1 gff -feature 1;
done
We also need a mapping file that tells us the original GenBank filename and the new filename converted from the GFF. Unfortunately, there was no way to make that except by hand.
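Judging from the loop below, mapping.tsv is tab-separated with the reannotated (PROKKA) filename in the first column and the original GenBank filename in the second; a hypothetical line:

PROKKA_NC_000001.gbk	NC_000001.gbk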
Once we have those two files, we read a GenBank file, delete the original ORF calls, read the ORF calls from the corresponding GenBank file created from the GFF, and add them to the backbone of the original GenBank file. This is a convoluted way of making the sequence files, but it has two advantages: first, we keep track of the organisms so we can compare them to others' taxonomies, and second, the resulting GenBank files work with the entire downstream analysis pipeline.
Here is how to reannotate the ORFs in the GenBank files:
IFS=$'\n'
for I in $(cat mapping.tsv);
do
G=$(echo $I | cut -f 2 -d$'\t');      # original GenBank filename
R=$(echo $I | cut -f 1 -d$'\t');      # reannotated (PROKKA) filename
O=$(echo $R | sed -e 's/PROKKA_//'); # output filename, with the PROKKA_ prefix stripped
echo -e "$G\t$R\t$O";
perl merge_annotations.pl genomes/$G reannotated_genomes/$R reannotated_genomes_merged/$O;
done
Now we have a directory called reannotated_genomes_merged that contains all the genbank files that we will process as above.
Let's repeat those commands in a few shell blocks with no explanations:
perl ~/PhageProteomicTree/genbank_code/genbank2table.pl -t -p proteins.faa -i genome_names.txt genomes/*
perl ~/PhageProteomicTree/protein_genome_length.pl proteins.faa > protein_genome_length.txt
cut -f 3 -d$'\t' genome_names.txt | sort | uniq -c | sort -n
makeblastdb -in proteins.faa -dbtype prot
perl ~/PhageProteomicTree/split_blast_queries_edwards_blastplus.pl -f proteins.faa -n 100 -p blastp -db proteins.faa -d phage_proteins_blast -e 0.1
and then
# when this is done concatenate everything
cat phage_proteins_blast/*blastp > phage_proteins.blastp
mkdir fastafiles
java -jar ~/PhageProteomicTree/ppt.jar phage_proteins.blastp 0.1 proteins.faa fastafiles
perl ~/PhageProteomicTree/make_id_maps.pl id.map protein_genome_length.txt genome_names.txt
NUMFILES=$(ls fastafiles/ | sed -e 's/fasta.//' | sort -nr | head -n 1); echo "There are $NUMFILES files to process"
for DIR in sge_output aligned dnd protdist; do if [ ! -e $DIR ]; then mkdir $DIR; fi; done
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 ~/PhageProteomicTree/clustalw.sh
and then
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 -hold_jid PREV_JID ~/PhageProteomicTree/protdist.sh
Once that is complete, we have two bash scripts that run the remainder of the steps. This allows us to submit them to the cluster, held until the earlier commands are complete:
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -hold_jid PREV_JID ~/PhageProteomicTree/bash_script3.sh
That script submits another job to the cluster, and then you need to run this one:
qsub -cwd -S /bin/bash -V -o sge_output/ -e sge_output/ -hold_jid PREV_JID ~/PhageProteomicTree/bash_script4.sh
Once the analysis is complete, we visualize the tree in FigTree to add the colors. That results in the colored tree file.