-
Notifications
You must be signed in to change notification settings - Fork 47
Open
Description
I am trying to make a guide for myself to create a cgMLST or even a wgMLST scheme. I hope this helps others. I suggest adding this in some way to the documentation, although I might still go with @tseemann's suggestion and just go with Chewie. Anyway, if you're like me and just have to try it out first to convince yourself...
Step 1: download the whole scheme from https://chewbbaca.online into a folder Listeria_monocytogenes.chewbbaca
Step 2: add git to the scheme so that you can check it back out in case you make a mistake.
cd Listeria_monocytogenes.chewbbaca
git init
git add *
git add .*
git commit -m init
git tag v1
Step 2b, undocumented: make sure there are no deflines with *
indicating custom alleles not in the universal set of alleles.
Step 3a, not documented here: install mlst
Step 3b: mlst db creation
# load dependencies
module load mlst ncbi-blast+ perl
cd Listeria_monocytogenes.chewbbaca
# Check if any long IDs are present.
# If so, makeblastdb in mlst-build will not run
grep ">" -h *.fasta | perl -lane 'print if(length($_) > 50);' | head
# If long IDs, checkout v1 which I already versioned
# when I downloaded from chewbbaca.online
git checkout v1 -- '*'
pushd ~/bin/mlst/db/pubmlst
mkdir lmonocgmlst
cd lmonocgmlst
cp Listeria_monocytogenes.chewbbaca/*.fasta .
# rename to tfa. Remove unnecessary prefix to filenames (although they are necessary elsewhere especially for Chewie)
for i in *.fasta; do mv $i $(basename $i .fasta).tfa; done;
for i in *.tfa; do target=$(sed 's/Pasteur_//' <<< $i); mv $i $target; done;
for i in *.tfa; do target=$(sed 's/-//' <<< $i); mv $i $target; done;
# Remove Pasteur_ prefix to avoid locus/allele confusion in mlst
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/Pasteur_//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Remove the dash too
perl -MFile::Copy=mv -MBio::SeqIO -e 'for $f(glob("*.tfa")){print "$f\n"; $in=Bio::SeqIO->new(-file=>$f); $out=Bio::SeqIO->new(-file=>">$f.tmp.fasta"); while($seq=$in->next_seq){$id=$seq->id; $id=~s/-//; $seq->id($id); $out->write_seq($seq);} mv("$f.tmp.fasta", $f);}'
# Make the scheme file
(echo -ne "ST\t"; \ls *.tfa | sed 's/.tfa//' | tr '\n' '\t'; echo clonal_complex
# fake genotype to avoid undefined genotype error
echo -ne 1
for i in *.tfa; do echo -ne "\t1"; done;
echo;
) > lmonocgmlst.txt
# Run automated makeblastdb to add the new scheme
mlst-make_blast_db
# ensure it worked
mlst --longlist | grep lmonocgmlst | perl -lane 'print "@F[0..7] ..."'
popd
cd my/Lmono/assemblies/
mlst --scheme lmonocgmlst *.fasta --threads 4 > lmonocgmlst.tsv
Metadata
Metadata
Assignees
Labels
No labels