CreeveyLab/fast_triplet_count

fast_triplet_count

This tool reads a fasta file (either gzipped or uncompressed) and calculates the frequency of all triplets. It is designed to do this extremely quickly so that it can be run on thousands of genomes. When run on a standard desktop computer on all 44K prokaryotic genomes from Ensembl (using a bash 'for' loop to run on all the genomes), it completes in under 1 hour 40 minutes (approx. 6 genomes per second).

Prerequisites:

zlib - which should come as standard on most operating systems.

Compilation:

cc fast_triplet_count.c -o fast_triplet_count -lz

Overview:

The tool reads a fasta file containing one or more sequences and counts all (overlapping) triplets. The input fasta file can be gzipped or uncompressed.
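The overlapping count can be pictured as a sliding window of three bases that advances one position at a time. The following is a minimal sketch of that idea, not the code in fast_triplet_count.c. The names `base_code`, `count_triplets`, `N_TRIPLETS` and `XXX_SLOT` are hypothetical. It also assumes that lowercase bases are treated like uppercase ones and that any window containing a non-standard character is counted as XXX; neither assumption is confirmed by this README.

```c
#include <stddef.h>

#define N_TRIPLETS 64   /* 4^3 standard triplets */
#define XXX_SLOT   64   /* extra slot: windows containing a non-standard base (assumption) */

/* Map a base to a 2-bit code in T, C, A, G order; -1 for anything else.
 * Accepting lowercase letters is an assumption made for this sketch. */
static int base_code(char b)
{
    switch (b) {
    case 'T': case 't': case 'U': case 'u': return 0;
    case 'C': case 'c': return 1;
    case 'A': case 'a': return 2;
    case 'G': case 'g': return 3;
    default: return -1;
    }
}

/* Count overlapping triplets in one sequence of length len.
 * counts needs N_TRIPLETS + 1 entries; entries are incremented, not reset,
 * so the same array can accumulate over several sequences. */
void count_triplets(const char *seq, size_t len,
                    unsigned long counts[N_TRIPLETS + 1])
{
    unsigned idx = 0;  /* rolling 6-bit index built from the last standard bases */
    int valid = 0;     /* consecutive standard bases ending here, capped at 3 */

    for (size_t i = 0; i < len; i++) {
        int code = base_code(seq[i]);
        if (code < 0) {
            valid = 0;                      /* window is spoiled for 3 positions */
        } else {
            idx = ((idx << 2) | (unsigned)code) & 0x3Fu;
            if (valid < 3)
                valid++;
        }
        if (i >= 2) {                       /* a full window ends at position i */
            if (valid == 3)
                counts[idx]++;
            else
                counts[XXX_SLOT]++;
        }
    }
}
```

Calling `count_triplets` once per sequence, with a zero-initialised array of 65 counters, keeps windows from spanning two sequences. Whether the real tool resets the window at each fasta header is likewise an assumption here.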

The output is printed to the standard output stream.

The output consists of a single line with the following fields:

Filename -> the file provided as input.

Length -> total length in nucleotides of all contigs/chromosomes/sequences in the file

T_count -> number of nucleotide bases with a T/U

C_count -> number of nucleotide bases with a C

A_count -> number of nucleotide bases with an A

G_count -> number of nucleotide bases with a G

N_count -> number of nucleotide bases with a non-standard character (i.e. not a T/U, C, A or G)

These fields are followed by the number of (overlapping) instances of each of the 64 possible triplets, in the following order:

TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG

Finally, the number of non-standard triplets, denoted by XXX.
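The triplet order listed above follows a regular pattern: the bases run T, C, A, G, with the first base changing slowest and the third base changing fastest. As a result, the column position of any triplet can be computed rather than looked up. The sketch below illustrates this. The names `BASES`, `triplet_name` and `triplet_index` are hypothetical and are not taken from the tool's source.

```c
/* Bases in the order used by the triplet list: T, C, A, G. */
static const char BASES[4] = { 'T', 'C', 'A', 'G' };

/* Write the name of the triplet at position idx (0..63) of the ordered
 * list into out, which must hold 4 chars (3 bases plus the terminator). */
void triplet_name(unsigned idx, char out[4])
{
    out[0] = BASES[(idx >> 4) & 3u];   /* first base changes slowest */
    out[1] = BASES[(idx >> 2) & 3u];
    out[2] = BASES[idx & 3u];          /* third base changes fastest */
    out[3] = '\0';
}

/* Inverse mapping: position (0..63) of an uppercase T/C/A/G triplet,
 * or -1 if any of its three characters is not one of those bases. */
int triplet_index(const char *t)
{
    int idx = 0;
    for (int i = 0; i < 3; i++) {
        int code = -1;
        for (int b = 0; b < 4; b++) {
            if (t[i] == BASES[b])
                code = b;
        }
        if (code < 0)
            return -1;
        idx = idx * 4 + code;
    }
    return idx;
}
```

With this numbering TTT is position 0, ATG is position 35 and GGG is position 63. The XXX count comes after the 64 triplets and is not covered by either function.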

Example

Given a gzipped fasta file called test1.fas.gz (as provided in the repository) containing the following:

>test
ATGTGAATGAAACACT

Call the software as follows:

fast_triplet_count test1.fas.gz

This sequence contains the following overlapping triplets:

ATGTGAATGAAACACT
ATG
 TGT
  GTG
   TGA
    GAA
     AAT
      ATG
       TGA
        GAA
         AAA
          AAC
           ACA
            CAC
             ACT

The output written to the standard output stream would be:

File_Name Length T_count C_count A_count G_count N_count TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC AAA AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG GAT GAC GAA GAG GGT GGC GGA GGG XXX
test.fas 16 4 2 7 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0
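For downstream processing, a data line like the one above can be read back into a struct and sanity-checked. The following is a minimal sketch. It assumes that the fields are separated by whitespace and that the file name contains no spaces; the README does not state the exact delimiter. The names `triplet_record`, `parse_record`, `base_total` and `triplet_total` are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

#define N_COUNTS 65   /* 64 triplet columns plus the final XXX column */

struct triplet_record {
    char name[256];                       /* File_Name column */
    unsigned long length, t, c, a, g, n;  /* Length and the five base counts */
    unsigned long triplets[N_COUNTS];     /* TTT ... GGG, then XXX */
};

/* Parse one data line. Returns 0 on success, or -1 if the line does not
 * have the expected shape (for example, the header line). */
int parse_record(const char *line, struct triplet_record *rec)
{
    int used = 0;
    if (sscanf(line, "%255s %lu %lu %lu %lu %lu %lu%n",
               rec->name, &rec->length, &rec->t, &rec->c,
               &rec->a, &rec->g, &rec->n, &used) != 7)
        return -1;

    const char *p = line + used;
    for (int i = 0; i < N_COUNTS; i++) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);  /* skips leading whitespace */
        if (end == p)
            return -1;                           /* ran out of numbers */
        rec->triplets[i] = v;
        p = end;
    }
    return 0;
}

/* Sum of the five base counts; expected to equal Length. */
unsigned long base_total(const struct triplet_record *rec)
{
    return rec->t + rec->c + rec->a + rec->g + rec->n;
}

/* Sum of all 65 triplet columns, including XXX. */
unsigned long triplet_total(const struct triplet_record *rec)
{
    unsigned long sum = 0;
    for (int i = 0; i < N_COUNTS; i++)
        sum += rec->triplets[i];
    return sum;
}
```

For the example line above, the base counts sum to 16 (the Length field) and the triplet columns sum to 14, which is Length minus 2, as expected for a single sequence. For files with several sequences the triplet total depends on whether windows span sequence boundaries, which this README does not specify.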

The output can be redirected to a file as follows:

fast_triplet_count test1.fas.gz > test1.tripletCount.txt

Alternatively, to run the tool on many (thousands of) genomes, call it from a bash for loop as follows:

for i in *.fas.gz; do fast_triplet_count "$i"; done > alltriplet_counts.txt
