Heavily commented, quick script for processing sequence files for length and basepair composition
This is a pure bash script that takes a sequence file and reports its base composition and its total length, both gapped and ungapped. The script is meant to be a portable educational tool for people just learning bash scripting (such as myself).
It's not the best-optimized script by any stretch of the imagination, but it's simple enough that all of its components should be useful to any amateur researcher looking for simple, practical code examples.
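As a rough illustration of the kind of counting described above, here is a minimal sketch using only bash built-ins. It is not the code from seq_counter.sh: the function name `count_bases` is hypothetical, and it assumes that gaps are written as `N` or `-`, which may not match how the actual script defines "gapped" and "ungapped".

```bash
#!/usr/bin/env bash
# Sketch only, not the real seq_counter.sh.
# Assumption: gap characters are 'N' and '-'.

count_bases() {
    local fasta=$1 line seq=""

    # Read the file line by line and skip FASTA header lines ('>').
    while IFS= read -r line || [[ -n $line ]]; do
        [[ $line == ">"* ]] && continue
        # Strip whitespace (including stray carriage returns) and append.
        seq+=${line//[[:space:]]/}
    done < "$fasta"

    seq=${seq^^}   # uppercase everything (needs bash 4 or newer)

    local base stripped
    for base in A T C G; do
        # Removing every copy of a base and comparing lengths gives its count.
        stripped=${seq//$base/}
        echo "$(( ${#seq} - ${#stripped} )) $base"
    done

    # Ungapped length: the sequence with assumed gap characters removed.
    local ungapped=${seq//[N-]/}
    echo "gapped ${#seq}"
    echo "ungapped ${#ungapped}"
}
```

Calling `count_bases control_100bp.fasta` prints one count per base in the order A, T, C, G, followed by the two lengths. Holding the whole sequence in one shell variable keeps the sketch short, but it will be slow on chromosome-sized files.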
For reference, on a netbook with a Celeron N3060 processor and 4 GB of RAM running Lubuntu 20.04, human chromosome 1 from GRCh38.p13 (NC_000001.11, a file of about 240.8 MB) takes the following time from start to finish:
real 3m46.775s user 3m23.530s sys 0m16.992s
Agrobacterium tumefaciens strain GCF_900045375.1 takes the following time from start to finish:
real 0m5.423s user 0m4.901s sys 0m0.467s
The repo contains a 100 bp positive-control FASTA file generated by a DNA synthesis script from https://github.com/naturepoker/dna-synth. Running the command below should produce the following output.
./seq_counter.sh control_100bp.fasta
##################################################
Processing control_100bp.fasta
##################################################
##################################################
Total sequence composition is as follows
--------------------------------------------------
18 A
27 T
29 C
26 G
--------------------------------------------------
Total gapped sequence length is: 100
--------------------------------------------------
Total ungapped sequence length is: 100
--------------------------------------------------
GC content in control_100bp.fasta is 55.00 %
##################################################
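The GC figure in the sample output can be checked by hand from the four counts: (29 + 26) / 100 = 55.00 %. Bash has no floating-point arithmetic, so one way to get two decimal places is to scale the integers first. The sketch below is an assumption about how such a calculation could be done, not the method seq_counter.sh actually uses. `gc_percent` is a hypothetical helper, and it divides by the sum of A, T, C and G, which in this control file equals both the gapped and the ungapped length.

```bash
#!/usr/bin/env bash
# Sketch only: GC percentage with two decimals using integer arithmetic.
# gc_percent is a hypothetical helper, not taken from seq_counter.sh.

gc_percent() {
    local a=$1 t=$2 c=$3 g=$4
    local total=$(( a + t + c + g ))

    # Guard against division by zero on an empty sequence.
    (( total > 0 )) || { echo "no bases counted" >&2; return 1; }

    # Scale by 10000 to keep two decimal places, rounding to nearest.
    local scaled=$(( ( (c + g) * 10000 + total / 2 ) / total ))
    printf '%d.%02d\n' $(( scaled / 100 )) $(( scaled % 100 ))
}

# Counts from the control_100bp.fasta output above, in the order A T C G.
gc_percent 18 27 29 26   # prints 55.00
```

Piping the numbers to `bc` or `awk` would also work, at the cost of depending on an external tool.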