Module: Alphabets
Hannes Hauswedell edited this page Mar 13, 2017 · 10 revisions
alph/alphabet.hpp // Alphabet concept, generic ostream, serialization
alph/alphabet_container.hpp // generic alphabet traits for basic_string adaptation and string->ostream adapters; container-serialization
alph/nucl/dna4.hpp // dna alphabet definition; alias from dna4 to dna
alph/nucl/dna4_container.hpp // dna traits specialization; aliases for vector; literal
alph/nucl/dna5.hpp // plus N
alph/nucl/dna5_container.hpp
alph/nucl/nucl16.hpp // full IUPAC code, U and T as distinct characters
alph/nucl/nucl16_container.hpp
alph/nucl/rna4.hpp // rna4 alphabet definition; alias from rna4 to rna; inherits dna4
alph/nucl/rna4_container.hpp // n; aliases for vector; literal
alph/nucl/rna5.hpp // ...
alph/nucl/rna5_container.hpp
alph/nucl/conversion.hpp // code for converting between different nucl alphabets and containers
alph/nucl/conversion_container.hpp // code for converting containers; view implementation
alph/aminoacid.hpp
alph/aminoacid/aa27.hpp // amino acid (27 letter code)
alph/aminoacid/aa27_container.hpp
alph/aminoacid/aa10murphy.hpp // murphy reduction (10 letter code)
alph/aminoacid/aa10murphy_container.hpp
alph/aminoacid/conversion.hpp // code for converting between different amino acid alphabets and containers
alph/aminoacid/conversion_container.hpp // code for converting containers; view implementations
alph/quality.hpp
alph/quality/phred.hpp // phred quality scores
alph/gaps.hpp
alph/gaps/gapped_alphabet.hpp // an alphabet that wraps another alphabet and adds a gap character
alph/gaps/gaps.hpp // a stand-alone 0-or-1 alphabet that can be included in a compound alphabet
alph/translation.hpp // code for translating nucl -> amino acid
- the rna* alphabets inherit the corresponding dna* alphabets and just overwrite the value_to_char static member
- there will be an alias from dna4 to dna and from rna4 to rna
- quality is an independent alphabet (likely won't be implemented during the retreat)
- there will be a compound_alphabet concept, where a character can consist of multiple characters, e.g. dna5 and phred; by default this will use two bytes, but bit-compressing containers may/shall compress this to less than a byte
- we support general containers like std::vector. std::string will work, but only in a limited fashion (and isn't recommended)
- should the default dna be dna4 or dna16?
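The inheritance idea from the first two bullets can be sketched as follows. This is only an illustration under assumptions, not the module's actual code: the member name value, the free function to_char and the use of std::array are hypothetical, and the sketch assumes C++17 so that static constexpr members are implicitly inline.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical sketch: only dna4, rna4, dna, rna and value_to_char are names
// taken from the notes above; everything else is an assumption.
struct dna4
{
    std::uint8_t value; // rank of the letter, expected to be in [0, 4)

    static constexpr std::array<char, 4> value_to_char{'A', 'C', 'G', 'T'};
};

struct rna4 : dna4
{
    // Hides dna4::value_to_char; only the last letter differs.
    static constexpr std::array<char, 4> value_to_char{'A', 'C', 'G', 'U'};
};

// The planned aliases.
using dna = dna4;
using rna = rna4;

// Both types stay trivial and standard-layout despite the inheritance.
static_assert(std::is_trivial<rna4>::value && std::is_standard_layout<rna4>::value,
              "rna4 should remain a POD type");

// Generic conversion: picks the table of whichever alphabet is passed in.
template <typename alphabet_t>
constexpr char to_char(alphabet_t const letter)
{
    assert(static_cast<std::size_t>(letter.value) < alphabet_t::value_to_char.size());
    return alphabet_t::value_to_char[letter.value];
}
```

With this layout the same rank is printed as T through dna4 and as U through rna4, while the stored data is identical.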
There are two general designs one can pick: either save the numeric rank of each alphabet letter internally, i.e. 0, 1, 2, 3 for dna (or an enum with these values) = the "rank approach",
OR save the actual char values internally, i.e. 65, 67, 71, 84 for 'A', 'C', 'G', 'T' = the "char like" approach.
In any case we would want the type to be POD and be somewhat usable in both general containers (e.g. std::vector) and in std::basic_string.
pro (rank approach):
- no conversion when working with rank which is what you do in indexing and alignment and all serious work-loads
- default initialized alphabet character has valid alphabet value, because 0 is a valid alphabet value
con:
- no initialization from char, because no user defined constructors (user defined assignment works, but would be confusing to have one and not the other)
- implicit conversion to char possible, but implicit conversion to numeric value makes more sense; explicit conversion to char ok, this implies table lookup
- can be used in basic_strings, but some things are broken, e.g. the basic_string's char_traits' strlen function and possibly some other things [because they expect a 0 terminator which is a member of our alphabet]
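A minimal sketch of the rank approach may make these points concrete. The type name dna4_rank, the member rank and the helper terminator_scan_length are hypothetical and chosen only for this illustration. Unknown characters are mapped to rank 0 here, which is an assumption of the sketch and not a decision taken in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical rank-based letter; all names are assumptions for illustration.
struct dna4_rank
{
    std::uint8_t rank; // 0, 1, 2, 3 stand for A, C, G, T

    // No constructors, so the type stays an aggregate and a POD.
    // Assignment from char is still possible because it is not a
    // copy assignment operator and does not affect triviality.
    dna4_rank & operator=(char const c) noexcept
    {
        switch (c)
        {
            case 'C': rank = 1; break;
            case 'G': rank = 2; break;
            case 'T': rank = 3; break;
            default:  rank = 0; break; // 'A' and, in this sketch, anything unknown
        }
        return *this;
    }

    // Explicit conversion to char costs a table lookup.
    explicit operator char() const noexcept
    {
        constexpr char table[4] = {'A', 'C', 'G', 'T'};
        return table[rank & 3]; // mask keeps the index inside the table
    }
};

static_assert(std::is_trivial<dna4_rank>::value &&
              std::is_standard_layout<dna4_rank>::value,
              "the rank type should be a POD");

// Mimics a strlen-style scan: it stops at the first zero element,
// which for this alphabet is also the valid letter 'A'.
inline std::size_t terminator_scan_length(dna4_rank const * const s)
{
    std::size_t n = 0;
    while (s[n].rank != 0)
        ++n;
    return n;
}
```

For the sequence C, A, G followed by a zero element, such a scan reports a length of 1 because it stops at the A, which is the kind of breakage described in the last point above.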
pro (char-like approach):
- reading as char is free, because the internal char needs no conversion
- can be assigned and constructed from char, although this implies a narrowing conversion to the actual alphabet
- easier conversion between alphabets, needs only n tables for n alphabets, not n²
- can be used in basic_strings with no problem; better compatibility with c-strings
con:
- getting the rank (or ord-value) implies a table lookup which may be expensive in the long run, especially since this is used very often
- default initialized character has value 0 which is outside the alphabet
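For comparison, the char-like approach could look roughly like the sketch below. The names dna4_char, narrow, make_rank_table and to_rank are hypothetical, and mapping unknown characters to 'A' is an assumption made only to show the narrowing conversion.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical char-based letter; all names are assumptions for illustration.
struct dna4_char
{
    char value; // stores 'A', 'C', 'G' or 'T' directly

    dna4_char() = default; // keeps the type trivial

    // Construction from char narrows to the alphabet.
    dna4_char(char const c) noexcept : value{narrow(c)} {}

    // Reading as char is free: no conversion is needed.
    operator char() const noexcept { return value; }

    // Getting the rank requires a table lookup on every call.
    std::uint8_t to_rank() const noexcept
    {
        static std::array<std::uint8_t, 256> const table = make_rank_table();
        return table[static_cast<unsigned char>(value)];
    }

    // Maps any char onto a letter of the alphabet (unknown -> 'A' in this sketch).
    static char narrow(char const c) noexcept
    {
        switch (c)
        {
            case 'A': case 'C': case 'G': case 'T': return c;
            default: return 'A';
        }
    }

    // Builds the char -> rank table; entries not set explicitly stay 0.
    static std::array<std::uint8_t, 256> make_rank_table()
    {
        std::array<std::uint8_t, 256> t{};
        t['C'] = 1;
        t['G'] = 2;
        t['T'] = 3;
        return t;
    }
};

static_assert(std::is_trivial<dna4_char>::value &&
              std::is_standard_layout<dna4_char>::value,
              "the char-like type should be a POD");
```

A value-initialized dna4_char holds the char value 0, which is not a letter of the alphabet, and every rank query goes through the 256-entry table. These are the two drawbacks listed above.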
We decided on the rank approach for performance reasons.