
Full genome (Gf)

Adrián A. Davín edited this page Jul 18, 2019 · 9 revisions

Gf - Full genome (Gf)

The Gf mode can be used to simulate intergenic regions and how their lengths are affected by different events.

To better understand this, let us look at the figure below, which shows an example genome with 5 genes and 5 intergenic regions. Their respective lengths are listed in the tables on the right. For the sake of clarity, a linear version of the genome is also shown.

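[Figure: example genome with 5 genes and 5 intergenic regions, tables of their lengths, and a linear version of the genome]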

We will use this genome to illustrate how events affect gene and intergene regions. But first of all, let us see how the length of a segment affected by an event is determined. Zombi starts by randomly choosing a position within the intergenes. Notice that intergene region 4 (between D and E) has a length of 0 but can still be chosen, since what is selected is a position between "nucleotides" and not the nucleotides themselves. In the figure below, the sampled position is indicated by the black arrow.

Second, a direction is chosen randomly (either left or right), and the extension of the event is obtained by sampling from a geometric distribution (more details on this at the end of this explanation). In a first attempt (see the table), the direction is "right" and the extension of the event is 6.

In this case, the event ends within a gene, which is considered a failure. A new direction and a new extension are chosen randomly until the event ends within an intergene region. In the example this happens on the 4th attempt. By default, if there is no success after 50 trials, the event does not take place.

[Figure: sampled position (black arrow) and table of the attempted directions and extensions]
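The procedure above can be sketched in a few lines of Python. This is only an illustration, not Zombi's actual code: the genome lengths, the function names and the `(kind, length)` representation are hypothetical, the starting position is assumed to be uniform over all positions that fall in intergenes, and the geometric sampler is a plain inverse-transform stand-in.

```python
import math
import random

# Hypothetical genome: (kind, length) segments laid out on a circle.
# The fourth intergene has length 0, as in the example above.
SEGMENTS = [("gene", 6), ("intergene", 3), ("gene", 4), ("intergene", 5),
            ("gene", 5), ("intergene", 2), ("gene", 3), ("intergene", 0),
            ("gene", 4), ("intergene", 4)]


def intergene_spans(segments):
    """Return the (start, end) boundary coordinates of each intergene and the genome length."""
    spans, pos = [], 0
    for kind, length in segments:
        if kind == "intergene":
            spans.append((pos, pos + length))
        pos += length
    return spans, pos


def in_intergene(x, spans, total):
    """True if boundary x lies in an intergene (boundaries 0 and total coincide)."""
    return any(s <= y <= e for s, e in spans for y in (x, x + total))


def geometric(p, rng):
    """Sample from a geometric distribution on 1, 2, ... by inverse transform."""
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1


def sample_event(segments, p, rng, max_trials=50):
    """Return (start, direction, extension), or None if every trial fails."""
    spans, total = intergene_spans(segments)
    # An intergene of length n offers n + 1 positions, so length 0 is still eligible.
    weights = [e - s + 1 for s, e in spans]
    s, e = rng.choices(spans, weights=weights)[0]
    start = rng.randint(s, e)
    for _ in range(max_trials):
        direction = rng.choice(("left", "right"))
        extension = geometric(p, rng)
        if extension >= total:
            continue  # assumption: an event cannot wrap around the whole genome
        step = extension if direction == "right" else -extension
        if in_intergene((start + step) % total, spans, total):
            return start, direction, extension
    return None


print(sample_event(SEGMENTS, p=0.2, rng=random.Random(1)))
```

Positions are treated as boundaries between nucleotides, which is why a gene/intergene junction counts as an intergene position and why the zero-length intergene can be hit.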

Now let us look at the different events, beginning with originations.

[Figure]

Originations of new genes take place at a position within the intergenes, but they do not have an extension parameter. The new gene is inserted in an intergene region, which is cut in two.
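As a minimal sketch of how an origination splits an intergene, assuming a hypothetical genome stored as a list of `(kind, length)` segments (the function name and the representation are illustrative, not taken from Zombi):

```python
def originate(segments, index, offset, gene_length):
    """Insert a new gene `offset` nucleotides into the intergene at `index`."""
    kind, length = segments[index]
    if kind != "intergene" or not 0 <= offset <= length:
        raise ValueError("origination needs a position inside an intergene")
    # The intergene is cut in two pieces whose lengths add up to the original.
    left, right = ("intergene", offset), ("intergene", length - offset)
    return segments[:index] + [left, ("gene", gene_length), right] + segments[index + 1:]


genome = [("gene", 4), ("intergene", 5), ("gene", 3), ("intergene", 2)]
print(originate(genome, index=1, offset=2, gene_length=6))
# [('gene', 4), ('intergene', 2), ('gene', 6), ('intergene', 3), ('gene', 3), ('intergene', 2)]
```

The total intergene length is unchanged by the event; only the number of intergene regions grows by one.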

[Figure]

When a transposition takes place, the segment containing genes and intergenes is cut and inserted at a different position (outside the cut segment).

[Figure]

Inversions change the orientation of the affected genes.
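A toy sketch of an inversion, again on a hypothetical string where letters are gene nucleotides and `-` are intergene nucleotides. Letter case is used here to mark orientation (upper case for one strand, lower case for the other), which is purely an assumption of this example.

```python
def invert(genome, start, end):
    """Reverse genome[start:end] and flip the orientation (case) of its genes."""
    segment = genome[start:end]
    return genome[:start] + segment[::-1].swapcase() + genome[end:]


print(invert("AAA--BBBB---CC-", start=4, end=14))
# AAA-cc---bbbb--
```

Besides flipping orientation, the inverted segment also reverses the order of the genes and intergenes it contains.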

[Figure]

Transfers move genes between different genomes. Replacement transfers are not considered in this model.

[Figure]

Duplications copy the selected segment of the genome and paste the copy right after the end of the segment.
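On a hypothetical string genome (letters for gene nucleotides, `-` for intergene nucleotides, boundaries assumed to lie inside intergenes), a duplication could be sketched like this:

```python
def duplicate(genome, start, end):
    """Paste a copy of genome[start:end] immediately after position `end`."""
    return genome[:end] + genome[start:end] + genome[end:]


print(duplicate("AAA--BBBB---CC-", start=4, end=10))
# AAA--BBBB--BBBB---CC-
```

Because the copied segment carries its flanking intergene pieces, the duplication lengthens the genome by exactly the length of the segment.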

[Figure]

There are two types of losses. How often each one occurs is set by the parameter PSEUDOGENIZATION (between 0 and 1), which gives the probability that a loss event is also a pseudogenization event.

If the loss is not a pseudogenization, the affected region is removed completely from the genome. Otherwise, the affected genes become intergene regions.
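The two outcomes of a loss can be illustrated with a small sketch. The string representation (letters for gene nucleotides, `-` for intergene nucleotides) and the function name are hypothetical; only the role of PSEUDOGENIZATION as a probability comes from the description above.

```python
import random


def lose(genome, start, end, pseudogenization, rng):
    """Apply a loss to genome[start:end]; pseudogenize with the given probability."""
    if rng.random() < pseudogenization:
        # Pseudogenization: the genes stay in place but turn into intergene.
        return genome[:start] + "-" * (end - start) + genome[end:]
    # Plain loss: the whole region disappears.
    return genome[:start] + genome[end:]


rng = random.Random(0)
print(lose("AAA--BBBB---CC-", 4, 10, pseudogenization=0.0, rng=rng))
# AAA---CC-
print(lose("AAA--BBBB---CC-", 4, 10, pseudogenization=1.0, rng=rng))
# AAA---------CC-
```

A plain loss shortens the genome, while a pseudogenization keeps its length and merges the neighbouring intergenes with the former gene into one longer intergene.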

Output

Besides the files of the basic mode G, the Gf mode also outputs the gene and intergene lengths in the folder G/Genomes, in the files _LENGTHS.tsv.

Parameters

INTERGENE_LENGTH

The mean length of the intergene regions is controlled by the parameter INTERGENE_LENGTH. When the simulation starts, the genome is created with a number of genes equal to INITIAL_GENOME_SIZE and an equal number of intergene regions. The length of each intergene region is determined by sampling proportions from a flat Dirichlet distribution with k = INITIAL_GENOME_SIZE and multiplying them by k * INTERGENE_LENGTH, so that the intergene lengths sum to k * INTERGENE_LENGTH.
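A short sketch of this initialization, using only the standard library. Normalizing k independent Exp(1) draws is a standard way to obtain a flat Dirichlet sample and is swapped in here for illustration; the function name is hypothetical and the lengths are left as floats, since how Zombi rounds them is not stated.

```python
import random


def initial_intergene_lengths(initial_genome_size, intergene_length, rng):
    """Sample one intergene length per gene from a scaled flat Dirichlet."""
    k = initial_genome_size
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    proportions = [d / total for d in draws]  # flat Dirichlet sample, sums to 1
    return [prop * k * intergene_length for prop in proportions]


lengths = initial_intergene_lengths(5, 100, random.Random(0))
print([round(x, 1) for x in lengths], round(sum(lengths), 1))
```

The individual lengths vary widely, but their sum is always k * INTERGENE_LENGTH, so their mean equals INTERGENE_LENGTH. With NumPy, `numpy.random.dirichlet([1] * k)` would give the same kind of proportions.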

Parametrization of the extensions

The extension of an event is sampled from a geometric distribution with parameter p equal to event_extension / (intergene_length + mean(gene_length)). So far, this model does not support extensions sampled from non-geometric distributions.
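To make the formula concrete, here is a small worked sketch. The numeric values are invented for illustration, the function names are hypothetical, and the sampler is a plain inverse-transform stand-in for a geometric distribution on 1, 2, ...

```python
import math
import random


def extension_probability(event_extension, intergene_length, gene_lengths):
    """p = event_extension / (intergene_length + mean(gene_length))."""
    mean_gene_length = sum(gene_lengths) / len(gene_lengths)
    return event_extension / (intergene_length + mean_gene_length)


def sample_extension(p, rng):
    """Draw an extension (in nucleotides) from a geometric distribution."""
    if p >= 1.0:
        return 1
    u = 1.0 - rng.random()  # uniform in (0, 1]
    return int(math.log(u) / math.log(1.0 - p)) + 1


# Hypothetical values: extension parameter 1, intergenes of 100, genes of about 900.
p = extension_probability(1, 100, [800, 900, 1000])
rng = random.Random(0)
draws = [sample_extension(p, rng) for _ in range(20000)]
print(p, sum(draws) / len(draws))
```

Since the mean of a geometric distribution is 1 / p, an extension parameter of 1 gives events that span, on average, about one gene plus one intergene in this example, and smaller values give proportionally longer events.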