
Family wise rates (Gm)

Adrián A. Davín edited this page Jul 18, 2019 · 16 revisions

This model allows the user to use gene family rates for duplication, transfer and loss events, instead of genome-wise rates. The main advantage of this model is that gene family rates are easier to estimate from real datasets. You can see more about this in Tutorial #2, where we simulate genomes using D, T and L rates inferred from a real dataset of Cyanobacteria.

What is the difference between gene-family rates and genome-wise rates?

In the basic mode G, the frequency of events is controlled using genome-wise rates. For instance, let us say that, in the basic mode G, the Duplication rate is set to 1 and the Loss rate is set to 0.5. All the other rates are set to 0. The extension parameter for both Duplications and Losses is set to 1 (one gene affected per event). This gives a total rate of 1.5. For a given branch, the occurrence time of the next event is computed by sampling a number from an exponential distribution with rate 1.5 (that is, with mean 1 / 1.5). The specific event is then sampled according to the rates (duplications are twice as likely as losses), and it affects a single, randomly chosen gene.
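The sampling just described can be sketched in a few lines of Python. This is only an illustration of the idea using the standard library, not Zombi's own code; the names `rates` and `next_event` are made up for the example.

```python
import random

# Genome-wise rates from the example above (all other rates are 0).
rates = {"D": 1.0, "L": 0.5}
total_rate = sum(rates.values())  # 1.5


def next_event(rng):
    """Return (waiting_time, event_type) for one branch."""
    # expovariate takes the rate, so the mean waiting time is 1 / 1.5.
    waiting_time = rng.expovariate(total_rate)
    # Pick the event type in proportion to its rate (D twice as likely as L).
    event = rng.choices(list(rates), weights=list(rates.values()))[0]
    return waiting_time, event


rng = random.Random(42)
events = [next_event(rng) for _ in range(10000)]
mean_wait = sum(t for t, _ in events) / len(events)
share_d = sum(e == "D" for _, e in events) / len(events)
print(round(mean_wait, 2), round(share_d, 2))
```

Over many draws the mean waiting time should approach 1 / 1.5 and about two thirds of the events should be duplications.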

In the Gm mode, some of the events (D, T, L) can use family-wise rates, specific to each gene family. Let us look at the following genome and the corresponding rates of its different Gene Families.

Example Gm

As you can see, it contains 6 genes from 3 different Gene Families. To compute the occurrence time of the next event, Zombi performs a weighted sum over all the genes in the genome and then samples from an exponential distribution. The effective rate in this example is 23 (1 for the yellow gene, 12 for the red genes and 10 for the blue genes).

If, in this example, the origination rate is set to 2, the effective rate used to compute the next event would be 25, since Originations still use genome-wise rates.
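The arithmetic can be checked with a short sketch. The number of genes per family and the per-gene rates below are assumptions chosen to match the totals in the text (1, 12 and 10); the figure itself may show a different split.

```python
# Hypothetical per-gene total rates for each family. The split is assumed;
# only the per-family sums (1, 12, 10) come from the text.
family_rate = {"yellow": 1.0, "red": 4.0, "blue": 5.0}
genome = ["yellow", "red", "red", "red", "blue", "blue"]  # 6 genes, 3 families

# Family-wise part: add up the rate of every gene in the genome.
family_wise = sum(family_rate[f] for f in genome)

# Originations remain genome-wise, so their rate is added once per genome.
origination_rate = 2.0
effective_rate = family_wise + origination_rate
print(family_wise, effective_rate)
```

Note that the family-wise part grows with the number of gene copies, while the origination rate does not.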

Think carefully about which rates you are using, because they can make the difference between not having enough events and having too many!

Extensions

If an event affects more than one gene at once, Zombi performs these computations to determine the exact genes affected:

  1. Determine the extension of the event
  2. For every starting point in the genome (every gene), compute the weighted probability of the event affecting that region, by multiplying together the rates of that event for all the genes that would be affected.
  3. Choose the starting point according to the weights computed in 2.

This can result in some interesting emergent properties, such as making genes in the vicinity of essential genes (genes with a loss rate equal to 0) harder to lose.
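A minimal sketch of steps 2 and 3, assuming a circular genome and a fixed extension of 2 genes. The function name `start_weights` and the loss rates are hypothetical, and this is not Zombi's actual implementation.

```python
import math
import random


def start_weights(gene_rates, extension):
    """Weight of each start position: product of the rates of the genes covered."""
    n = len(gene_rates)
    weights = []
    for start in range(n):
        # Assumption: the genome is circular, so segments wrap around the end.
        covered = [gene_rates[(start + k) % n] for k in range(extension)]
        weights.append(math.prod(covered))
    return weights


# Hypothetical loss rates along a genome; position 2 is an essential gene.
loss_rates = [0.5, 0.5, 0.0, 0.5, 0.5, 0.5]
weights = start_weights(loss_rates, extension=2)

# Step 3: choose the starting point in proportion to the weights.
start = random.Random(1).choices(range(len(loss_rates)), weights=weights)[0]
print(weights, start)
```

Every segment that covers the essential gene gets a weight of 0, so its neighbours can only be lost through segments that do not include it. This is the protective effect mentioned above.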

Choosing your rates

There are two main ways of choosing the rates in the simulation.

In the first case, you simply provide those rates by changing DUPLICATION, TRANSFER and LOSS in the GenomeParameters.tsv file. For example, you can use:

DUPLICATION f:0.2

TRANSFER f:0.2

LOSS u:0;0.1

All gene families will have a fixed rate of 0.2 for duplications and transfers and a random value for losses, sampled from a uniform distribution U(0, 0.1).

Remember that these rates are measured in number of events per unit of time per gene.
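To make the notation concrete, here is a sketch of how values such as `f:0.2` and `u:0;0.1` could be turned into per-family rates. The parser `sample_rate` is hypothetical and handles only the two forms shown above; Zombi's own parsing may differ.

```python
import random


def sample_rate(spec, rng):
    """Draw one family rate from a spec such as 'f:0.2' or 'u:0;0.1'."""
    kind, _, params = spec.partition(":")
    values = [float(x) for x in params.split(";")]
    if kind == "f":  # fixed value
        return values[0]
    if kind == "u":  # uniform between the two bounds
        low, high = values
        return rng.uniform(low, high)
    raise ValueError(f"unsupported spec: {spec}")


rng = random.Random(0)
spec = {"DUPLICATION": "f:0.2", "TRANSFER": "f:0.2", "LOSS": "u:0;0.1"}
# One draw per gene family: D and T are constant, L varies between families.
family = {event: sample_rate(s, rng) for event, s in spec.items()}
print(family)
```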

The second way is to provide Zombi with a file such as this one.

You need a header and 4 columns containing the gene family name (not used), and the Duplication, Loss and Transfer rates. Every time a new family is added to the Genome (including the gene families in the Initial Genome), a row is sampled from this file to determine the rates of that family.
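The row sampling can be sketched as follows. The file content, the tab separator and the uniform choice of a row are assumptions made for the illustration, and the column order follows the description above; check it against your own file.

```python
import csv
import io
import random

# Hypothetical rate file: header plus name, duplication, loss, transfer.
RATE_FILE_TEXT = (
    "family\tduplication\tloss\ttransfer\n"
    "fam1\t0.10\t0.30\t0.05\n"
    "fam2\t0.20\t0.25\t0.00\n"
)


def read_rates(handle):
    """Return one (duplication, loss, transfer) tuple per data row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header
    # The family name in column 0 is ignored, since it is not used.
    return [tuple(float(x) for x in row[1:4]) for row in reader if row]


rows = read_rates(io.StringIO(RATE_FILE_TEXT))
rng = random.Random(3)
# Each new family draws one row at random (uniformly, in this sketch).
duplication, loss, transfer = rng.choice(rows)
print(duplication, loss, transfer)
```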

Follow Tutorial 2 if you want to understand how to use user-given rates.

Output

Family_rates.tsv

In the G folder, this file contains 4 columns: the name of the gene family, the duplication rate, the transfer rate and the loss rate. All of those rates are gene-family rates.

Parameters

RATE_FILE

False by default. Change it to the complete path to the file containing the gene family rates.

SCALE_RATES

False by default. If you are using rates estimated with ALE undated, change this to True. This corrects the estimated rates to fit the time units of your tree: each rate estimated by ALE is simply divided by the crown length of your tree.
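The correction amounts to a single division per rate. A minimal sketch, with a hypothetical helper `scale_rates` and made-up values:

```python
def scale_rates(rates, crown_length):
    """Divide each ALE-estimated rate by the crown length of the tree."""
    if crown_length <= 0:
        raise ValueError("crown length must be positive")
    return {event: rate / crown_length for event, rate in rates.items()}


# Hypothetical ALE undated estimates and a tree with a crown length of 2 time units.
ale_rates = {"D": 0.5, "T": 1.0, "L": 1.5}
scaled = scale_rates(ale_rates, crown_length=2.0)
print(scaled)
```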