
Tutorial #2 Simulating realistic families

Adrián A. Davín edited this page Nov 5, 2020 · 2 revisions

A frequent question when using Zombi is: what rates should I use? Zombi is a parameter-rich model capable of simulating very diverse processes.

We found in this study that rejection sampling is the best approach to obtain realistic families. The steps are:

  1. Simulate gene families in Zombi. In the previously cited paper we used the Gm model because it is more closely comparable to what we can infer using ALEml undated, but in principle this strategy should also work with genome rates of evolution. We use the parameters found here. This pipeline requires many simulated families, so we recommend using a very high origination rate or running Zombi multiple times.

  2. Use ALEml undated (or a similar reconciliation algorithm) to infer the number of events in the dataset that you want to replicate in the simulation, or use the file that we provide in this folder. It contains the events found in ~11000 gene families belonging to Bacteria. We will call these the empirical families.

  3. In the simulated dataset, compute the reconciliations between your pruned trees and the extant species tree.

  4. Then, for every empirical family, compute the distance to every simulated family and pick the one with the smallest distance. Set the selected family apart so that it cannot be picked again, and repeat until there is one pick for every empirical family. To compute the distance you can use the Euclidean distance between the vectors that hold the number of events plus the size of the family.
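
Step 4 can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the study: the function names `euclidean` and `match_families`, the family identifiers, and the vector layout (duplications, transfers, losses, family size) are all assumptions made for the example. Adapt them to whatever events your reconciliation output reports.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_families(empirical, simulated):
    """Greedy matching without replacement (hypothetical helper).

    empirical, simulated: dicts mapping a family id to a vector of
    event counts plus family size, e.g. (dups, transfers, losses, size).
    Returns a dict: empirical id -> (simulated id, distance).
    """
    available = dict(simulated)  # copy, so the caller's dict is untouched
    matches = {}
    for emp_id, emp_vec in empirical.items():
        if not available:
            raise ValueError("fewer simulated than empirical families")
        # Closest simulated family that has not been picked yet.
        best_id = min(available, key=lambda s: euclidean(emp_vec, available[s]))
        matches[emp_id] = (best_id, euclidean(emp_vec, available[best_id]))
        # Set the selected family apart so it is not picked twice.
        del available[best_id]
    return matches


# Toy data, invented for illustration only.
empirical = {"emp1": (1, 2, 3, 10), "emp2": (0, 0, 1, 5)}
simulated = {
    "simA": (1, 2, 2, 10),
    "simB": (0, 1, 1, 5),
    "simC": (5, 5, 5, 30),
}
matches = match_families(empirical, simulated)
for emp_id, (sim_id, dist) in matches.items():
    print(emp_id, sim_id, round(dist, 3))
```

Because each pick removes a simulated family from the pool, the result depends on the order in which the empirical families are visited, and the pool must contain at least as many simulated families as there are empirical ones. With ~11000 empirical families this plain double loop is slow; a vectorised distance computation would be a natural next step.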

In our analysis, we recover families that seem to match the number of events found in the real datasets rather well: