Sampled nodes as unary nodes in a simplified tree sequence #2144
-
Hello, I have a slendr-related question which boils down to my insufficient understanding of how simplification works in tskit, in this case concerning simplification of tree sequences produced by SLiM. My code is written in R but given that it simply calls tskit Python methods under the hood via reticulate I hope it's OK to ask here. Background: Unfortunately, I now ran into issues with tree sequences which include "ancient DNA" samples, i.e. individuals who are remembered (in SLiM speak) throughout the course simulation (not just at the end). For some reason, my code for tskit Tree --> R phylo conversion started crashing on such tree sequences. Now, I'm not entirely sure yet why it's crashing but during the debugging process I discovered something that even if it's not causing my problems directly, it definitely means I misunderstood something about the simplification process which I should understand before I proceed any further. Example: I have a tree sequence from a slendr/SLiM simulation. The simulation samples individuals throughout the course of the run (i.e. "remembers" them, to use the SLiM vocabulary). I don't think the details of the simulation are very important so I will skip the slendr R code that executes it. One thing that I should mention is that the SLiM simulation is being run with I load the tree sequence and simplify it to a couple of individuals ("ancient" as well as "present-day"). This is the slendr code that does that: ts <- ts_load(model) %>% ts_simplify(c("EUR_364", "EUR_400", "EUR_304", "EUR_315")) This probably looks alien to all of you, but internally this is translated to something like this (canibalizing incomplete snippets from across the slendr guts, just to demonstrate the basic idea): ts_tskit <- tskit$load(<path to a .trees file generated by `model`>)
ts_pyslim <- pyslim$SlimTreeSequence(ts_tskit)
[...]
ts_simplified <- ts_pyslim$simplify(
as.integer(<list of numeric IDs belonging to the named individuals>),
filter_populations = FALSE,
keep_input_roots = FALSE
) For those not familiar with R reticulate, those When I then plot a tree from this tree sequence, I get this (the code behind this again just translates to The "named" nodes correspond to individuals "sampled" by slendr (i.e. "remembered" in the SLiM tree sequence). My question: why is the node 0 (one chromosome of the individual "EUR_304") represented by an unary node along a branch? If 0 is the ancestor of nodes 4 (one chromosome of individual "EUR_364") and 7 (one chromosome of individual "EUR_400"), shouldn't it be in place of the node 8, basically "replacing" it during the simplification (given that 0 is the ancestor of 4 and 7)? This is assuming that simplification should remove all nodes that are not true coalescent nodes relevant for reconstructing the genealogical history of a given set of nodes (most often "sample nodes"). Does what I'm saying make any sense? If it does, am I missing something in the tskit/pyslim interface related to simplification that I can do to make the tree a phylogenetic tree where non-leaf nodes have two children (i.e. "proper" coalescent nodes)? I understand that in certain sense a "sampled" node could be considered a slightly different thing than a "anonymous coalescent node" but still, is there a way to force the simplification to "collapse" the branch I hope this is not too confusing. I have just emerged from two days spent chasing false leads in the Thank you for any pointers! I will again tag my slendr collaborators @petrelharp and @bhaller because they are probably the only two people here familiar a bit with what I'm doing. I think @mkravn will also find this interesting given the spatial popgen work she's been doing in our group. |
Beta Was this translation helpful? Give feedback.
Replies: 3 comments 10 replies
-
I do not know, but I'll be interested to hear what @petrelharp says. :-> |
Beta Was this translation helpful? Give feedback.
-
Node EUR_304 is marked as a "sample" node @bodkan since you provided it as an argument to simplify, which means that it will always be present in all trees. Samples don't have to be leaf nodes, they can be internal too. I'm not sure what can be done here if phylogenetic trees don't support internal samples. |
Beta Was this translation helpful? Give feedback.
-
Nothing to apologise for @bodkan, simplify is a tricky beast! I think your work around for unary nodes in phylo objects sounds good. I guess the only other option would be to swap the dummy node and 0, so that 0 is still the parent of 8 and you just have a dangling leaf going to nowhere? Then at least (in principle) you can recover the original tree sequence topology, and if you simplify that then dummy node will disappear. Are phylo objects OK with internal samples, other than requiring that they have more than one child, or do all samples have to be leaves (which would seem odd since there are lots of time-sampled HIV trees, etc)? |
Beta Was this translation helpful? Give feedback.
Node EUR_304 is marked as a "sample" node @bodkan since you provided it as an argument to simplify, which means that it will always be present in all trees. Samples don't have to be leaf nodes, they can be internal too.
I'm not sure what can be done here if phylogenetic trees don't support internal samples.