Sampled nodes as unary nodes in a simplified tree sequence #2144

bodkan · 2022-03-04T16:10:32Z

bodkan
Mar 4, 2022

Hello, I have a slendr-related question which boils down to my insufficient understanding of how simplification works in tskit, in this case concerning simplification of tree sequences produced by SLiM. My code is written in R but given that it simply calls tskit Python methods under the hood via reticulate I hope it's OK to ask here.

Background:
Recently I have implemented some code for slendr which can extract a given tree from a tskit/pyslim tree sequence and transform it into a standard R phylo format used across the R phylogenetics ecosystem (ape, phangorn, ggtree, etc.). It seems to be working for the most part and a student of ours has already been doing awesome stuff with spatial phylogenetic trees, using all those tools for spatial and phylogenetic analysis in R. So far so good.

Unfortunately, I have now run into issues with tree sequences which include "ancient DNA" samples, i.e. individuals who are remembered (in SLiM speak) throughout the course of the simulation (not just at the end). For some reason, my code for tskit Tree --> R phylo conversion started crashing on such tree sequences.

Now, I'm not entirely sure yet why it's crashing, but during the debugging process I discovered something that, even if it's not causing my problems directly, definitely means I misunderstood something about the simplification process, which I should understand before I proceed any further.

Example:

I have a tree sequence from a slendr/SLiM simulation. The simulation samples individuals throughout the course of the run (i.e. "remembers" them, to use the SLiM vocabulary). I don't think the details of the simulation are very important so I will skip the slendr R code that executes it. One thing that I should mention is that the SLiM simulation is being run with initializeTreeSeq(retainCoalescentOnly = T) and individuals are "sampled" with sim.treeSeqRememberIndividuals(..., permanent = T) (both of which are the default, I believe).

I load the tree sequence and simplify it to a couple of individuals ("ancient" as well as "present-day"). This is the slendr code that does that:

ts <- ts_load(model) %>% ts_simplify(c("EUR_364", "EUR_400", "EUR_304", "EUR_315"))

This probably looks alien to all of you, but internally this is translated to something like this (cannibalizing incomplete snippets from across the slendr guts, just to demonstrate the basic idea):

ts_tskit <- tskit$load(<path to a .trees file generated by `model`>)

ts_pyslim <- pyslim$SlimTreeSequence(ts_tskit)

[...]

ts_simplified <- ts_pyslim$simplify(
  as.integer(<list of numeric IDs belonging to the named individuals>),
  filter_populations = FALSE,
  keep_input_roots = FALSE
)

For those not familiar with R reticulate, those $ basically translate to the standard . OOP notation in Python. That code chunk above is literally executed in Python, through the reticulate interface. What I mean by this is that I don't think there's anything that I'm doing here which wouldn't translate 1:1 to Python.
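For readers who think in Python rather than R, the reticulate call above would look roughly like the following. This is only a sketch, not slendr's actual code: `simplify_to_nodes` is a hypothetical helper name, `ts` stands for any already-loaded tree sequence object, and the keyword arguments are copied from the R snippet.

```python
def simplify_to_nodes(ts, node_ids):
    """Python spelling of the reticulate call `ts_pyslim$simplify(...)`.

    `ts` is assumed to be an already-loaded tree sequence (anything with a
    tskit-style `simplify` method); `node_ids` are the node IDs to keep as
    samples. The helper name is made up for this illustration.
    """
    # R's as.integer(...) becomes a plain list of Python ints.
    samples = [int(n) for n in node_ids]
    # R's `$` is Python's `.`, and FALSE becomes False.
    return ts.simplify(
        samples,
        filter_populations=False,
        keep_input_roots=False,
    )
```

Every node ID passed in this way ends up flagged as a sample in the simplified output, which turns out to matter for the question below.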

When I then plot a tree from this tree sequence, I get this (the code behind this again just translates to ts.at_index and some SVG drawing):

The "named" nodes correspond to individuals "sampled" by slendr (i.e. "remembered" in the SLiM tree sequence).

My question: why is node 0 (one chromosome of the individual "EUR_304") represented by a unary node along a branch? If 0 is the ancestor of nodes 4 (one chromosome of individual "EUR_364") and 7 (one chromosome of individual "EUR_400"), shouldn't it be in place of node 8, basically "replacing" it during the simplification (given that 0 is the ancestor of 4 and 7)? This is assuming that simplification should remove all nodes that are not true coalescent nodes relevant for reconstructing the genealogical history of a given set of nodes (most often "sample nodes").

Does what I'm saying make any sense?

If it does, am I missing something in the tskit/pyslim interface related to simplification that I can do to make the tree a phylogenetic tree where non-leaf nodes have two children (i.e. "proper" coalescent nodes)? I understand that in a certain sense a "sampled" node could be considered a slightly different thing than an "anonymous coalescent node", but still, is there a way to force the simplification to "collapse" the branch 0-8 to just 0? My thinking was that the only difference between the nodes 0 and 8 is that 0 carries a flag tskit.NODE_IS_SAMPLE?

I hope this is not too confusing. I have just emerged from two days spent chasing false leads in the tskit.Tree --> R phylo conversion and I needed to write this down quickly while my brain has any energy left. :) Happy to elaborate further.

Thank you for any pointers!

I will again tag my slendr collaborators @petrelharp and @bhaller because they are probably the only two people here who are somewhat familiar with what I'm doing. I think @mkravn will also find this interesting given the spatial popgen work she's been doing in our group.

bhaller · 2022-03-04T16:19:44Z

bhaller
Mar 4, 2022
Collaborator

I do not know, but I'll be interested to hear what @petrelharp says. :->

jeromekelleher · 2022-03-04T17:32:06Z

jeromekelleher
Mar 4, 2022
Maintainer

Node EUR_304 is marked as a "sample" node @bodkan since you provided it as an argument to simplify, which means that it will always be present in all trees. Samples don't have to be leaf nodes, they can be internal too.

I'm not sure what can be done here if phylogenetic trees don't support internal samples.
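To make the rule above concrete, here is a toy sketch in plain Python of simplifying a single tree. It is not tskit's algorithm and does not use the tskit API (the real `simplify` works across the whole tree sequence and renumbers nodes); it only mimics the retention rule: a node survives if it is a sample or if at least two sample-bearing lineages meet in it. The arrangement of nodes 0, 8, 4 and 7 follows the description in the question; the root 9, the extra sample 5 and the intermediate node 3 are invented for illustration.

```python
def simplify_tree(parent, samples):
    """Toy simplification of one tree given as a child -> parent dict."""
    # 1. Keep only nodes lying on a path from some sample up to the root.
    ancestral = set()
    for s in samples:
        u = s
        while u is not None and u not in ancestral:
            ancestral.add(u)
            u = parent.get(u)

    # 2. Count each node's children within that pruned tree.
    n_children = {u: 0 for u in ancestral}
    for u in ancestral:
        p = parent.get(u)
        if p is not None:
            n_children[p] += 1

    # 3. Retain samples (always) and true coalescent nodes.
    kept = {u for u in ancestral if u in samples or n_children[u] >= 2}

    # 4. Re-attach every retained node to its nearest retained ancestor.
    new_parent = {}
    for u in kept:
        p = parent.get(u)
        while p is not None and p not in kept:
            p = parent.get(p)
        if p is not None:
            new_parent[u] = p
    return new_parent


# 9 is the root; 0 sits above 3, which sits above the coalescent node 8.
tree = {4: 8, 7: 8, 8: 3, 3: 0, 0: 9, 5: 9}

with_0 = simplify_tree(tree, {0, 4, 5, 7})     # node 0 requested as a sample
without_0 = simplify_tree(tree, {4, 5, 7})     # node 0 not requested
```

In this toy model, `with_0` keeps node 0 as a unary node between 9 and 8, while `without_0` drops it and attaches 8 directly to 9. The non-sample unary node 3 disappears in both cases.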

bodkan Mar 7, 2022
Author

I don't think I'm ready to ditch ape/phangorn just yet. :) I think they are the main reason for using R for phylogenetics, with entire books and courses around them (also, ggtree just rules, although as a ggplot2 addict I'm clearly biased :)). Even apart from that, despite this issue, we are already using some handy functionality for traversing and plotting of tskit --> R phylo converted trees in R. That is, in scenarios in which the "dense" historical sampling described in this thread isn't a problem (the majority of situations, I'd say?).

I think doing the above is a reasonable compromise, at least for the time being. It's easy to track the "dummy" nodes and write an informative warning so that nothing is hidden from the user (in those rare cases where this happens).

petrelharp Mar 7, 2022
Maintainer

Understandable - I'm just saying. =)

jeromekelleher Mar 7, 2022
Maintainer

Totally shouldn't ditch ape/phangorn!

bodkan Mar 7, 2022
Author

Understandable - I'm just saying. =)

I know! I have visited some dark places when I was debugging this last week... :)

Honestly not sure how practically useful this is going to be, but at least plotting those trees (either the usual way or in space on a map) gives pretty sweet figures so it's worth the effort even just for that.

Totally shouldn't ditch ape/phangorn!

We will get there! It might not be perfect at first, but we will get there. 💪

jeromekelleher Mar 7, 2022
Maintainer

It's an essential part of "tskitR" (just sayin!)

jeromekelleher · 2022-03-07T10:16:50Z

jeromekelleher
Mar 7, 2022
Maintainer

Oh, right. Of course. I didn't realize the fundamental difference between the node 0 (never a coalescent node, regardless of how we simplify) and node 8 (true coalescent node) in my first figure. Sorry about that.

Nothing to apologise for @bodkan, simplify is a tricky beast!

I think your workaround for unary nodes in phylo objects sounds good. I guess the only other option would be to swap the dummy node and 0, so that 0 is still the parent of 8 and you just have a dangling leaf going to nowhere? Then at least (in principle) you can recover the original tree sequence topology, and if you simplify that then the dummy node will disappear.
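The "dangling leaf" option could be sketched along these lines. This is a plain-Python illustration on a child -> parent dict, not slendr or tskit code: `add_dummy_leaves` is a hypothetical helper, and the example tree reuses the node IDs mentioned in this thread plus an invented root 9 and extra sample 5.

```python
def add_dummy_leaves(parent):
    """Attach one extra 'dummy' leaf to every node that has a single child.

    Returns the padded child -> parent dict and a dict mapping each dummy
    leaf to the unary node it was attached to, so dummies can be tracked
    (for example to warn the user) and removed again later.
    """
    # Collect the children of every internal node.
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    new_parent = dict(parent)
    dummies = {}
    # Dummy IDs start above every existing node ID to avoid collisions.
    next_id = max(set(parent) | set(parent.values())) + 1
    for node in sorted(children):
        if len(children[node]) == 1:   # unary node, e.g. an internal sample
            new_parent[next_id] = node
            dummies[next_id] = node
            next_id += 1
    return new_parent, dummies


# Simplified tree in which sample node 0 is unary (its only child is 8).
tree = {4: 8, 7: 8, 8: 0, 0: 9, 5: 9}
padded, dummies = add_dummy_leaves(tree)

# Dropping the dummy leaves recovers the original topology.
restored = {c: p for c, p in padded.items() if c not in dummies}
```

Here node 0 stays the parent of 8 and gains one dummy leaf, so every internal node has at least two children. Giving the dummy a zero-length branch would be one natural choice in a time-scaled tree, though that is an assumption rather than something settled in this thread.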

Are phylo objects OK with internal samples, other than requiring that they have more than one child, or do all samples have to be leaves (which would seem odd since there are lots of time-sampled HIV trees, etc)?
