Ripples server output filtering and mutation columns #436
Merged
Add filtering to post_filtration.cpp: merge rows having the same node IDs with breakpoints that overlap completely, i.e. one is fully contained in the other, and limit to top 3 by parsimony improvement per recomb_id. Add columns with mutations found at each node to final_recombination.tsv.
This PR adds filtering to src/ripples/post_filtration/post_filtration.cpp that greatly reduces the number of rows in final_recombination.tsv while keeping the best results, i.e. the ones with the greatest parsimony improvements for each potential recombinant node. This PR also adds three columns to final_recombination.tsv: the mutations found at each node (recombinant, donor and acceptor), which will enable usher.bio to draw RIVET-like diagrams showing which recombinant mutations come from donor vs. acceptor.
The new filtering logic merges rows A and B having the same {recomb_id, donor_id, acceptor_id} if both bp1 and bp2 meet the condition that either A's range is completely contained in B's range or vice versa. For example, (0, 241) is completely contained in (0, 670) but (0, 670) is not completely contained in (241, 2090). The merged row has the larger bp1 and the larger bp2.
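To illustrate the containment test described above, here is a minimal sketch. It is not the code in post_filtration.cpp: the `Interval` and `Row` structs and the `contains`, `nested` and `mergeable` helpers are hypothetical names, and it assumes each breakpoint is an inclusive (start, end) integer range. Construction of the merged row is not shown.

```cpp
#include <cassert>
#include <string>

// Hypothetical representation of one breakpoint range (assumed inclusive).
struct Interval {
    int start;
    int end;
};

// Hypothetical row; only the fields needed for the merge test are included.
struct Row {
    std::string recomb_id;
    std::string donor_id;
    std::string acceptor_id;
    Interval bp1;
    Interval bp2;
};

// True if `inner` lies entirely within `outer`.
bool contains(const Interval& outer, const Interval& inner) {
    assert(outer.start <= outer.end && inner.start <= inner.end);
    return outer.start <= inner.start && inner.end <= outer.end;
}

// True if one range is completely contained in the other, in either direction.
bool nested(const Interval& a, const Interval& b) {
    return contains(a, b) || contains(b, a);
}

// Rows are merge candidates when the node IDs match and both bp1 and bp2
// satisfy the containment condition.
bool mergeable(const Row& a, const Row& b) {
    return a.recomb_id == b.recomb_id &&
           a.donor_id == b.donor_id &&
           a.acceptor_id == b.acceptor_id &&
           nested(a.bp1, b.bp1) &&
           nested(a.bp2, b.bp2);
}
```

With the ranges from the example, `nested({0, 241}, {0, 670})` holds, while `nested({0, 670}, {241, 2090})` does not, because neither range contains the other.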
After merging, the results are sorted, highest parsimony improvement first. Then, the top three results by parsimony improvement for each recomb_id are retained (including ties for third place) and lower-scoring results are discarded. It would be straightforward to add a parameter to the server, passed down to post_filtration, to retain the top N instead of top 3 (hardcoded). I picked 3 to avoid swamping the users of usher.bio with too many scenarios to evaluate.
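The top-N step with ties could look roughly like the sketch below. Again, this is an illustration rather than the code in this PR: the `Result` struct and the `keep_top_n` function are hypothetical, the score is assumed to be an integer parsimony improvement, and `n` stands in for the currently hardcoded 3.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result row; only the fields needed for ranking are included.
struct Result {
    std::string recomb_id;
    int parsimony_improvement;
};

// Sort by parsimony improvement (highest first), then keep the top n rows
// per recomb_id, plus any rows tied with the n-th kept score.
std::vector<Result> keep_top_n(std::vector<Result> rows, std::size_t n) {
    std::stable_sort(rows.begin(), rows.end(),
                     [](const Result& a, const Result& b) {
                         return a.parsimony_improvement > b.parsimony_improvement;
                     });

    struct State {
        std::size_t kept = 0;  // rows kept so far for this recomb_id
        int last_score = 0;    // score of the most recently kept row
    };
    std::unordered_map<std::string, State> seen;

    std::vector<Result> out;
    for (const Result& r : rows) {
        State& s = seen[r.recomb_id];
        const bool under_limit = s.kept < n;
        // Rows arrive in descending score order, so once the limit is reached
        // a row can only match last_score if it ties with the n-th kept row.
        const bool tied_with_last =
            s.kept > 0 && r.parsimony_improvement == s.last_score;
        if (under_limit || tied_with_last) {
            out.push_back(r);
            ++s.kept;
            s.last_score = r.parsimony_improvement;
        }
    }
    return out;
}
```

Because the rows are sorted once globally, the output stays in descending score order across all recomb_ids, and making `n` a parameter matches the suggestion above of passing a top-N value down from the server.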