C and Python API for two-way two-locus stats #3243
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff            @@
##             main    #3243      +/-  ##
=========================================
- Coverage   89.61%   86.18%    -3.43%
=========================================
  Files          28       10       -18
  Lines       31983    17142   -14841
  Branches     5888     3311    -2577
=========================================
- Hits        28660    14774   -13886
+ Misses       1888     1323     -565
+ Partials     1435     1045     -390
```

Flags with carried forward coverage won't be shown.
Hi @lkirk - thanks for starting this PR. I have a few suggestions for the multi-pop stat calculations and the normalization here.
python/tests/test_ld_matrix.py (Outdated)
```python
nj = sample_set_sizes[j]
wAB_i = hap_weights[0, i]
wAB_j = hap_weights[0, j]
result[k] = (wAB_i / ni / 2) + (wAB_j / nj / 2)
```
Based on our discussion earlier this week, I think it's more natural to norm by the total weights and sample sizes, rather than the average (so: `(wAB_i + wAB_j) / (ni + nj)`).
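For concreteness, here's a minimal sketch contrasting the two normalizations (the values are made up, and the names mirror the test code rather than any tskit internals):

```python
# Hypothetical weights and sample sizes, for illustration only.
wAB_i, wAB_j = 12.0, 30.0  # AB haplotype weights in sample sets i and j
ni, nj = 40, 120           # sample set sizes

# Current approach: average of the per-sample-set frequencies.
averaged = (wAB_i / ni / 2) + (wAB_j / nj / 2)  # 0.275

# Suggested approach: total weight over total sample size.
pooled = (wAB_i + wAB_j) / (ni + nj)            # 0.2625

# The two agree when ni == nj, and diverge as the sizes become unequal.
```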
I remember our conversation about this; I've incorporated the change.
c/tskit/trees.c (Outdated)
```c
D_j = p_AB - (p_A * p_B);
denom_j = sqrt(p_A * p_B * (1 - p_A) * (1 - p_B));

result[k] = (D_i * D_j) / (denom_i * denom_j);
```
This will spuriously produce a NaN if the allele is not variable in one of the two sample sets. From our discussion earlier this week, I think we can calculate `result = (D_i * D_j) / denom`, where `denom = pA * (1 - pA) * pB * (1 - pB)`, `pA` is `(pA_i + pA_j) / (n_i + n_j)`, and likewise for `pB`.
I'm not sure we can combine the allele frequencies this way. Consider this example:

```python
>>> Ai = 0.3; Bi = 0.5; Aj = 0.7; Bj = 0.2; ni = 37; nj = 39
>>> A = (Ai + Aj) / (ni + nj)
>>> B = (Bi + Bj) / (ni + nj)
>>> A * (1-A) * B * (1-B)
0.00011849496867350617
>>> (Ai*(1-Ai)*Bi*(1-Bi)) * (Aj*(1-Aj)*Bj*(1-Bj))
0.0017640000000000006
```
Sorry, I miswrote that. I meant `pA` is `(nA_i + nA_j) / (n_i + n_j)`.
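If I'm following correctly, the corrected proposal looks something like this sketch (hypothetical counts; `nA_*` and `nB_*` are allele counts, not frequencies, and `D_i`/`D_j` are placeholders for the per-sample-set covariances):

```python
# Hypothetical allele counts and sample sizes for sample sets i and j.
nA_i, nB_i, n_i = 15, 20, 40
nA_j, nB_j, n_j = 90, 50, 120

# Pooled allele frequencies across both sample sets.
pA = (nA_i + nA_j) / (n_i + n_j)  # 0.65625
pB = (nB_i + nB_j) / (n_i + n_j)  # 0.4375

# Single pooled denominator, replacing denom_i * denom_j.
denom = pA * (1 - pA) * pB * (1 - pB)

# Placeholder D values, just to show the final division.
D_i, D_j = 0.05, -0.02
result = (D_i * D_j) / denom
```

Since `denom` is only zero when an allele is fixed across both sample sets combined, fixation in just one set no longer produces a spurious NaN.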
Okay, I added this change. I also added some test coverage for unequal sample set sizes, so that we can see how this change affects the resulting stats -- the difference can be substantial in some cases.
python/tests/test_ld_matrix.py (Outdated)
```diff
-p_aB = state[2, k] / n
+p_AB = state[0, i] / n
+p_Ab = state[1, i] / n
+p_aB = state[2, i] / n
```
This fixes an error in the original implementation. The previous tests were rather simple and did not catch the issue.
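To illustrate the indexing fix with a toy example (this is not the test code itself; the array shape is assumed from the snippet above, with rows holding AB/Ab/aB counts and one column per sample set):

```python
import numpy as np

# Toy haplotype-count state: rows are (AB, Ab, aB), columns are sample sets.
state = np.array([[10, 4],
                  [ 5, 8],
                  [ 3, 6]])
n = 20

# All three frequencies for sample set i must come from column i;
# indexing any of them by another loop variable (k) silently mixes
# counts from a different sample set into the calculation.
i = 0
p_AB = state[0, i] / n
p_Ab = state[1, i] / n
p_aB = state[2, i] / n
```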
I've gone through the code carefully, and implementation-wise it looks very solid.

It's hard to be sure about the low-level stuff in an automatic way; the best approach I've found for checking it is to run the method in a loop and watch `top`. Can you check this please and report back @lkirk? Just run a couple of very basic tests in a loop, like

```python
for _ in range(100000):
    ts.ld_matrix(...)
```

in a few different ways, so that you go through the various nominal code paths, and check that the memory usage in `top` equilibrates to some reasonable value after a few seconds. It's the best way I've found of making sure refcounting is working properly.

I haven't thought about the actual stats in any way - deferring this to @apragsdale and @petrelharp!
c/tests/test_stats.c (Outdated)
```c
tsk_safe_free(row_positions);
tsk_safe_free(col_positions);
tsk_safe_free(positions);
tsk_safe_free(positions);
```
Double free here
I'm surprised valgrind didn't catch this; fixed.
I think it's safe given how tsk_safe_free works, but there's no point in having it in there.
python/_tskitmodule.c (Outdated)
```c
static char *kwlist[] = { "sample_set_sizes", "sample_sets", "indexes", "row_sites",
    "col_sites", "row_positions", "column_positions", "mode", NULL };

PyObject *row_sites = NULL, *col_sites = NULL, *row_positions = NULL,
```
Can you declare these variables separately, one per line, please? Like:

```c
PyObject *row_sites = NULL;
PyObject *col_sites = NULL;
// etc
```

It's how it's done in the rest of the code, and it makes mass search-and-replace easier later on.
It's a way of saying "watch out for this variable, it has tricky state" vs just standard declarations.
I fixed this in the ld_matrix method above as well; it didn't follow that style either.
I wrote a small script to assess a few cases:

```python
#!/usr/bin/env python
from argparse import ArgumentParser
from itertools import combinations_with_replacement

from tests.tsutil import get_example_tree_sequences

tss = {t.id: t.values[0] for t in get_example_tree_sequences() if t.id}
ts = tss["n=100_m=32_rho=0.5"]
sample_sets = [ts.samples()] * 3
indexes = list(combinations_with_replacement(range(len(sample_sets)), 2))

parser = ArgumentParser()
parser.add_argument("test_case", type=int, choices=[0, 1, 2])
test_case = parser.parse_args().test_case

base_args = dict(sample_sets=sample_sets, indexes=indexes)
match test_case:
    case 0:
        ld_args = dict(**base_args)
    case 1:
        ld_args = dict(sites=[list(range(56)), list(range(30, 108))], **base_args)
    case 2:
        ld_args = dict(
            positions=[list(range(18)), list(range(5, 31))], mode="branch", **base_args
        )

print(f"running test case: {test_case} with args:\n{ld_args}")
for i in range(10000):
    ts.ld_matrix(**ld_args)
    if not i % 1000:
        print(i, end=" ")
print()
```

then, in a loop:
Case 0 result: [screenshot]
Case 1 result: [screenshot]
Case 2 result: [screenshot]

And then, just to make sure I would have seen a leak if there was one, I commented out a `Py_XDECREF`:

```diff
--- a/python/_tskitmodule.c
+++ b/python/_tskitmodule.c
@@ -10780,7 +10780,7 @@ TreeSequence_k_way_ld_matrix(TreeSequence *self, PyObject *args, PyObject *kwds,
     ret = (PyObject *) result_matrix;
     result_matrix = NULL;
 out:
-    Py_XDECREF(row_sites_array);
+    // Py_XDECREF(row_sites_array);
     Py_XDECREF(col_sites_array);
     Py_XDECREF(row_positions_array);
     Py_XDECREF(col_positions_array);
```

and the memray result from this test was: [screenshot]

In any case, I'm pretty happy with this. It's still rather unsatisfying to not have a tool that can check references. I do have some sort of prototype using...
Force-pushed from c7f33ea to c5bdbc7.
Rebased and squashed.
I'm also happy - thanks! It's totally unsatisfying doing this manually, but it's one of those things that there just doesn't seem to be a good solution for, and a bit of manual eyeball work is the most efficient approach.
Thanks @lkirk! Looks good to me.
Jotting down a note about r^2 calculations here: Lloyd and I have been discussing offline the behavior of the two-population r^2 function. If one of the two loci is fixed for a single allele across all samples in both sample sets, the returned value should be nan, mirroring the behavior of r or r^2 within a single sample set with a fixed allele at a locus. However, if an allele is fixed in only one of the two sample sets, it's not clear that we would want to return nan, as there is variation globally at both loci. Instead, we would check whether either D_1 or D_2 is zero, and return zero before computing the denominator. This would avoid nans, which would otherwise be common, since there will typically be variation that is private to one or the other sample set.
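A small sketch of that decision logic, as I understand it (hypothetical function and signature; the real implementation lives in the C library):

```python
import math

def two_pop_r2_term(D_1, D_2, pA, pB):
    # pA, pB: allele frequencies pooled across both sample sets.
    if pA in (0.0, 1.0) or pB in (0.0, 1.0):
        # A locus is fixed across all samples in both sample sets:
        # undefined, mirroring the one-population behavior.
        return math.nan
    if D_1 == 0.0 or D_2 == 0.0:
        # Variation exists globally, but one sample set carries no LD
        # signal (e.g. an allele fixed in that set alone): return zero
        # before touching the denominator, avoiding spurious nans.
        return 0.0
    return (D_1 * D_2) / (pA * (1 - pA) * pB * (1 - pB))
```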
I think we're just waiting for your thumbs-up on the stats here @petrelharp - I'm happy with the implementation.
This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the Python version, except that during testing I uncovered a small issue with normalisation: we need to handle the case where sample sets are of different sizes. The fix for this was to average the normalisation factor for each sample set.
Test coverage has been added at the C level, the low-level Python level, and in some high-level tests.