C and Python API for two-way two-locus stats #3243

Open · lkirk wants to merge 1 commit into main from two-locus-multipopulation
Conversation

@lkirk lkirk commented Jul 25, 2025

This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the Python version, except that during testing I uncovered a small issue with normalisation: we need to handle the case where sample sets are of different sizes. The fix was to average the normalisation factor for each sample set.

Test coverage has been added to cover C, low-level python and some high-level tests.


codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.18%. Comparing base (c23aafc) to head (c5bdbc7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3243      +/-   ##
==========================================
- Coverage   89.61%   86.18%   -3.43%     
==========================================
  Files          28       10      -18     
  Lines       31983    17142   -14841     
  Branches     5888     3311    -2577     
==========================================
- Hits        28660    14774   -13886     
+ Misses       1888     1323     -565     
+ Partials     1435     1045     -390     
Flag                 Coverage Δ
c-tests              86.70% <100.00%> (+0.10%) ⬆️
lwt-tests            80.38% <ø> (ø)
python-c-tests       ?
python-tests         ?
python-tests-numpy1  ?

Flags with carried forward coverage won't be shown.

Files with missing lines  Coverage Δ
c/tskit/trees.c           90.80% <100.00%> (+0.19%) ⬆️

... and 18 files with indirect coverage changes


@apragsdale apragsdale left a comment


Hi @lkirk - thanks for starting this PR. I have a few suggestions for the multi-pop stat calculations and the normalization here.

nj = sample_set_sizes[j]
wAB_i = hap_weights[0, i]
wAB_j = hap_weights[0, j]
result[k] = (wAB_i / ni / 2) + (wAB_j / nj / 2)


Based on our discussion earlier this week, I think it's more natural to norm by the total weights and sample sizes, rather than the average. (So: (wAB_i + wAB_j) / (ni + nj).)
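To make the difference concrete, here is a small sketch (illustrative numbers only; the `wAB`/`n` names mirror the snippet above) comparing the averaged normalisation in the current diff with the pooled normalisation suggested here:

```python
# Hypothetical haplotype weights and sample-set sizes, for illustration only.
wAB_i, wAB_j = 12.0, 30.0
ni, nj = 40, 120

# Averaged per-set normalisation (as in the diff above):
averaged = (wAB_i / ni / 2) + (wAB_j / nj / 2)

# Pooled normalisation (as suggested): total weight over total sample size.
pooled = (wAB_i + wAB_j) / (ni + nj)

# The two agree when ni == nj but diverge as the sizes differ.
```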

@lkirk (Contributor Author)

I remember our conversation about this, I've incorporated the change.

c/tskit/trees.c Outdated
D_j = p_AB - (p_A * p_B);
denom_j = sqrt(p_A * p_B * (1 - p_A) * (1 - p_B));

result[k] = (D_i * D_j) / (denom_i * denom_j);


This will spuriously produce a NaN if the allele is not variable in one of the two sample sets. From our discussion earlier this week, I think we can calculate result = (D_i * D_j) / denom, where denom = pA * (1 - pA) * pB * (1 - pB), and pA is (pA_i + pA_j) / (n_i + n_j) and likewise for pB.


@lkirk lkirk Jul 28, 2025


I'm not sure we can combine the allele frequencies this way. Consider this example:

>>> Ai = 0.3; Bi = 0.5; Aj = 0.7; Bj = 0.2; ni = 37; nj = 39
>>> A = (Ai + Aj) / (ni + nj)
>>> B = (Bi + Bj) / (ni + nj)
>>> A * (1-A) * B * (1-B)
0.00011849496867350617
>>> (Ai*(1-Ai)*Bi*(1-Bi)) * (Aj*(1-Aj)*Bj*(1-Bj))
0.0017640000000000006

@apragsdale

Sorry, I miswrote that. I meant pA is (nA_i + nA_j) / (n_i + n_j)
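With counts rather than frequencies, the pooled quantity is a proper weighted average and stays in [0, 1]. A sketch using the same illustrative numbers as above (converting the frequencies to allele counts first):

```python
ni, nj = 37, 39
Ai, Bi, Aj, Bj = 0.3, 0.5, 0.7, 0.2   # within-set allele frequencies

# Allele *counts* in each sample set.
nA_i, nA_j = Ai * ni, Aj * nj
nB_i, nB_j = Bi * ni, Bj * nj

# Pooled frequencies: counts over total sample size.
pA = (nA_i + nA_j) / (ni + nj)
pB = (nB_i + nB_j) / (ni + nj)

denom = pA * (1 - pA) * pB * (1 - pB)
# pA and pB are now guaranteed to lie in [0, 1],
# unlike the earlier (Ai + Aj) / (ni + nj).
```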

@lkirk (Contributor Author)

Okay, I added this change. I also added test coverage for unequal sample set sizes, so that we can see how the resulting stats change -- the difference can be substantial in some cases.

p_aB = state[2, k] / n
p_AB = state[0, i] / n
p_Ab = state[1, i] / n
p_aB = state[2, i] / n
@lkirk (Contributor Author)

This fixes an error in the original implementation. The previous tests were rather simple and did not catch the issue.


@jeromekelleher jeromekelleher left a comment


I've gone through the code carefully and implementation-wise it looks very solid.

It's hard to be sure about the low-level stuff in an automatic way; the best approach I've found to check it is to run the method in a loop and watch "top". Can you check this please and report back @lkirk? Just run a couple of very basic tests in a loop like

for _ in range(100000):
    ts.ld_matrix(...)

in a few different ways, so that you're going through the various nominal code paths, and check that the memory usage in top equilibrates to some reasonable value after a few seconds. It's the best way I've found of making sure refcounting is working properly.

I haven't thought about the actual stats in any way - deferring this to @apragsdale and @petrelharp!
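The manual "watch top" check described above can also be scripted. A sketch (my own, not part of this PR) that samples the process's peak RSS between batches of calls; a leak shows up as samples that keep climbing batch after batch:

```python
import resource

def peak_rss_kb():
    # ru_maxrss is KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def rss_samples(fn, batches=5, calls_per_batch=2000):
    """Call fn repeatedly, recording peak RSS after each batch."""
    samples = []
    for _ in range(batches):
        for _ in range(calls_per_batch):
            fn()
        samples.append(peak_rss_kb())
    return samples

# Hypothetical usage: samples = rss_samples(lambda: ts.ld_matrix(...))
# With correct refcounting the samples should plateau after the first batch.
```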

tsk_safe_free(row_positions);
tsk_safe_free(col_positions);
tsk_safe_free(positions);
tsk_safe_free(positions);
@jeromekelleher (Member)

Double free here

@lkirk (Contributor Author)

I'm surprised valgrind didn't catch this, fixed.

@jeromekelleher (Member)

I think it's safe given how tsk_safe_free works, but there's no point in having it in there.

static char *kwlist[] = { "sample_set_sizes", "sample_sets", "indexes", "row_sites",
"col_sites", "row_positions", "column_positions", "mode", NULL };

PyObject *row_sites = NULL, *col_sites = NULL, *row_positions = NULL,
@jeromekelleher (Member)

Can you declare these variables separately one per line please, like

 PyObject *row_sites = NULL;
 PyObject *col_sites = NULL;
// etc

It's how it's done in the rest of the code and makes it easier for mass search and replace etc later on.

@jeromekelleher (Member)

It's a way of saying "watch out for this variable, it has tricky state" vs just standard declarations.

@lkirk (Contributor Author)

I fixed this in the ld_matrix method above as well; it didn't follow that style either.


lkirk commented Aug 4, 2025

> I've gone through the code carefully and implementation-wise it looks very solid.
>
> It's hard to be sure about the low-level stuff in an automatic way; the best approach I've found to check it is to run the method in a loop and watch "top". Can you check this please and report back @lkirk? Just run a couple of very basic tests in a loop like
>
>     for _ in range(100000):
>         ts.ld_matrix(...)
>
> in a few different ways, so that you're going through the various nominal code paths, and check that the memory usage in top equilibrates to some reasonable value after a few seconds. It's the best way I've found of making sure refcounting is working properly.

I wrote a small script to assess a few cases:

#!/usr/bin/env python
from argparse import ArgumentParser
from itertools import combinations_with_replacement

from tests.tsutil import get_example_tree_sequences

tss = {t.id: t.values[0] for t in get_example_tree_sequences() if t.id}
ts = tss["n=100_m=32_rho=0.5"]

sample_sets = [ts.samples()] * 3
indexes = list(combinations_with_replacement(range(len(sample_sets)), 2))

parser = ArgumentParser()
parser.add_argument("test_case", type=int, choices=[0, 1, 2])
test_case = parser.parse_args().test_case

base_args = dict(sample_sets=sample_sets, indexes=indexes)
match test_case:
    case 0:
        ld_args = dict(**base_args)
    case 1:
        ld_args = dict(sites=[list(range(56)), list(range(30, 108))], **base_args)
    case 2:
        ld_args = dict(
            positions=[list(range(18)), list(range(5, 31))], mode="branch", **base_args
        )

print(f"running test case: {test_case} with args:\n{ld_args}")
for i in range(10000):
    ts.ld_matrix(**ld_args)
    if not i % 1000:
        print(i, end=" ")
print()

then, in a loop:

#!/usr/bin/env bash
set -euo pipefail
mkdir memray-out
for test_case in {0..2}; do
    mem_profile="memray-out/leak-check-case-$test_case.bin"
    memray run -o "$mem_profile" ./leak-check "$test_case"
    echo "Generating Flamegraph for test case $test_case"
    memray flamegraph "$mem_profile"
done

Case 0 Result

[memray flamegraph screenshot]

Case 1 Result

[memray flamegraph screenshot]

Case 2 Result

[memray flamegraph screenshot]

And then, just to make sure I would have seen a leak if there was one, I commented out a Py_XDECREF line:

--- a/python/_tskitmodule.c
+++ b/python/_tskitmodule.c
@@ -10780,7 +10780,7 @@ TreeSequence_k_way_ld_matrix(TreeSequence *self, PyObject *args, PyObject *kwds,
     ret = (PyObject *) result_matrix;
     result_matrix = NULL;
 out:
-    Py_XDECREF(row_sites_array);
+    // Py_XDECREF(row_sites_array);
     Py_XDECREF(col_sites_array);
     Py_XDECREF(row_positions_array);
     Py_XDECREF(col_positions_array);

and the memray result from this test was:

[memray flamegraph screenshot]

(although perhaps it's a bit easier to see zoomed in)

[zoomed-in memray flamegraph screenshot]

In any case, I'm pretty happy with this. It's still rather unsatisfying to not have a tool that can check references. I do have some sort of prototype using tracemalloc, but it's a bit half-baked at this point.
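For reference, a minimal version of that tracemalloc idea (my sketch, not the prototype mentioned above) snapshots Python-level allocations before and after a burst of calls and sums the difference. Note that tracemalloc only sees allocations routed through Python's allocator, so leaks from raw malloc in C extension code won't show up:

```python
import tracemalloc

def alloc_growth(fn, n=1000):
    """Net growth (bytes) in Python-tracked allocations across n calls to fn."""
    tracemalloc.start()
    fn()  # warm-up call, so one-time caches don't look like a leak
    before = tracemalloc.take_snapshot()
    for _ in range(n):
        fn()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))

# Hypothetical usage: growth = alloc_growth(lambda: ts.ld_matrix(...))
# A refcounting bug would make growth scale roughly linearly with n.
```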

@lkirk lkirk force-pushed the two-locus-multipopulation branch from c7f33ea to c5bdbc7 Compare August 4, 2025 09:52

lkirk commented Aug 4, 2025

Rebased and squashed.

@jeromekelleher (Member)

> In any case, I'm pretty happy with this. It's still rather unsatisfying to not have a tool that can check references. I do have some sort of prototype using tracemalloc, but it's a bit half-baked at this point.

I'm also happy - thanks!

It's totally unsatisfying doing this manually, but it's one of those things for which there just doesn't seem to be a good solution; a bit of manual eyeball work is the most efficient approach.

@apragsdale

Thanks @lkirk! Looks good to me.

@apragsdale

Jotting down a note about r^2 calculations here: Lloyd and I have been discussing offline the behavior for the two-population r^2 function. If one of the two loci is fixed for a single allele across all samples in both sample sets, the returned value should be nan, mirroring the behavior of r or r^2 within a single sample set with a fixed allele at a locus.

However, if an allele is fixed in only one of the two sample sets, it's not clear that we would want to return nan, as there is variation globally at both loci. Instead, we would check whether either D_1 or D_2 is zero, and return zero before computing the denominator. This would avoid nans, which would otherwise be common, as there will typically be variation that is private to one or the other sample set.
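The convention described above could be sketched as follows (hypothetical helper name; `pA`/`pB` are allele frequencies pooled over both sample sets, as discussed earlier in the thread):

```python
import math

def two_pop_r2(D_i, D_j, pA, pB):
    """Sketch of the proposed two-population r^2 convention."""
    if pA in (0.0, 1.0) or pB in (0.0, 1.0):
        # A locus fixed across *both* sample sets: undefined, mirror one-pop nan.
        return math.nan
    if D_i == 0.0 or D_j == 0.0:
        # No LD in one set (e.g. an allele fixed there): return 0, not nan.
        return 0.0
    denom = pA * (1 - pA) * pB * (1 - pB)
    return (D_i * D_j) / denom
```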

@jeromekelleher (Member)

I think we're just waiting for your thumbs-up on the stats here @petrelharp - I'm happy with the implementation
