C and Python API for two-way two-locus stats #3243

Open · lkirk wants to merge 1 commit into main from two-locus-multipopulation
Conversation

@lkirk lkirk commented Jul 25, 2025

This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the Python version, except that during testing I uncovered a small issue with normalisation: we need to handle the case where sample sets are of different sizes. The fix was to average the normalisation factor for each sample set.

Test coverage has been added to cover C, low-level python and some high-level tests.


codecov bot commented Jul 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.18%. Comparing base (c23aafc) to head (c5bdbc7).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3243      +/-   ##
==========================================
- Coverage   89.61%   86.18%   -3.43%     
==========================================
  Files          28       10      -18     
  Lines       31983    17142   -14841     
  Branches     5888     3311    -2577     
==========================================
- Hits        28660    14774   -13886     
+ Misses       1888     1323     -565     
+ Partials     1435     1045     -390     
Flag                 Coverage Δ
c-tests              86.70% <100.00%> (+0.10%) ⬆️
lwt-tests            80.38% <ø> (ø)
python-c-tests       ?
python-tests         ?
python-tests-numpy1  ?

Flags with carried forward coverage won't be shown.

Files with missing lines  Coverage Δ
c/tskit/trees.c           90.80% <100.00%> (+0.19%) ⬆️

... and 18 files with indirect coverage changes


@apragsdale apragsdale left a comment


Hi @lkirk - thanks for starting this PR. I have a few suggestions for the multi-pop stat calculations and the normalization here.

nj = sample_set_sizes[j]
wAB_i = hap_weights[0, i]
wAB_j = hap_weights[0, j]
result[k] = (wAB_i / ni / 2) + (wAB_j / nj / 2)


Based on our discussion earlier this week, I think it's more natural to norm by the total weights and sample sizes, rather than the average. (So: (wAB_i + wAB_j) / (ni + nj).)
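To make the difference concrete, here is a small sketch (illustrative numbers only; the `wAB`/`n` names mirror the snippet above) comparing the averaged normalisation in the current diff with the pooled normalisation suggested here:

```python
# Hypothetical haplotype weights and sample-set sizes, for illustration only.
wAB_i, wAB_j = 12.0, 30.0
ni, nj = 40, 120

# Averaged per-set normalisation (as in the diff above):
averaged = (wAB_i / ni / 2) + (wAB_j / nj / 2)

# Pooled normalisation (as suggested): total weight over total sample size.
pooled = (wAB_i + wAB_j) / (ni + nj)

# The two agree when ni == nj but diverge as the sizes differ.
```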

@lkirk (Contributor Author)

I remember our conversation about this, I've incorporated the change.

c/tskit/trees.c Outdated
D_j = p_AB - (p_A * p_B);
denom_j = sqrt(p_A * p_B * (1 - p_A) * (1 - p_B));

result[k] = (D_i * D_j) / (denom_i * denom_j);


This will spuriously produce a NaN if the allele is not variable in one of the two sample sets. From our discussion earlier this week, I think we can calculate result = (D_i * D_j) / denom, where denom = pA * (1 - pA) * pB * (1 - pB), and pA is (pA_i + pA_j) / (n_i + n_j) and likewise for pB.


@lkirk lkirk Jul 28, 2025


I'm not sure we can combine the allele frequencies this way. Consider this example:

>>> Ai = 0.3; Bi = 0.5; Aj = 0.7; Bj = 0.2; ni = 37; nj = 39
>>> A = (Ai + Aj) / (ni + nj)
>>> B = (Bi + Bj) / (ni + nj)
>>> A * (1-A) * B * (1-B)
0.00011849496867350617
>>> (Ai*(1-Ai)*Bi*(1-Bi)) * (Aj*(1-Aj)*Bj*(1-Bj))
0.0017640000000000006

@apragsdale

Sorry, I miswrote that. I meant pA is (nA_i + nA_j) / (n_i + n_j)
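With counts rather than frequencies, the pooled quantity is a proper weighted average and stays in [0, 1]. A sketch using the same illustrative numbers as above (converting the frequencies to allele counts first):

```python
ni, nj = 37, 39
Ai, Bi, Aj, Bj = 0.3, 0.5, 0.7, 0.2   # within-set allele frequencies

# Allele *counts* in each sample set.
nA_i, nA_j = Ai * ni, Aj * nj
nB_i, nB_j = Bi * ni, Bj * nj

# Pooled frequencies: counts over total sample size.
pA = (nA_i + nA_j) / (ni + nj)
pB = (nB_i + nB_j) / (ni + nj)

denom = pA * (1 - pA) * pB * (1 - pB)
# pA and pB are now guaranteed to lie in [0, 1],
# unlike the earlier (Ai + Aj) / (ni + nj).
```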

@lkirk (Contributor Author)

Okay, I added this change. I also added test coverage for unequal sample set sizes, so that we can see how the resulting stats change -- the difference can be substantial in some cases.

p_aB = state[2, k] / n
p_AB = state[0, i] / n
p_Ab = state[1, i] / n
p_aB = state[2, i] / n
@lkirk (Contributor Author)

This fixes an error in the original implementation. The previous tests were rather simple and did not catch the issue.


@jeromekelleher jeromekelleher left a comment


I've gone through the code carefully and implementation-wise it looks very solid.

It's hard to be sure about the low-level stuff in an automatic way; the best approach I've found to check it is to run the method in a loop and watch "top". Can you check this please and report back @lkirk? Just run a couple of very basic tests in a loop like

for _ in range(100000):
    ts.ld_matrix(...)

in a few different ways, so that you're going through the various nominal code paths, and check that the memory usage in top equilibrates to some reasonable value after a few seconds. It's the best way I've found of making sure refcounting is working properly.

I haven't thought about the actual stats in any way - deferring this to @apragsdale and @petrelharp!
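The manual "watch top" check described above can also be scripted. A sketch (my own, not part of this PR) that samples the process's peak RSS between batches of calls; a leak shows up as samples that keep climbing batch after batch:

```python
import resource

def peak_rss_kb():
    # ru_maxrss is KiB on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def rss_samples(fn, batches=5, calls_per_batch=2000):
    """Call fn repeatedly, recording peak RSS after each batch."""
    samples = []
    for _ in range(batches):
        for _ in range(calls_per_batch):
            fn()
        samples.append(peak_rss_kb())
    return samples

# Hypothetical usage: samples = rss_samples(lambda: ts.ld_matrix(...))
# With correct refcounting the samples should plateau after the first batch.
```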

tsk_safe_free(row_positions);
tsk_safe_free(col_positions);
tsk_safe_free(positions);
tsk_safe_free(positions);
@jeromekelleher (Member)

Double free here

@lkirk (Contributor Author)

I'm surprised valgrind didn't catch this, fixed.

@jeromekelleher (Member)

I think it's safe given how tsk_safe_free works, but there's no point in having it in there.

static char *kwlist[] = { "sample_set_sizes", "sample_sets", "indexes", "row_sites",
"col_sites", "row_positions", "column_positions", "mode", NULL };

PyObject *row_sites = NULL, *col_sites = NULL, *row_positions = NULL,
@jeromekelleher (Member)

Can you declare these variables separately one per line please, like

 PyObject *row_sites = NULL;
 PyObject *col_sites = NULL;
// etc

It's how it's done in the rest of the code and makes it easier for mass search and replace etc later on.

@jeromekelleher (Member)

It's a way of saying "watch out for this variable, it has tricky state" vs just standard declarations.

@lkirk (Contributor Author)

I fixed this in the ld_matrix method above as well; it didn't follow that style either.


lkirk commented Aug 4, 2025

> I've gone through the code carefully and implementation-wise it looks very solid.
>
> It's hard to be sure about the low-level stuff in an automatic way; the best approach I've found to check it is to run the method in a loop and watch "top". Can you check this please and report back @lkirk? Just run a couple of very basic tests in a loop like
>
>     for _ in range(100000):
>         ts.ld_matrix(...)
>
> in a few different ways, so that you're going through the various nominal code paths, and check that the memory usage in top equilibrates to some reasonable value after a few seconds. It's the best way I've found of making sure refcounting is working properly.

I wrote a small script to assess a few cases:

#!/usr/bin/env python
from argparse import ArgumentParser
from itertools import combinations_with_replacement

from tests.tsutil import get_example_tree_sequences

tss = {t.id: t.values[0] for t in get_example_tree_sequences() if t.id}
ts = tss["n=100_m=32_rho=0.5"]

sample_sets = [ts.samples()] * 3
indexes = list(combinations_with_replacement(range(len(sample_sets)), 2))

parser = ArgumentParser()
parser.add_argument("test_case", type=int, choices=[0, 1, 2])
test_case = parser.parse_args().test_case

base_args = dict(sample_sets=sample_sets, indexes=indexes)
match test_case:
    case 0:
        ld_args = dict(**base_args)
    case 1:
        ld_args = dict(sites=[list(range(56)), list(range(30, 108))], **base_args)
    case 2:
        ld_args = dict(
            positions=[list(range(18)), list(range(5, 31))], mode="branch", **base_args
        )

print(f"running test case: {test_case} with args:\n{ld_args}")
for i in range(10000):
    ts.ld_matrix(**ld_args)
    if not i % 1000:
        print(i, end=" ")
print()

then, in a loop:

#!/usr/bin/env bash
set -euo pipefail
mkdir memray-out
for test_case in {0..2}; do
    mem_profile="memray-out/leak-check-case-$test_case.bin"
    memray run -o "$mem_profile" ./leak-check "$test_case"
    echo "Generating Flamegraph for test case $test_case"
    memray flamegraph "$mem_profile"
done

Case 0 Result

[memray flamegraph screenshot]

Case 1 Result

[memray flamegraph screenshot]

Case 2 Result

[memray flamegraph screenshot]

And then, just to make sure I would have seen a leak if there was one, I commented out a Py_XDECREF line:

--- a/python/_tskitmodule.c
+++ b/python/_tskitmodule.c
@@ -10780,7 +10780,7 @@ TreeSequence_k_way_ld_matrix(TreeSequence *self, PyObject *args, PyObject *kwds,
     ret = (PyObject *) result_matrix;
     result_matrix = NULL;
 out:
-    Py_XDECREF(row_sites_array);
+    // Py_XDECREF(row_sites_array);
     Py_XDECREF(col_sites_array);
     Py_XDECREF(row_positions_array);
     Py_XDECREF(col_positions_array);

and the memray result from this test was:

[memray flamegraph screenshot]

(although perhaps it's a bit easier to see zoomed in)

[zoomed-in memray flamegraph screenshot]

In any case, I'm pretty happy with this. It's still rather unsatisfying to not have a tool that can check references. I do have some sort of prototype using tracemalloc, but it's a bit half-baked at this point.
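For reference, a minimal version of that tracemalloc idea (my sketch, not the prototype mentioned above) snapshots Python-level allocations before and after a burst of calls and sums the difference. Note that tracemalloc only sees allocations routed through Python's allocator, so leaks from raw malloc in C extension code won't show up:

```python
import tracemalloc

def alloc_growth(fn, n=1000):
    """Net growth (bytes) in Python-tracked allocations across n calls to fn."""
    tracemalloc.start()
    fn()  # warm-up call, so one-time caches don't look like a leak
    before = tracemalloc.take_snapshot()
    for _ in range(n):
        fn()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    return sum(stat.size_diff for stat in after.compare_to(before, "lineno"))

# Hypothetical usage: growth = alloc_growth(lambda: ts.ld_matrix(...))
# A refcounting bug would make growth scale roughly linearly with n.
```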

@lkirk lkirk force-pushed the two-locus-multipopulation branch from c7f33ea to c5bdbc7 Compare August 4, 2025 09:52

lkirk commented Aug 4, 2025

Rebased and squashed.

@jeromekelleher (Member)

> In any case, I'm pretty happy with this. It's still rather unsatisfying to not have a tool that can check references. I do have some sort of prototype using tracemalloc, but it's a bit half-baked at this point.

I'm also happy - thanks!

It's totally unsatisfying doing this manually, but it's one of those things for which there just doesn't seem to be a good solution; a bit of manual eyeball work is the most efficient approach.

@apragsdale

Thanks @lkirk! Looks good to me.

@apragsdale

Jotting down a note about r^2 calculations here: Lloyd and I have been discussing offline the behavior for the two-population r^2 function. If one of the two loci is fixed for a single allele across all samples in both sample sets, the returned value should be nan, mirroring the behavior of r or r^2 within a single sample set with a fixed allele at a locus.

However, if an allele is fixed in only one of the two sample sets, it's not clear that we would want to return nan, as there is variation globally at both loci. Instead, we would check whether either D_1 or D_2 is zero, and return zero before computing the denominator. This would avoid nans, which would otherwise be common, as there will typically be variation that is private to one or the other sample set.
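The convention described above could be sketched as follows (hypothetical helper name; `pA`/`pB` are allele frequencies pooled over both sample sets, as discussed earlier in the thread):

```python
import math

def two_pop_r2(D_i, D_j, pA, pB):
    """Sketch of the proposed two-population r^2 convention."""
    if pA in (0.0, 1.0) or pB in (0.0, 1.0):
        # A locus fixed across *both* sample sets: undefined, mirror one-pop nan.
        return math.nan
    if D_i == 0.0 or D_j == 0.0:
        # No LD in one set (e.g. an allele fixed there): return 0, not nan.
        return 0.0
    denom = pA * (1 - pA) * pB * (1 - pB)
    return (D_i * D_j) / denom
```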

@jeromekelleher (Member)

I think we're just waiting for your thumbs-up on the stats here @petrelharp - I'm happy with the implementation
