
Add --case-confidence flag for weighted case/control status #85

@berntpopp


Summary

Add support for sample-level confidence weights in case/control assignment, enabling weighted logistic regression and weighted SKAT instead of hard binary phenotype classification.

Motivation

In multi-cohort analyses, phenotype certainty varies by data source:

| Source | Confidence | Example |
|--------|------------|---------|
| Confirmed clinical diagnosis | 1.0 | Stoneformers cohort with detailed lab data |
| Retrospective expert curation | 0.7 | GCKD with curated phenotype categories |
| Indication panel inference | 0.4 | AGDE where "ordered stone diagnostics" → inferred stones |

Currently, all cases are treated equally (case_status = 1). This forces users to either:

  1. Accept phenotype misclassification noise (include uncertain cases at full weight)
  2. Exclude uncertain cases entirely (lose statistical power)
  3. Run separate tiered analyses (multiplies compute, complicates interpretation)

Weighted case status is more principled — it downweights noisy phenotypes automatically in a single run.

Current behavior

compute_phenotype_based_case_control_assignment() in helpers.py returns tuple[set[str], set[str]] — pure binary membership. The phenotype vector passed to all tests is np.ndarray of 0/1 integers.
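The current hard 0/1 phenotype vector can be illustrated as follows (sample IDs borrowed from the phenotype file example in this issue; the construction is illustrative, not the actual helpers.py code):

```python
import numpy as np

# Illustrative only: the hard binary phenotype vector tests receive today.
case_ids = {"LB24-1234", "LB21-5678"}
all_samples = ["LB24-1234", "LB21-5678", "LB24-0000"]
phenotype = np.array([1 if s in case_ids else 0 for s in all_samples])
print(phenotype.tolist())  # [1, 1, 0]
```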

Proposed behavior

New CLI parameters

--case-confidence <method>     Weight method: "equal" (default, current behavior),
                                "count" (fraction of case HPO terms matched),
                                "file" (read from phenotype file column)
--case-confidence-column <col>  Column name in phenotype file for pre-computed weights
                                (used when --case-confidence file)

Pre-computed weights via phenotype file (primary use case)

LIMS_ID,case_status,case_weight
LB24-1234,1,1.0
LB21-5678,1,0.7
LB24-9012,1,0.4
LB24-0000,0,1.0

When --case-confidence file --case-confidence-column case_weight:

  • Cases retain their binary case/control assignment (for Fisher's and reporting)
  • The weight column is loaded as sample_weights: np.ndarray and passed to regression-based tests
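A minimal sketch of the weight-loading step, using only the stdlib `csv` module; the helper name `load_sample_weights` and the strict [0, 1] range check are assumptions for illustration, not the actual phenotype.py code:

```python
import csv
import io

import numpy as np

# Hypothetical helper: parse the phenotype file's weight column into the
# sample_weights array passed to regression-based tests.
def load_sample_weights(pheno_file, weight_column: str = "case_weight") -> np.ndarray:
    reader = csv.DictReader(pheno_file)
    weights = np.array([float(row[weight_column]) for row in reader])
    if ((weights < 0) | (weights > 1)).any():  # assumed sanity check
        raise ValueError(f"column {weight_column!r} must lie in [0, 1]")
    return weights

pheno = io.StringIO(
    "LIMS_ID,case_status,case_weight\n"
    "LB24-1234,1,1.0\n"
    "LB21-5678,1,0.7\n"
    "LB24-9012,1,0.4\n"
    "LB24-0000,0,1.0\n"
)
print(load_sample_weights(pheno).tolist())  # [1.0, 0.7, 0.4, 1.0]
```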

HPO match count weights (automatic)

When --case-confidence count:

  • Weight = len(sample_hpo_terms ∩ case_hpo_terms) / len(case_hpo_terms)
  • A sample matching 6/8 case HPO terms gets weight 0.75
  • A sample matching 1/8 gets weight 0.125
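The "count" rule above amounts to a one-line set intersection; a sketch (function name assumed):

```python
# weight = |sample HPO terms ∩ case HPO terms| / |case HPO terms|
def hpo_count_weight(sample_terms: set[str], case_terms: set[str]) -> float:
    if not case_terms:  # guard: no case-defining terms configured
        return 0.0
    return len(sample_terms & case_terms) / len(case_terms)

case_terms = {f"HP:{i:07d}" for i in range(8)}   # 8 case-defining terms
sample = set(sorted(case_terms)[:6])              # sample matches 6 of 8
print(hpo_count_weight(sample, case_terms))  # 0.75
```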

Integration with statistical tests

| Test | Weight handling |
|------|-----------------|
| Fisher's exact | Ignores weights (integer counts only); emit info log |
| Logistic burden | statsmodels GLM with Binomial family, `freq_weights=sample_weights` (the discrete `Logit` model has no weight argument) |
| Linear burden | `statsmodels.WLS(weights=sample_weights)` |
| SKAT/SKAT-O | Weights enter null model fitting |
| ACAT | Combines p-values from weighted tests; no changes needed |

Default behavior

--case-confidence equal (the default) produces sample_weights = np.ones(n_samples), preserving exact backward compatibility.

Implementation sketch

  1. Add sample_weights: np.ndarray | None field to AssociationConfig
  2. Extend compute_phenotype_based_case_control_assignment() to optionally return weights
  3. Add weight loading from phenotype file column in phenotype.py
  4. Pass freq_weights to LogisticBurdenTest.run() and LinearBurdenTest.run()
  5. Pass weights to SKAT null model
  6. Report effective sample size (sum(weights)^2 / sum(weights^2)) in diagnostics
  7. Fisher's test: log info message that weights are ignored for this test type
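The effective sample size in step 6 is the Kish formula; it equals n when all weights are equal and shrinks as weights become more unequal:

```python
import numpy as np

# Kish effective sample size: (sum w)^2 / sum(w^2).
def effective_sample_size(weights: np.ndarray) -> float:
    return float(weights.sum() ** 2 / np.square(weights).sum())

w = np.array([1.0, 1.0, 0.7, 0.4])
print(round(effective_sample_size(w), 2))  # 3.63
```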

Relationship to other features

Use case

Multi-cohort rare variant meta-analysis where phenotype quality varies across cohorts. Users pre-compute confidence scores based on their domain knowledge and provide them in the phenotype file, enabling a single weighted analysis instead of multiple tiered runs.
