## Summary
Add support for sample-level confidence weights in case/control assignment, enabling weighted logistic regression and weighted SKAT instead of hard binary phenotype classification.
## Motivation
In multi-cohort analyses, phenotype certainty varies by data source:
| Source | Confidence | Example |
|---|---|---|
| Confirmed clinical diagnosis | 1.0 | Stoneformers cohort with detailed lab data |
| Retrospective expert curation | 0.7 | GCKD with curated phenotype categories |
| Indication panel inference | 0.4 | AGDE where "ordered stone diagnostics" → inferred stones |
Currently, all cases are treated equally (`case_status = 1`). This forces users to:
- Accept phenotype misclassification noise (include uncertain cases at full weight)
- Exclude uncertain cases entirely (lose statistical power)
- Run separate tiered analyses (multiplies compute, complicates interpretation)
Weighted case status is more principled — it downweights noisy phenotypes automatically in a single run.
## Current behavior
`compute_phenotype_based_case_control_assignment()` in `helpers.py` returns `tuple[set[str], set[str]]` — pure binary membership. The phenotype vector passed to all tests is an `np.ndarray` of 0/1 integers.
## Proposed behavior
### New CLI parameters
```
--case-confidence <method>        Weight method: "equal" (default, current behavior),
                                  "count" (fraction of case HPO terms matched),
                                  "file" (read from phenotype file column)
--case-confidence-column <col>    Column name in phenotype file for pre-computed weights
                                  (used with --case-confidence file)
```
### Pre-computed weights via phenotype file (primary use case)
```csv
LIMS_ID,case_status,case_weight
LB24-1234,1,1.0
LB21-5678,1,0.7
LB24-9012,1,0.4
LB24-0000,0,1.0
```
When `--case-confidence file --case-confidence-column case_weight`:
- Cases retain their binary case/control assignment (for Fisher's and reporting)
- The weight column is loaded as `sample_weights: np.ndarray` and passed to regression-based tests
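The file-based loading could look roughly like this minimal sketch, assuming a pandas-backed phenotype reader. `load_sample_weights` is a hypothetical helper name (not an existing function in the codebase); samples absent from the file, or with a missing weight value, default to 1.0 so they behave exactly as in an unweighted run:

```python
import numpy as np
import pandas as pd


def load_sample_weights(pheno_path: str, weight_column: str,
                        sample_ids: list[str]) -> np.ndarray:
    """Load per-sample confidence weights from a phenotype CSV.

    Missing samples or missing weight values default to 1.0, preserving
    unweighted behavior for those samples.
    """
    df = pd.read_csv(pheno_path).set_index("LIMS_ID")
    # Align to the analysis sample order; absent samples become NaN.
    weights = df[weight_column].reindex(sample_ids).fillna(1.0).to_numpy(dtype=float)
    if np.any((weights < 0.0) | (weights > 1.0)):
        raise ValueError(f"{weight_column} values must lie in [0, 1]")
    return weights
```

Defaulting absent samples to 1.0 (rather than erroring) keeps the weight file optional per cohort, but a stricter variant could require coverage of all samples.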
### HPO match count weights (automatic)
When `--case-confidence count`:
- Weight = `len(sample_hpo_terms ∩ case_hpo_terms) / len(case_hpo_terms)`
- A sample matching 6/8 case HPO terms gets weight 0.75
- A sample matching 1/8 gets weight 0.125
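The count-based weight is a one-line set operation; this sketch (with the hypothetical helper name `hpo_count_weight`) shows the intended formula, including a guard for an empty case term set:

```python
def hpo_count_weight(sample_hpo_terms: set[str], case_hpo_terms: set[str]) -> float:
    """Fraction of the case-defining HPO term set matched by a sample."""
    if not case_hpo_terms:
        # Degenerate case: no case terms defined, fall back to full weight.
        return 1.0
    return len(sample_hpo_terms & case_hpo_terms) / len(case_hpo_terms)
```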
### Integration with statistical tests
| Test | Weight handling |
|---|---|
| Fisher's exact | Ignores weights (integer counts only) — emit info log |
| Logistic burden | `statsmodels` `GLM(family=Binomial(), freq_weights=sample_weights)` (the discrete `Logit` class does not accept weights) |
| Linear burden | `statsmodels.WLS(weights=sample_weights)` |
| SKAT/SKAT-O | Weights incorporated into null model fitting |
| ACAT | Works on p-values from weighted tests — no changes needed |
### Default behavior
`--case-confidence equal` (the default) produces `sample_weights = np.ones(n_samples)`, preserving exact backward compatibility.
## Implementation sketch
- Add `sample_weights: np.ndarray | None` field to `AssociationConfig`
- Extend `compute_phenotype_based_case_control_assignment()` to optionally return weights
- Add weight loading from the phenotype file column in `phenotype.py`
- Pass `freq_weights` to `LogisticBurdenTest.run()` and `LinearBurdenTest.run()`
- Pass weights to the SKAT null model
- Report effective sample size (`sum(weights)^2 / sum(weights^2)`) in diagnostics
- Fisher's test: log an info message that weights are ignored for this test type
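The diagnostics formula above is the Kish effective sample size; a minimal sketch of the proposed helper (name hypothetical):

```python
import numpy as np


def effective_sample_size(weights: np.ndarray) -> float:
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Equals n when all weights are equal, and shrinks as weights
    become more uneven — a useful power diagnostic for weighted runs.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.square(w).sum())
```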
## Relationship to other features
- Builds on the v0.15.0 association framework (logistic regression, SKAT, covariates)
- Complements `--restrict-regions` (#84) for multi-cohort workflows
- Orthogonal to variant-level weights (`weights.py`) — this is sample-level
## Use case
Multi-cohort rare variant meta-analysis where phenotype quality varies across cohorts. Users pre-compute confidence scores based on their domain knowledge and provide them in the phenotype file, enabling a single weighted analysis instead of multiple tiered runs.