29 changes: 18 additions & 11 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -7,10 +7,11 @@ This document provides project-specific instructions and architectural context f
WisecondorX is a bioinformatics tool for Copy Number Variation (CNV) analysis in whole-genome sequencing data. It is an evolution of the original WISECONDOR algorithm, optimized for speed and accuracy.

### Core Stack
- **Language:** Python 3.10+
- **Data Science:** NumPy, SciPy, Pandas, Scikit-learn
- **Language:** Python 3.12+
- **Data Science:** NumPy, SciPy, Pandas, Scikit-learn, Matplotlib
- **Bioinformatics:** Pysam
- **Supporting Logic:** R (specifically for Circular Binary Segmentation (CBS))
- **CLI:** Typer
- **Supporting Logic:** R (CBS analysis via `DNAcopy` from Bioconductor, `jsonlite`, `png`)
- **Environment & Dependency Management:** [Pixi](https://pixi.sh)
- **Linting & Formatting:** Ruff (Line length: 79)
- **Testing:** Pytest (using `unittest` style tests)
@@ -20,17 +21,23 @@ WisecondorX is a bioinformatics tool for Copy Number Variation (CNV) analysis in
### Pipeline Stages
1. **Convert:** BAM/CRAM files are converted into binned read counts stored in compressed NumPy (`.npz`) files.
2. **Newref:** Multiple `.npz` files from control samples are combined to create a reference `.npz` file. This involves PCA and gender modeling.
3. **Predict:** A test sample `.npz` is compared against a reference `.npz` to detect CNVs.
3. **Gender:** (Optional) Predicts the sex of a sample based on the reference model.
4. **Predict:** A test sample `.npz` is compared against a reference `.npz` to detect CNVs using Circular Binary Segmentation (CBS).
5. **RefQC:** (Supplementary) Quality control for a generated reference.
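The core stages above chain together on the command line. As a hedged sketch (file names here are hypothetical; subcommand names come from the pipeline list, and the positional arguments mirror the README usage), the chain can be expressed as argv lists:

```python
import subprocess  # only needed if you actually dispatch the commands


def pipeline_commands(sample="sample", ref_dir="controls", out_id="sample_out"):
    """Build argv lists for the convert -> newref -> predict chain."""
    return [
        ["WisecondorX", "convert", f"{sample}.bam", f"{sample}.npz"],
        ["WisecondorX", "newref", f"{ref_dir}/*.npz", "reference.npz"],
        ["WisecondorX", "predict", f"{sample}.npz", "reference.npz", out_id],
    ]


# To run for real (requires WisecondorX on PATH):
# for cmd in pipeline_commands():
#     subprocess.run(cmd, check=True)
```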

### Data Flow & "Contracts"
The project relies heavily on `.npz` files for intermediate storage. These files have strict "contracts" regarding their keys and data structures.
- **Contract Validation:** When modifying data processing logic, ensure that the `.npz` output remains compatible with subsequent stages. Refer to `tests/test_newref_contract.py` and `tests/test_predict_contract.py` for examples of these invariants.
- **Intermediate Files:** Creation of references often involves temporary "prep" and "part" files.
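A minimal sketch of such a contract check, using the key names the `convert` stage in this PR actually writes (`binsize`, `reads_per_bin`, `quality`); reference files carry a different key set:

```python
import tempfile
from pathlib import Path

import numpy as np

# Keys written by the convert stage; other stages have their own contracts.
REQUIRED_CONVERT_KEYS = {"binsize", "reads_per_bin", "quality"}


def check_convert_contract(npz_path: Path) -> int:
    """Raise if a convert-stage .npz is missing required keys."""
    with np.load(npz_path, allow_pickle=True) as data:
        missing = REQUIRED_CONVERT_KEYS - set(data.files)
        if missing:
            raise ValueError(f"contract violation, missing: {sorted(missing)}")
        return int(data["binsize"])


# Round-trip demo on a toy file
out = Path(tempfile.mkdtemp()) / "toy.npz"
np.savez_compressed(
    out,
    binsize=5000,
    reads_per_bin={"1": np.zeros(3, dtype=np.int32)},
    quality={"mapped": 0},
)
print(check_convert_contract(out))  # 5000
```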

### CLI Structure
- `src/wisecondorx/main.py`: The main entry point using `argparse` for subcommands.
- `*_control.py`: Higher-level coordination of pipeline stages.
- `*_tools.py`: Lower-level computational logic.
- `src/wisecondorx/main.py`: The main entry point using `typer` for subcommands.
- `src/wisecondorx/convert.py`: `convert` subcommand.
- `src/wisecondorx/newref.py`: `newref` subcommand.
- `src/wisecondorx/predict.py`: `predict` and `gender` subcommands.
- `src/wisecondorx/refqc.py`: `refqc` subcommand.
- `src/wisecondorx/utils.py`: Helper functions used by the pipeline stages.
- `src/wisecondorx/plotter.py`: Plotting functions for the pipeline stages.
- `src/wisecondorx/include/`: Contains the R script `CBS.R` (Circular Binary Segmentation), invoked via Python's `subprocess`.

## Development Workflow
@@ -40,21 +47,21 @@ Use `pixi run` for all standard development tasks to ensure environment consiste
- `pixi run lint`: Check code quality with Ruff.
- `pixi run format`: Format code with Ruff.
- `pixi run test`: Run the test suite.
- `pixi run hooks`: Run pre-commit hooks (managed by `prek`).
- `pixi run docker-build`: Build the Docker image.

### Code Style
- Adhere to the Ruff configuration in `pyproject.toml`.
- **Line Length:** 79 characters.
- **Import Sorting:** Ruff handles this; ensure it's run before committing.
- **String Formatting:** Use f-strings instead of `.format()` and limit outputs to 3 decimals where applicable.
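For example, the 3-decimal rule expressed as an f-string format spec:

```python
import math

ratio = math.pi / 2
# f-string with a fixed 3-decimal format spec, per the style rule above
print(f"ratio: {ratio:.3f}")  # ratio: 1.571
```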

### Testing Guidelines
- **Contract Tests:** When adding features that change how data is stored or processed, update the contract tests in `tests/`.
- **Functional Tests:** Aim for high coverage of the computational logic in `*_tools.py`.
- **Functional Tests:** Aim for high coverage of the computational logic.
- **R Integration:** Be mindful that changes to R scripts may require updates to the Python wrapper logic that calls them.
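A sketch of the Python side of such a wrapper: the script path matches `src/wisecondorx/include/CBS.R` mentioned above, but the argument names here are hypothetical, not the project's actual interface:

```python
from pathlib import Path


def build_cbs_command(infile: Path, outfile: Path) -> list[str]:
    """Build the Rscript argv for the CBS step (flag names illustrative)."""
    script = Path("src/wisecondorx/include/CBS.R")
    return [
        "Rscript",
        str(script),
        "--infile",
        str(infile),
        "--outfile",
        str(outfile),
    ]


# Actually running it requires R plus DNAcopy/jsonlite, e.g.:
# subprocess.run(build_cbs_command(Path("in.json"), Path("out.json")), check=True)
```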

## Gemini CLI Specific Instructions

- **Environment Awareness:** Always prefer `pixi run <command>` over direct execution if a pixi task exists.
- **Binary Files:** When investigating issues related to `.npz` files, use `numpy.load` within a script or via `run_shell_command` to inspect the contents.
- **Surgical Updates:** When fixing bugs, prioritize fixing the underlying logic in `*_tools.py` and adding a corresponding test case.
- **Context Management:** The codebase uses multiple small modules. When working on a specific stage (e.g., `predict`), read both the `_control.py` and `_tools.py` for that stage to understand the full context.
- **Surgical Updates:** When fixing bugs, prioritize fixing the underlying logic in lower-level functions with minimal code changes, and add a corresponding test case.
6 changes: 0 additions & 6 deletions README.md
@@ -84,8 +84,6 @@ WisecondorX convert input.bam/cram output.npz [--optional arguments]
| `--binsize x` | Size per bin in bp; the reference bin size should be a multiple of this value. Note that this parameter does not impact the resolution, yet it can be used to optimize processing speed (default: x=5e3) |
| `--normdup` | Use this flag to avoid duplicate removal |

&rarr; Bash recipe at `docs/include/pipeline/convert.sh`

### Stage (2) Create reference

```bash
@@ -102,8 +100,6 @@ WisecondorX newref reference_input_dir/*.npz reference_output.npz [--optional ar
| `--plotyfrac x` | plots Y read fraction histogram and Gaussian mixture fit to file x, can help when setting `--yfrac` manually; software quits after plotting |
| `--cpus x` | Number of threads requested (default: x=1) |

&rarr; Bash recipe at `docs/include/pipeline/newref.sh`

### Stage (3) Predict copy number alterations

```bash
@@ -129,8 +125,6 @@ WisecondorX predict test_input.npz reference_input.npz output_id [--optional arg

<sup>**(\*)** At least one of these output formats should be selected</sup>

&rarr; Bash recipe at `docs/include/pipeline/predict.sh`

### Additional functionality

```bash
22 changes: 0 additions & 22 deletions docs/include/pipeline/convert.sh

This file was deleted.

19 changes: 0 additions & 19 deletions docs/include/pipeline/newref.sh

This file was deleted.

23 changes: 0 additions & 23 deletions docs/include/pipeline/predict.sh

This file was deleted.

4 changes: 2 additions & 2 deletions pixi.lock


6 changes: 3 additions & 3 deletions pyproject.toml
@@ -24,7 +24,7 @@ classifiers = [
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Bio-Informatics",
]
requires-python = ">=3.10"
requires-python = ">=3.12"
dependencies = [
"scipy>=1.17",
"scikit-learn>=1.8",
@@ -57,7 +57,7 @@ version = {attr = "wisecondorx.__version__"}

[tool.ruff]
line-length = 79
target-version = "py310"
target-version = "py312"

[tool.ruff.lint]
ignore = ["E266", "E501", "C901"]
@@ -68,7 +68,7 @@ channels = ["conda-forge", "bioconda"]
platforms = ["linux-64", "osx-arm64"]

[tool.pixi.dependencies]
python = ">=3.10"
python = ">=3.12"
pysam = "0.23.*"
scipy = "1.17.*"
scikit-learn = "1.8.*"
163 changes: 163 additions & 0 deletions src/wisecondorx/convert.py
@@ -0,0 +1,163 @@
# WisecondorX

import logging
import sys
from pathlib import Path
from typing import Annotated

import numpy as np
import pysam
import typer

def wcx_convert(
    infile: Annotated[
        Path,
        typer.Argument(
            help="aligned reads input for conversion (.bam or .cram)",
            exists=True,
            file_okay=True,
            dir_okay=False,
            readable=True,
            resolve_path=True,
        ),
    ],
    prefix: Annotated[Path, typer.Argument(help="Output prefix")],
    reference: Annotated[
        Path | None,
        typer.Option(
            "-r",
            "--reference",
            help="Fasta reference to be used during cram conversion",
            exists=True,
            file_okay=True,
            dir_okay=False,
            readable=True,
            resolve_path=True,
        ),
    ] = None,
    binsize: Annotated[
        int, typer.Option("--binsize", help="Bin size (bp)")
    ] = 5000,
    normdup: Annotated[
        bool, typer.Option("--normdup", help="Do not remove duplicates")
    ] = False,
) -> None:
    """
    Convert and filter aligned reads to .npz format.
    """

    # check that infile exists and is a regular file
    if not (infile.exists() and infile.is_file()):
        logging.error(f"Input file {infile} does not exist or is not a file.")
        sys.exit(1)

    logging.info("Importing data ...")

    reads_per_chromosome_bin: dict[str, np.ndarray | None] = {
        str(i): None for i in range(1, 25)
    }

    reads_seen = 0
    reads_kept = 0
    reads_mapq = 0
    reads_rmdup = 0
    reads_pairf = 0

    logging.info("Converting aligned reads")

    mode = "rb" if infile.suffix == ".bam" else "rc"
    ref_filename = str(reference) if reference else None

    if mode == "rc" and not ref_filename:
        logging.error(
            "Cram inputs need a reference fasta provided through the '--reference' flag."
        )
        sys.exit(1)
Comment on lines +72 to +79

> **Copilot AI** (Mar 30, 2026): Input format detection treats any non-.bam file as CRAM (`mode = "rb" if ... else "rc"`). This misclassifies unsupported extensions and produces a misleading "Cram inputs need a reference" error. Add an explicit suffix check for .bam/.cram and error out for anything else.

    with pysam.AlignmentFile(
        infile, mode, reference_filename=ref_filename
    ) as reads_file:
        for index, chr in enumerate(reads_file.references):
            larp = -1
            larp2 = -1
            chr_name = chr
            if chr_name[:3].lower() == "chr":
                chr_name = chr_name[3:]

            if chr_name == "X":
                chr_name = "23"
            elif chr_name == "Y":
                chr_name = "24"

            if chr_name not in reads_per_chromosome_bin:
                continue

            logging.info(
                f"Working at {chr}; processing {int(reads_file.lengths[index] / float(binsize) + 1)} bins"
            )
            counts = np.zeros(
                int(reads_file.lengths[index] / float(binsize) + 1),
                dtype=np.int32,
            )
            bam_chr = reads_file.fetch(chr)

            for read in bam_chr:
                if read.is_paired:
                    if not read.is_proper_pair:
                        reads_pairf += 1
                        continue
                    if (
                        not normdup
                        and larp == read.pos
                        and larp2 == read.next_reference_start
                    ):
                        reads_rmdup += 1
                    else:
                        if read.mapping_quality >= 1:
                            location = read.pos / binsize
                            counts[int(location)] += 1
                        else:
                            reads_mapq += 1

                    larp2 = read.next_reference_start
                    reads_seen += 1
                    larp = read.pos
                else:
                    if not normdup and larp == read.pos:
                        reads_rmdup += 1
                    else:
                        if read.mapping_quality >= 1:
                            location = read.pos / binsize
                            counts[int(location)] += 1
                        else:
                            reads_mapq += 1

                    reads_seen += 1
                    larp = read.pos

            reads_per_chromosome_bin[chr_name] = counts
            reads_kept += sum(counts)

        # collect mapping statistics while the file handle is still open
        qual_info: dict[str, int] = {
            "mapped": reads_file.mapped,
            "unmapped": reads_file.unmapped,
            "no_coordinate": reads_file.nocoordinate,
            "filter_rmdup": reads_rmdup,
            "filter_mapq": reads_mapq,
            "pre_retro": reads_seen,
            "post_retro": reads_kept,
            "pair_fail": reads_pairf,
        }

    np.savez_compressed(
        Path(f"{prefix}.npz"),
        binsize=binsize,
        reads_per_bin=reads_per_chromosome_bin,
        quality=qual_info,
    )

    logging.info("Finished conversion")