29 changes: 18 additions & 11 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -7,10 +7,11 @@ This document provides project-specific instructions and architectural context f
WisecondorX is a bioinformatics tool for Copy Number Variation (CNV) analysis in whole-genome sequencing data. It is an evolution of the original WISECONDOR algorithm, optimized for speed and accuracy.

### Core Stack
- **Language:** Python 3.10+
- **Data Science:** NumPy, SciPy, Pandas, Scikit-learn
- **Language:** Python 3.12+
- **Data Science:** NumPy, SciPy, Pandas, Scikit-learn, Matplotlib
- **Bioinformatics:** Pysam
- **Supporting Logic:** R (specifically for Circular Binary Segmentation (CBS))
- **CLI:** Typer
- **Supporting Logic:** R (CBS analysis via `DNAcopy` from Bioconductor, `jsonlite`, `png`)
- **Environment & Dependency Management:** [Pixi](https://pixi.sh)
- **Linting & Formatting:** Ruff (Line length: 79)
- **Testing:** Pytest (using `unittest` style tests)
@@ -20,17 +21,23 @@ WisecondorX is a bioinformatics tool for Copy Number Variation (CNV) analysis in
### Pipeline Stages
1. **Convert:** BAM/CRAM files are converted into binned read counts stored in compressed NumPy (`.npz`) files.
2. **Newref:** Multiple `.npz` files from control samples are combined to create a reference `.npz` file. This involves PCA and gender modeling.
3. **Predict:** A test sample `.npz` is compared against a reference `.npz` to detect CNVs.
3. **Gender:** (Optional) Predicts the sex of a sample based on the reference model.
4. **Predict:** A test sample `.npz` is compared against a reference `.npz` to detect CNVs using Circular Binary Segmentation (CBS).
5. **RefQC:** (Supplementary) Quality control for a generated reference.
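The core stages above chain together on the command line. As a hedged sketch (file names here are hypothetical; subcommand names come from the pipeline list, and the positional arguments mirror the README usage), the chain can be expressed as argv lists:

```python
import subprocess  # only needed if you actually dispatch the commands


def pipeline_commands(sample="sample", ref_dir="controls", out_id="sample_out"):
    """Build argv lists for the convert -> newref -> predict chain."""
    return [
        ["WisecondorX", "convert", f"{sample}.bam", f"{sample}.npz"],
        ["WisecondorX", "newref", f"{ref_dir}/*.npz", "reference.npz"],
        ["WisecondorX", "predict", f"{sample}.npz", "reference.npz", out_id],
    ]


# To run for real (requires WisecondorX on PATH):
# for cmd in pipeline_commands():
#     subprocess.run(cmd, check=True)
```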

### Data Flow & "Contracts"
The project relies heavily on `.npz` files for intermediate storage. These files have strict "contracts" regarding their keys and data structures.
- **Contract Validation:** When modifying data processing logic, ensure that the `.npz` output remains compatible with subsequent stages. Refer to `tests/test_newref_contract.py` and `tests/test_predict_contract.py` for examples of these invariants.
- **Intermediate Files:** Creation of references often involves temporary "prep" and "part" files.
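A minimal sketch of such a contract check, using the key names the `convert` stage in this PR actually writes (`binsize`, `reads_per_bin`, `quality`); reference files carry a different key set:

```python
import tempfile
from pathlib import Path

import numpy as np

# Keys written by the convert stage; other stages have their own contracts.
REQUIRED_CONVERT_KEYS = {"binsize", "reads_per_bin", "quality"}


def check_convert_contract(npz_path: Path) -> int:
    """Raise if a convert-stage .npz is missing required keys."""
    with np.load(npz_path, allow_pickle=True) as data:
        missing = REQUIRED_CONVERT_KEYS - set(data.files)
        if missing:
            raise ValueError(f"contract violation, missing: {sorted(missing)}")
        return int(data["binsize"])


# Round-trip demo on a toy file
out = Path(tempfile.mkdtemp()) / "toy.npz"
np.savez_compressed(
    out,
    binsize=5000,
    reads_per_bin={"1": np.zeros(3, dtype=np.int32)},
    quality={"mapped": 0},
)
print(check_convert_contract(out))  # 5000
```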

### CLI Structure
- `src/wisecondorx/main.py`: The main entry point using `argparse` for subcommands.
- `*_control.py`: Higher-level coordination of pipeline stages.
- `*_tools.py`: Lower-level computational logic.
- `src/wisecondorx/main.py`: The main entry point using `typer` for subcommands.
- `src/wisecondorx/convert.py`: `convert` subcommand.
- `src/wisecondorx/newref.py`: `newref` subcommand.
- `src/wisecondorx/predict.py`: `predict` and `gender` subcommands.
- `src/wisecondorx/refqc.py`: `refqc` subcommand.
- `src/wisecondorx/utils.py`: Helper functions used by the pipeline stages.
- `src/wisecondorx/plotter.py`: Plotting functions for the pipeline stages.
- `src/wisecondorx/include/`: Contains the R script `CBS.R` (Circular Binary Segmentation), invoked via Python's `subprocess`.

## Development Workflow
@@ -40,21 +47,21 @@ Use `pixi run` for all standard development tasks to ensure environment consiste
- `pixi run lint`: Check code quality with Ruff.
- `pixi run format`: Format code with Ruff.
- `pixi run test`: Run the test suite.
- `pixi run hooks`: Run pre-commit hooks (managed by `prek`).
- `pixi run docker-build`: Build the Docker image.

### Code Style
- Adhere to the Ruff configuration in `pyproject.toml`.
- **Line Length:** 79 characters.
- **Import Sorting:** Ruff handles this; ensure it's run before committing.
- **String Formatting:** Use f-strings instead of `.format()` and limit outputs to 3 decimals where applicable.
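For example, the 3-decimal rule expressed as an f-string format spec:

```python
import math

ratio = math.pi / 2
# f-string with a fixed 3-decimal format spec, per the style rule above
print(f"ratio: {ratio:.3f}")  # ratio: 1.571
```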

### Testing Guidelines
- **Contract Tests:** When adding features that change how data is stored or processed, update the contract tests in `tests/`.
- **Functional Tests:** Aim for high coverage of the computational logic in `*_tools.py`.
- **Functional Tests:** Aim for high coverage of the computational logic.
- **R Integration:** Be mindful that changes to R scripts may require updates to the Python wrapper logic that calls them.
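A sketch of the Python side of such a wrapper: the script path matches `src/wisecondorx/include/CBS.R` mentioned above, but the argument names here are hypothetical, not the project's actual interface:

```python
from pathlib import Path


def build_cbs_command(infile: Path, outfile: Path) -> list[str]:
    """Build the Rscript argv for the CBS step (flag names illustrative)."""
    script = Path("src/wisecondorx/include/CBS.R")
    return [
        "Rscript",
        str(script),
        "--infile",
        str(infile),
        "--outfile",
        str(outfile),
    ]


# Actually running it requires R plus DNAcopy/jsonlite, e.g.:
# subprocess.run(build_cbs_command(Path("in.json"), Path("out.json")), check=True)
```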

## Gemini CLI Specific Instructions

- **Environment Awareness:** Always prefer `pixi run <command>` over direct execution if a pixi task exists.
- **Binary Files:** When investigating issues related to `.npz` files, use `numpy.load` within a script or via `run_shell_command` to inspect the contents.
- **Surgical Updates:** When fixing bugs, prioritize fixing the underlying logic in `*_tools.py` and adding a corresponding test case.
- **Context Management:** The codebase uses multiple small modules. When working on a specific stage (e.g., `predict`), read both the `_control.py` and `_tools.py` for that stage to understand the full context.
- **Surgical Updates:** When fixing bugs, prioritize fixing the underlying logic in lower-level functions with minimal code changes, and add a corresponding test case.
6 changes: 0 additions & 6 deletions README.md
@@ -84,8 +84,6 @@ WisecondorX convert input.bam/cram output.npz [--optional arguments]
| `--binsize x` | Size per bin in bp; the reference bin size should be a multiple of this value. Note that this parameter does not impact the resolution, yet it can be used to optimize processing speed (default: x=5e3) |
| `--normdup` | Use this flag to avoid duplicate removal |

&rarr; Bash recipe at `docs/include/pipeline/convert.sh`

### Stage (2) Create reference

```bash
@@ -102,8 +100,6 @@ WisecondorX newref reference_input_dir/*.npz reference_output.npz [--optional ar
| `--plotyfrac x` | plots Y read fraction histogram and Gaussian mixture fit to file x, can help when setting `--yfrac` manually; software quits after plotting |
| `--cpus x` | Number of threads requested (default: x=1) |

&rarr; Bash recipe at `docs/include/pipeline/newref.sh`

### Stage (3) Predict copy number alterations

```bash
@@ -129,8 +125,6 @@ WisecondorX predict test_input.npz reference_input.npz output_id [--optional arg

<sup>**(\*)** At least one of these output formats should be selected</sup>

&rarr; Bash recipe at `docs/include/pipeline/predict.sh`

### Additional functionality

```bash
22 changes: 0 additions & 22 deletions docs/include/pipeline/convert.sh

This file was deleted.

19 changes: 0 additions & 19 deletions docs/include/pipeline/newref.sh

This file was deleted.

23 changes: 0 additions & 23 deletions docs/include/pipeline/predict.sh

This file was deleted.

4 changes: 2 additions & 2 deletions pixi.lock


6 changes: 3 additions & 3 deletions pyproject.toml
@@ -24,7 +24,7 @@ classifiers = [
"Topic :: Scientific/Engineering",
"Topic :: Scientific/Engineering :: Bio-Informatics",
]
requires-python = ">=3.10"
requires-python = ">=3.12"
dependencies = [
"scipy>=1.17",
"scikit-learn>=1.8",
@@ -57,7 +57,7 @@ version = {attr = "wisecondorx.__version__"}

[tool.ruff]
line-length = 79
target-version = "py310"
target-version = "py312"

[tool.ruff.lint]
ignore = ["E266", "E501", "C901"]
@@ -68,7 +68,7 @@ channels = ["conda-forge", "bioconda"]
platforms = ["linux-64", "osx-arm64"]

[tool.pixi.dependencies]
python = ">=3.10"
python = ">=3.12"
pysam = "0.23.*"
scipy = "1.17.*"
scikit-learn = "1.8.*"
163 changes: 163 additions & 0 deletions src/wisecondorx/convert.py
@@ -0,0 +1,163 @@
# WisecondorX

import logging
import sys
from pathlib import Path
from typing import Annotated

import numpy as np
import pysam
import typer

def wcx_convert(
    infile: Annotated[
        Path,
        typer.Argument(
            help="aligned reads input for conversion (.bam or .cram)",
            exists=True,
            file_okay=True,
            dir_okay=False,
            readable=True,
            resolve_path=True,
        ),
    ],
    prefix: Annotated[Path, typer.Argument(help="Output prefix")],
    reference: Annotated[
        Path | None,
        typer.Option(
            "-r",
            "--reference",
            help="Fasta reference to be used during cram conversion",
            exists=True,
            file_okay=True,
            dir_okay=False,
            readable=True,
            resolve_path=True,
        ),
    ] = None,
    binsize: Annotated[
        int, typer.Option("--binsize", help="Bin size (bp)")
    ] = 5000,
    normdup: Annotated[
        bool, typer.Option("--normdup", help="Do not remove duplicates")
    ] = False,
) -> None:
    """
    Convert and filter aligned reads to .npz format.
    """

    # check that infile exists and is a regular file
    if not (infile.exists() and infile.is_file()):
        logging.error(f"Input file {infile} does not exist or is not a file.")
        sys.exit(1)

    logging.info("Importing data ...")

    reads_per_chromosome_bin: dict[str, np.ndarray | None] = {
        str(i): None for i in range(1, 25)
    }

    reads_seen = 0
    reads_kept = 0
    reads_mapq = 0
    reads_rmdup = 0
    reads_pairf = 0

    logging.info("Converting aligned reads")

    mode = "rb" if infile.suffix == ".bam" else "rc"
    ref_filename = str(reference) if reference else None

    if mode == "rc" and not ref_filename:
        logging.error(
            "Cram inputs need a reference fasta provided through the '--reference' flag."
        )
        sys.exit(1)
Comment on lines +72 to +79

> **Copilot AI** (Mar 30, 2026): Input format detection treats any non-.bam file as CRAM (`mode = "rb" if ... else "rc"`). This misclassifies unsupported extensions and produces a misleading "Cram inputs need a reference" error. Add an explicit suffix check for .bam/.cram and error out for anything else.

    with pysam.AlignmentFile(
        infile, mode, reference_filename=ref_filename
    ) as reads_file:
        for index, chr in enumerate(reads_file.references):
            larp = -1
            larp2 = -1
            chr_name = chr
            if chr_name[:3].lower() == "chr":
                chr_name = chr_name[3:]

            if chr_name == "X":
                chr_name = "23"
            elif chr_name == "Y":
                chr_name = "24"

            if chr_name not in reads_per_chromosome_bin:
                continue

            logging.info(
                f"Working at {chr}; processing {int(reads_file.lengths[index] / float(binsize) + 1)} bins"
            )
            counts = np.zeros(
                int(reads_file.lengths[index] / float(binsize) + 1),
                dtype=np.int32,
            )
            bam_chr = reads_file.fetch(chr)

            for read in bam_chr:
                if read.is_paired:
                    if not read.is_proper_pair:
                        reads_pairf += 1
                        continue
                    if (
                        not normdup
                        and larp == read.pos
                        and larp2 == read.next_reference_start
                    ):
                        reads_rmdup += 1
                    else:
                        if read.mapping_quality >= 1:
                            location = read.pos / binsize
                            counts[int(location)] += 1
                        else:
                            reads_mapq += 1

                    larp2 = read.next_reference_start
                    reads_seen += 1
                    larp = read.pos
                else:
                    if not normdup and larp == read.pos:
                        reads_rmdup += 1
                    else:
                        if read.mapping_quality >= 1:
                            location = read.pos / binsize
                            counts[int(location)] += 1
                        else:
                            reads_mapq += 1

                    reads_seen += 1
                    larp = read.pos

            reads_per_chromosome_bin[chr_name] = counts
            reads_kept += sum(counts)

        # collect mapping statistics while the file handle is still open
        qual_info: dict[str, int] = {
            "mapped": reads_file.mapped,
            "unmapped": reads_file.unmapped,
            "no_coordinate": reads_file.nocoordinate,
            "filter_rmdup": reads_rmdup,
            "filter_mapq": reads_mapq,
            "pre_retro": reads_seen,
            "post_retro": reads_kept,
            "pair_fail": reads_pairf,
        }

    np.savez_compressed(
        Path(f"{prefix}.npz"),
        binsize=binsize,
        reads_per_bin=reads_per_chromosome_bin,
        quality=qual_info,
    )

    logging.info("Finished conversion")