Unify Gowinda enrichment rules

My initial approach is trying to create method specific outputs so they all share the same format, so we can generate unified gowinda enrichment analysis outputs. However, because of the report wrapper I don't think we can unify those rules (HTML/plot rules), since we need need wildcards for labels/titles.


So here is my current suggestion:

prep_*_enrichment (method-specific)
         │
         ▼
   candidate.snps.tsv   ← standardized format: CHROM\tPOS
   total.snps.tsv       ← standardized format: CHROM\tPOS
         │
         ▼
   run_gowinda (generic, 1 rule)
         │
         ▼
   gowinda.enrichment.tsv
         │
         ▼
   existing HTML/plot rules (unchanged)


Example: 
current rule in  `enrichment_selscan_gowinda` in `4.1_positive_selection_selscan.smk`


```
rule enrichment_selscan_gowinda:
    input:
        gowinda=...,
        go2gene=...,
        gtf=...,
        candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
        total=expand("...{ppl}.chr{i}.biallelic.snps.vcf.gz", ...),
    output:
        candidate_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
        total_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
        enrichment="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
    shell:
        """
        # SNP extraction (method-specific)
        sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps}
        for i in {input.total}; do
            bcftools query -f "%CHROM\\t%POS\\n" $i
        done > {output.total_snps}

        # GOWInDA execution (identical across all methods)
        java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
            --snp-file {output.total_snps} \
            ...
        """
```

```
rule prep_selscan_enrichment:
    input:
        candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
        total=expand(
            "results/polarized_data/{species}/1pop/{ppl}/{ppl}.chr{i}.biallelic.snps.vcf.gz",
            i=main_config["chromosomes"],
            allow_missing=True,
        ),
    output:
        candidate_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
        total_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
    log:
        "logs/positive_selection/prep_selscan_enrichment.{species}.{ppl}.{method}.maf_{maf}.top_{cutoff}.log",
    conda:
        "../envs/selscape-env.yaml"
    shell:
        """
        sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps} 2> {log}

        for i in {input.total}; do
            bcftools query -f "%CHROM\\t%POS\\n" $i
        done > {output.total_snps} 2>> {log}
        """
```

adding a generic GOWInDA rule

```
rule run_gowinda:
    input:
        gowinda=rules.download_gowinda.output.gowinda,
        go2gene=rules.convert_ncbi_go.output.go2gene,
        gtf=rules.convert_ncbi_gtf.output.gtf,
        candidate_snps="{prefix}.candidate.snps.tsv",
        total_snps="{prefix}.total.snps.tsv",
    output:
        enrichment="{prefix}.gowinda.enrichment.tsv",
    resources:
        mem_gb=32,
        cpus=8,
    log:
        "logs/enrichment/run_gowinda.{prefix}.log",
    conda:
        "../envs/selscape-env.yaml"
    shell:
        """
        java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
            --snp-file {input.total_snps} \
            --candidate-snp-file {input.candidate_snps} \
            --gene-set-file {input.go2gene} \
            --annotation-file {input.gtf} \
            --simulations 1000000 \
            --min-significance 1 \
            --gene-definition gene \
            --threads {resources.cpus} \
            --output-file {output.enrichment} \
            --mode gene \
            --min-genes 1 > {log} 2>&1 || true

        sed -i '1iGO_ID\\tavg_genes_sim\\tgenes_found\\tp_value\\tp_adjusted\\tgenes_uniq\\tgenes_max\\tgenes_total\\tdescription\\tgene_list' {output.enrichment} 2>> {log}
        """
```
The {prefix} wildcard matches everything before the suffix (e.g., .candidate.snps.tsv), so Snakemake can use one generic rule to process files from any method regardless of their path structure.

The downstream rules use Snakemake's report() wrapper which requires method-specific wildcards:

```
rule selscan_enrichment_results_table_html:
    input:
        tsv="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
    output:
        html=report(
            "results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.html",
            category="Positive Selection",
            subcategory="{method}",  # ← needs wildcards
            labels=lambda wildcards: selscan_labels(
                wildcards, type="Enrichment Table"
            ),  # ← needs wildcards
        ),
    params:
        title=lambda w: f"{w.ppl} - {w.method.upper()} Enrichment",  # ← needs wildcards
    log:
        "logs/positive_selection/selscan_enrichment_results_table_html.{species}.{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.log",
    conda:
        "../envs/selscape-env.yaml"
    script:
        "../scripts/tsv2html.R"
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify Gowinda enrichment rules #6

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Unify Gowinda enrichment rules #6

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions