Skip to content

Unify Gowinda enrichment rules #6

@nomis-c

Description

@nomis-c

My initial approach is trying to create method specific outputs so they all share the same format, so we can generate unified gowinda enrichment analysis outputs. However, because of the report wrapper I don't think we can unify those rules (HTML/plot rules), since we need need wildcards for labels/titles.

So here is my current suggestion:

prep_*_enrichment (method-specific)


candidate.snps.tsv ← standardized format: CHROM\tPOS
total.snps.tsv ← standardized format: CHROM\tPOS


run_gowinda (generic, 1 rule)


gowinda.enrichment.tsv


existing HTML/plot rules (unchanged)

Example:
current rule in enrichment_selscan_gowinda in 4.1_positive_selection_selscan.smk

rule enrichment_selscan_gowinda:
    input:
        gowinda=...,
        go2gene=...,
        gtf=...,
        candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
        total=expand("...{ppl}.chr{i}.biallelic.snps.vcf.gz", ...),
    output:
        candidate_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
        total_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
        enrichment="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
    shell:
        """
        # SNP extraction (method-specific)
        sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps}
        for i in {input.total}; do
            bcftools query -f "%CHROM\\t%POS\\n" $i
        done > {output.total_snps}

        # GOWInDA execution (identical across all methods)
        java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
            --snp-file {output.total_snps} \
            ...
        """
rule prep_selscan_enrichment:
    input:
        candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
        total=expand(
            "results/polarized_data/{species}/1pop/{ppl}/{ppl}.chr{i}.biallelic.snps.vcf.gz",
            i=main_config["chromosomes"],
            allow_missing=True,
        ),
    output:
        candidate_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
        total_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
    log:
        "logs/positive_selection/prep_selscan_enrichment.{species}.{ppl}.{method}.maf_{maf}.top_{cutoff}.log",
    conda:
        "../envs/selscape-env.yaml"
    shell:
        """
        sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps} 2> {log}

        for i in {input.total}; do
            bcftools query -f "%CHROM\\t%POS\\n" $i
        done > {output.total_snps} 2>> {log}
        """

adding a generic GOWInDA rule

rule run_gowinda:
    input:
        gowinda=rules.download_gowinda.output.gowinda,
        go2gene=rules.convert_ncbi_go.output.go2gene,
        gtf=rules.convert_ncbi_gtf.output.gtf,
        candidate_snps="{prefix}.candidate.snps.tsv",
        total_snps="{prefix}.total.snps.tsv",
    output:
        enrichment="{prefix}.gowinda.enrichment.tsv",
    resources:
        mem_gb=32,
        cpus=8,
    log:
        "logs/enrichment/run_gowinda.{prefix}.log",
    conda:
        "../envs/selscape-env.yaml"
    shell:
        """
        java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
            --snp-file {input.total_snps} \
            --candidate-snp-file {input.candidate_snps} \
            --gene-set-file {input.go2gene} \
            --annotation-file {input.gtf} \
            --simulations 1000000 \
            --min-significance 1 \
            --gene-definition gene \
            --threads {resources.cpus} \
            --output-file {output.enrichment} \
            --mode gene \
            --min-genes 1 > {log} 2>&1 || true

        sed -i '1iGO_ID\\tavg_genes_sim\\tgenes_found\\tp_value\\tp_adjusted\\tgenes_uniq\\tgenes_max\\tgenes_total\\tdescription\\tgene_list' {output.enrichment} 2>> {log}
        """

The {prefix} wildcard matches everything before the suffix (e.g., .candidate.snps.tsv), so Snakemake can use one generic rule to process files from any method regardless of their path structure.

The downstream rules use Snakemake's report() wrapper which requires method-specific wildcards:

rule selscan_enrichment_results_table_html:
    input:
        tsv="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
    output:
        html=report(
            "results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.html",
            category="Positive Selection",
            subcategory="{method}",  # ← needs wildcards
            labels=lambda wildcards: selscan_labels(
                wildcards, type="Enrichment Table"
            ),  # ← needs wildcards
        ),
    params:
        title=lambda w: f"{w.ppl} - {w.method.upper()} Enrichment",  # ← needs wildcards
    log:
        "logs/positive_selection/selscan_enrichment_results_table_html.{species}.{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.log",
    conda:
        "../envs/selscape-env.yaml"
    script:
        "../scripts/tsv2html.R"

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions