-
Notifications
You must be signed in to change notification settings - Fork 0
Unify Gowinda enrichment rules #6
Description
My initial approach is trying to create method specific outputs so they all share the same format, so we can generate unified gowinda enrichment analysis outputs. However, because of the report wrapper I don't think we can unify those rules (HTML/plot rules), since we need need wildcards for labels/titles.
So here is my current suggestion:
prep_*_enrichment (method-specific)
│
▼
candidate.snps.tsv ← standardized format: CHROM\tPOS
total.snps.tsv ← standardized format: CHROM\tPOS
│
▼
run_gowinda (generic, 1 rule)
│
▼
gowinda.enrichment.tsv
│
▼
existing HTML/plot rules (unchanged)
Example:
current rule in enrichment_selscan_gowinda in 4.1_positive_selection_selscan.smk
rule enrichment_selscan_gowinda:
input:
gowinda=...,
go2gene=...,
gtf=...,
candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
total=expand("...{ppl}.chr{i}.biallelic.snps.vcf.gz", ...),
output:
candidate_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
total_snps="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
enrichment="...{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
shell:
"""
# SNP extraction (method-specific)
sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps}
for i in {input.total}; do
bcftools query -f "%CHROM\\t%POS\\n" $i
done > {output.total_snps}
# GOWInDA execution (identical across all methods)
java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
--snp-file {output.total_snps} \
...
"""
rule prep_selscan_enrichment:
input:
candidates=rules.annotate_selscan_candidates.output.annotated_candidates,
total=expand(
"results/polarized_data/{species}/1pop/{ppl}/{ppl}.chr{i}.biallelic.snps.vcf.gz",
i=main_config["chromosomes"],
allow_missing=True,
),
output:
candidate_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.candidate.snps.tsv",
total_snps="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.total.snps.tsv",
log:
"logs/positive_selection/prep_selscan_enrichment.{species}.{ppl}.{method}.maf_{maf}.top_{cutoff}.log",
conda:
"../envs/selscape-env.yaml"
shell:
"""
sed '1d' {input.candidates} | awk '{{print "chr"$1"\\t"$2}}' > {output.candidate_snps} 2> {log}
for i in {input.total}; do
bcftools query -f "%CHROM\\t%POS\\n" $i
done > {output.total_snps} 2>> {log}
"""
adding a generic GOWInDA rule
rule run_gowinda:
input:
gowinda=rules.download_gowinda.output.gowinda,
go2gene=rules.convert_ncbi_go.output.go2gene,
gtf=rules.convert_ncbi_gtf.output.gtf,
candidate_snps="{prefix}.candidate.snps.tsv",
total_snps="{prefix}.total.snps.tsv",
output:
enrichment="{prefix}.gowinda.enrichment.tsv",
resources:
mem_gb=32,
cpus=8,
log:
"logs/enrichment/run_gowinda.{prefix}.log",
conda:
"../envs/selscape-env.yaml"
shell:
"""
java -Xmx{resources.mem_gb}g -jar {input.gowinda} \
--snp-file {input.total_snps} \
--candidate-snp-file {input.candidate_snps} \
--gene-set-file {input.go2gene} \
--annotation-file {input.gtf} \
--simulations 1000000 \
--min-significance 1 \
--gene-definition gene \
--threads {resources.cpus} \
--output-file {output.enrichment} \
--mode gene \
--min-genes 1 > {log} 2>&1 || true
sed -i '1iGO_ID\\tavg_genes_sim\\tgenes_found\\tp_value\\tp_adjusted\\tgenes_uniq\\tgenes_max\\tgenes_total\\tdescription\\tgene_list' {output.enrichment} 2>> {log}
"""
The {prefix} wildcard matches everything before the suffix (e.g., .candidate.snps.tsv), so Snakemake can use one generic rule to process files from any method regardless of their path structure.
The downstream rules use Snakemake's report() wrapper which requires method-specific wildcards:
rule selscan_enrichment_results_table_html:
input:
tsv="results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.tsv",
output:
html=report(
"results/positive_selection/selscan/{species}/1pop/{ppl}/{method}_{maf}/{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.gowinda.enrichment.html",
category="Positive Selection",
subcategory="{method}", # ← needs wildcards
labels=lambda wildcards: selscan_labels(
wildcards, type="Enrichment Table"
), # ← needs wildcards
),
params:
title=lambda w: f"{w.ppl} - {w.method.upper()} Enrichment", # ← needs wildcards
log:
"logs/positive_selection/selscan_enrichment_results_table_html.{species}.{ppl}.normalized.{method}.maf_{maf}.top_{cutoff}.log",
conda:
"../envs/selscape-env.yaml"
script:
"../scripts/tsv2html.R"