Nextflow pipeline to generate data for nf-core/test-datasets gwas branch #1610

Status: Open. Wants to merge 6 commits into base branch gwas.

19 changes: 19 additions & 0 deletions .gitignore
@@ -0,0 +1,19 @@
# Nextflow logs and metadata
*.nextflow.log*
/.nextflow/

# SLURM or scheduler logs
test-datasets-*.out
test-datasets-*.err

# Shell scripts (optional, if not versioning run.sh)
*.sh

# Work and results/vcfs
/work/
/results/vcfs/

# Nextflow temporary execution files
*.command.*
*.Rout
*.tmp
36 changes: 21 additions & 15 deletions README.md
@@ -23,26 +23,32 @@ git clone -b gwas --single-branch git@github.com:USERNAME/test-datasets.git

## Documentation

nf-core/test-datasets comes with documentation in the `docs/` directory and scripts to generate the example data in the `scripts/` directory.
This test data comes from the 1000 Genomes Project phase 3 release of variant calls. Each VCF file has been 'chunked' to include only its first 4,500 variants to reduce file size. Chromosome Y is excluded. Please see the dataset's [README](https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/README_phase3_callset_20150220) for more details. Covariates and phenotypes were randomly generated for each sample in the VCF.

nf-core/test-datasets comes with documentation in the `docs/` directory, and the data can be generated by running `main.nf`.

## Example data organisation
nf-core/test-datasets generated test data is located in the `data/` directory.
The test data generated for nf-core/test-datasets is located in the `results/` directory, which has the following structure:

```
.
├── data_phenotypes_and_covariates
│   ├── example1.covar
│   └── example1.pheno
├── data_shrink_chunk_4500
│   ├── chr10.vcf.bgz
│   ├── chr10.vcf.bgz.tbi
│   ├── chr11.vcf.bgz
│   ├── chr11.vcf.bgz.tbi
└── data_shrink_combined_4500
├── chr1_to_22_and_X.vcf.bgz
└── chr1_to_22_and_X.vcf.bgz.tbi
results/
├── chunked_vcfs/
│   ├── chr1_chunked.vcf.gz
│   ├── chr1_chunked.vcf.gz.tbi
│   ├── chr2_chunked.vcf.gz
│   ├── chr2_chunked.vcf.gz.tbi
│   ├── ...
│   ├── chrX_chunked.vcf.gz
│   ├── chrX_chunked.vcf.gz.tbi
│   ├── combined_chunked.vcf.gz
│   └── combined_chunked.vcf.gz.tbi
└── pheno_cov/
    ├── example.pheno
    └── example.covar

```
Each chromosome-specific VCF file (`chr*_chunked.vcf.gz`) is accompanied by its tabix index (`.vcf.gz.tbi`), which enables efficient region queries. A combined VCF and its index are also included for downstream association tests or visualization.
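
Since every VCF is expected to ship with an index, a quick consistency check can catch a missing `.tbi` before the data is used. The following is a minimal sketch rather than part of the pipeline; the helper name `check_indexes` is hypothetical, and it only looks at file names without validating the index contents.

```shell
# Report any *.vcf.gz file in a directory that lacks a matching .tbi index.
# Prints the unindexed files and returns non-zero if any are found.
check_indexes() {
    dir="$1"
    missing=0
    for vcf in "$dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue        # the glob matched nothing
        if [ ! -e "$vcf.tbi" ]; then
            echo "missing index: $vcf"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the directory from the tree above:
# check_indexes results/chunked_vcfs
```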


## Support

Binary file removed data/data_shrink_chunk_4500/chr1.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr1.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr10.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr10.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr11.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr11.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr12.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr12.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr13.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr13.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr14.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr14.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr15.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr15.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr16.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr16.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr17.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr17.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr18.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr18.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr19.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr19.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr2.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr2.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr20.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr20.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr21.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr21.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr22.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr22.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr3.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr3.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr4.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr4.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr5.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr5.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr6.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr6.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr7.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr7.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr8.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr8.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr9.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chr9.vcf.bgz.tbi
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chrX.vcf.bgz
Binary file not shown.
Binary file removed data/data_shrink_chunk_4500/chrX.vcf.bgz.tbi
Binary file not shown.
Binary file not shown.
29 changes: 29 additions & 0 deletions main.nf
@@ -0,0 +1,29 @@
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

include { GENERATE_EXAMPLE_GENOTYPES_VCFS } from './modules/generate_example_genotypes_vcfs.nf'
include { CHUNK_VCFS } from './modules/chunk_vcfs.nf'
include { CONCAT_CHUNKED_VCFS } from './modules/concat_chunked_vcfs.nf'
include { EXTRACT_SAMPLE_IDS } from './modules/extract_sample_ids.nf'
include { GENERATE_PHENO_COV } from './modules/generate_pheno_cov.nf'
include { INDEX_CHUNKED_VCFS } from './modules/index_chunked_vcfs.nf'

workflow {
    // Download the 1000 Genomes phase 3 VCFs (chr1-22 and chrX)
    GENERATE_EXAMPLE_GENOTYPES_VCFS()

    // Pair each VCF with its chromosome label, e.g. "ALL.chr1.phase3_..." -> "chr1"
    def vcfs_with_chr = GENERATE_EXAMPLE_GENOTYPES_VCFS.out.vcfs
        .flatten()
        .map { vcf ->
            def chr = vcf.name.toString().split("\\.")[1] // safer than `tokenize`
            tuple(chr, vcf)
        }

    // Feed the tuples into the chunking process
    CHUNK_VCFS(vcfs_with_chr)
    CHUNK_VCFS.out.chunked_vcfs.collect().set { all_chunked_vcfs }
    CONCAT_CHUNKED_VCFS(all_chunked_vcfs)

    // Take the chr1 chunk from the process output rather than from the published
    // results directory, which may not exist yet when the workflow starts
    def chr1_ch = CHUNK_VCFS.out.chunked_vcfs
        .filter { vcf -> vcf.name == 'chr1_chunked.vcf.gz' }
    EXTRACT_SAMPLE_IDS(chr1_ch)
    GENERATE_PHENO_COV(EXTRACT_SAMPLE_IDS.out.sample_ids)
    INDEX_CHUNKED_VCFS(CHUNK_VCFS.out.chunked_vcfs)
}
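
The workflow derives each chromosome label from the second dot-separated field of the downloaded file name (`split("\\.")[1]`). The same rule can be sanity-checked outside Nextflow with a small shell sketch; `chr_from_filename` is a hypothetical helper used only for illustration.

```shell
# Print the second dot-separated field of a file's base name,
# which for the 1000 Genomes release files is the chromosome label.
chr_from_filename() {
    basename "$1" | cut -d. -f2
}

chr_from_filename "ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
# prints: chr1
```

This relies on the `ALL.<chr>.<rest>` naming pattern; a file name without that leading `ALL.` field would yield a different token.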
16 changes: 16 additions & 0 deletions modules/chunk_vcfs.nf
@@ -0,0 +1,16 @@
process CHUNK_VCFS {
    container "community.wave.seqera.io/library/bcftools_tabix_pip_tools:48085064a9189d8c"
    publishDir params.outdir_chunked_vcfs, mode: 'copy'

    input:
    tuple val(chr), path(vcfs)

    output:
    path("${chr}_chunked.vcf.gz"), emit: chunked_vcfs

    script:
    """
    # Keep all header lines plus the first 4,500 variant records
    bcftools view ${vcfs} | awk 'BEGIN {h=1; n=4500} /^#/ {print; next} {if (h <= n) {print; h++}}' | bgzip > ${chr}_chunked.vcf.gz
    tabix -p vcf ${chr}_chunked.vcf.gz
    """
}
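
The `awk` filter in this process passes every header line through and then stops printing after `n` variant records. A minimal sketch of the same filter on a tiny uncompressed VCF shows the behaviour without needing `bcftools` or `bgzip`; the helper name `chunk_vcf`, the three made-up records and `n=2` are assumptions for illustration, whereas the process itself uses `n=4500`.

```shell
# Keep every header line plus the first n variant records of a VCF file.
chunk_vcf() {
    awk -v n="$1" 'BEGIN {h = 1} /^#/ {print; next} {if (h <= n) {print; h++}}' "$2"
}

# Build a toy VCF: 2 header lines followed by 3 variant records.
vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$vcf"
printf '1\t100\t.\tA\tG\t.\t.\t.\n1\t200\t.\tC\tT\t.\t.\t.\n1\t300\t.\tG\tA\t.\t.\t.\n' >> "$vcf"

chunked=$(mktemp)
chunk_vcf 2 "$vcf" > "$chunked"
grep -vc '^#' "$chunked"   # prints 2 (variant records kept)
```

Note that the filter keeps reading its input after the limit is reached, so the run time still scales with the size of the full VCF.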
20 changes: 20 additions & 0 deletions modules/concat_chunked_vcfs.nf
@@ -0,0 +1,20 @@
process CONCAT_CHUNKED_VCFS {
    container "community.wave.seqera.io/library/bcftools_tabix_pip_tools:48085064a9189d8c"
    publishDir params.outdir_chunked_vcfs, mode: 'copy'

    input:
    path vcf_files

    output:
    path "combined_chunked.vcf.gz"
    path "combined_chunked.vcf.gz.tbi"

    script:
    """
    # Record the input files for debugging
    echo "VCFs to concat:" > concat_debug.txt
    ls -lh ${vcf_files} >> concat_debug.txt

    # Concatenate the per-chromosome chunks in the order given, then index the result
    bcftools concat -Oz -o combined_chunked.vcf.gz ${vcf_files.join(' ')}
    tabix -p vcf combined_chunked.vcf.gz
    """
}
15 changes: 15 additions & 0 deletions modules/extract_sample_ids.nf
@@ -0,0 +1,15 @@
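process EXTRACT_SAMPLE_IDS {
    container "community.wave.seqera.io/library/r-base:4.4.3--1e564c44feffeaa0"
    publishDir params.outdir_pheno_cov, mode: 'symlink'

    input:
    path vcf_file

    output:
    path "sample_ids.txt", emit: sample_ids

    script:
    """
    # Sample IDs start at column 10 of the #CHROM header line; write one per line.
    # Backslashes are doubled so the shell, not Groovy, sees the tr escape sequences.
    zcat ${vcf_file} | grep '^#CHROM' | cut -f10- | tr '\\t' '\\n' > sample_ids.txt
    """
}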
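
The extraction relies on the VCF layout in which the `#CHROM` header line carries nine fixed columns followed by one column per sample. A minimal sketch of the same pipeline on a hand-written header, skipping the `zcat` step; the helper name `extract_sample_ids` and the two-sample header are assumptions for illustration.

```shell
# Read an uncompressed VCF on stdin and print one sample ID per line.
# Columns 1-9 are the fixed VCF fields, so sample IDs begin at column 10.
extract_sample_ids() {
    grep '^#CHROM' | cut -f10- | tr '\t' '\n'
}

printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\n' \
    | extract_sample_ids
# prints:
# HG00096
# HG00097
```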
16 changes: 16 additions & 0 deletions modules/generate_example_genotypes_vcfs.nf
@@ -0,0 +1,16 @@
process GENERATE_EXAMPLE_GENOTYPES_VCFS {
    container "community.wave.seqera.io/library/bcftools_tabix_pip_tools:48085064a9189d8c"
    publishDir params.outdir_vcfs, mode: 'symlink'

    output:
    path "*.vcf.gz", emit: vcfs

    script:
    """
    base_url="https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502"
    # Autosomes share one file-name pattern; --fail aborts on an HTTP error
    # instead of saving the error page as a .vcf.gz file
    for chr in {1..22}; do
        fname="ALL.chr\${chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
        curl --fail -O "\${base_url}/\${fname}"
    done
    # chrX uses a different version tag (v1c)
    curl --fail -O "\${base_url}/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1c.20130502.genotypes.vcf.gz"
    """
}
38 changes: 38 additions & 0 deletions modules/generate_pheno_cov.nf
@@ -0,0 +1,38 @@
process GENERATE_PHENO_COV {
    container "community.wave.seqera.io/library/r-base:4.4.3--1e564c44feffeaa0"
    publishDir params.outdir_pheno_cov, mode: 'copy'

    input:
    path sample_ids

    output:
    path "example.pheno"
    path "example.covar"

    script:
    """
    #!/usr/bin/env Rscript
    # Simulate a phenotype and two covariates for each sample.
    # The approach follows this tutorial on simulating data for a regression analysis:
    # https://aosmith.rbind.io/2018/08/29/getting-started-simulating-data/
    # The simulated values are used slightly differently here, but should be good
    # enough to actually get some results.
    ids <- readLines("${sample_ids}")
    n <- length(ids)
    set.seed(16)
    y <- rnorm(n = n, mean = 0, sd = 1)
    x1 <- runif(n = n, min = 1, max = 2)
    x2 <- runif(n = n, min = 200, max = 300)

    # Build one phenotype table and one covariate table.
    # First column: unique sample IDs; second column: family IDs (here identical
    # to the sample IDs); remaining columns: phenotype or covariate values.
    example.pheno <- data.frame(ids = ids, fam = ids, pheno = y)
    example.covar <- data.frame(ids = ids, fam = ids, cov1 = x1, cov2 = x2)

    # Write to tab-delimited files without headers or row names
    write.table(example.pheno, file = "example.pheno", sep = "\\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
    write.table(example.covar, file = "example.covar", sep = "\\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
    """
}
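
The phenotype file written by this process has three tab-separated columns and no header: sample ID, family ID (identical to the sample ID here) and the phenotype value. A minimal sketch of a format check for such a file; the helper name `check_pheno` is hypothetical and the check is not part of the pipeline.

```shell
# Succeed only if every row has exactly 3 tab-separated fields
# and the family ID (column 2) repeats the sample ID (column 1).
check_pheno() {
    awk -F'\t' 'NF != 3 || $1 != $2 { bad = 1 } END { exit (bad ? 1 : 0) }' "$1"
}

# Usage, assuming the default output directory from nextflow.config:
# check_pheno results/pheno_cov/example.pheno && echo "format OK"
```

The covariate file has four columns, so the field count would need adjusting before applying the same check to `example.covar`.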
15 changes: 15 additions & 0 deletions modules/index_chunked_vcfs.nf
@@ -0,0 +1,15 @@
process INDEX_CHUNKED_VCFS {
    container "community.wave.seqera.io/library/bcftools_tabix_pip_tools:48085064a9189d8c"
    publishDir params.outdir_chunked_vcfs, mode: 'copy'

    input:
    path vcf_files

    output:
    path "*.vcf.gz.tbi", emit: indexed_vcfs

    script:
    """
    tabix -p vcf ${vcf_files}
    """
}
11 changes: 11 additions & 0 deletions nextflow.config
@@ -0,0 +1,11 @@
params {
    outdir_base = "results"
    outdir_vcfs = "${params.outdir_base}/vcfs"
    outdir_chunked_vcfs = "${params.outdir_base}/chunked_vcfs"
    outdir_pheno_cov = "${params.outdir_base}/pheno_cov"
}

singularity {
    enabled = true
    autoMounts = true
}
Binary file added results/chunked_vcfs/chr10_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr10_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr11_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr11_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr12_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr12_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr13_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr13_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr14_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr14_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr15_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr15_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr16_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr16_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr17_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr17_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr18_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr18_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr19_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr19_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr1_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr1_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr20_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr20_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr21_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr21_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr22_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr22_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr2_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr2_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr3_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr3_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr4_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr4_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr5_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr5_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr6_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr6_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr7_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr7_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr8_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr8_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chr9_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chr9_chunked.vcf.gz.tbi
Binary file not shown.
Binary file added results/chunked_vcfs/chrX_chunked.vcf.gz
Binary file not shown.
Binary file added results/chunked_vcfs/chrX_chunked.vcf.gz.tbi
Binary file not shown.
Binary file not shown.
Binary file added results/chunked_vcfs/combined_chunked.vcf.gz.tbi
Binary file not shown.
1 change: 1 addition & 0 deletions results/pheno_cov/sample_ids.txt
42 changes: 0 additions & 42 deletions scripts/generate-example-data-pheno-and-covar.R

This file was deleted.

36 changes: 0 additions & 36 deletions scripts/generate-example-genotype-vcfs.sh

This file was deleted.