This repository was archived by the owner on Dec 1, 2023. It is now read-only.

Make use of refgenie populate plugin #1

Open · wants to merge 13 commits into `master`
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
*/pipeline_results/*
pipeline_results/*
21 changes: 14 additions & 7 deletions README.md
@@ -1,6 +1,12 @@
# pep-cwl

This repository explores how to run PEP-formatted samples through a CWL pipeline. There are two examples: the [simple demo](/simple_demo), which just runs `wc` on a few text files as input, and a [bioinformatics_demo](/bioinformatics_demo), which runs a basic `bowtie2` alignment on some sequencing reads.
## Motivation

One common task in bioinformatics is to run a bunch of samples independently through a workflow. Often, samples are listed in a CSV sample table with one row per sample. We'd like to be able to easily run a CWL workflow on each row of the sample table.

One sample table CSV standard is [PEP](http://pep.databio.org), which specifies structure for sample metadata. This repository demonstrates how to run a PEP metadata table through a CWL pipeline.

The [simple demo](/simple_demo) runs `wc` on a few text files as input. The [bioinformatics_demo](/bioinformatics_demo) runs a `bwa` alignment on some sequencing data.

## Simple demo

@@ -12,15 +18,17 @@

Here is a [CWL tool description](simple_demo/wc-tool.cwl) that runs `wc` to count lines, words, and characters in an input file:

```
cwl-runner wc-tool.cwl wc-job.yml
```
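
The job file supplies the tool's inputs. The actual `wc-job.yml` is not reproduced in this diff, but a minimal CWL job file for a single input might look like the sketch below (the input id `file1` and the data path are assumptions, not taken from this repo):

```yaml
# Hypothetical wc-job.yml: binds one input file to the wc tool.
# The input id "file1" is assumed; check wc-tool.cwl for the real name.
file1:
  class: File
  path: data/demo1.txt   # placeholder path under simple_demo/data
```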

This runs the tool for a single input. How can we run it across multiple samples listed in a CSV file? CWL has built-in scatter functionality, but we want a simpler way to drive the workflow directly from a CSV sample table.
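
For comparison, the built-in CWL approach would be a wrapper workflow that scatters the tool over an array of files, roughly as sketched below. This is not part of this repo, and the input/output ids of `wc-tool.cwl` are assumptions; note that you would still have to assemble the file array yourself, which is exactly the step a CSV sample table handles for you.

```yaml
# Sketch of CWL's native scatter (illustrative only; ids are assumed).
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  input_files: File[]        # one entry per sample, assembled by hand
outputs:
  counts:
    type: File[]
    outputSource: count/output
steps:
  count:
    run: wc-tool.cwl
    scatter: file1           # assumed input id of wc-tool.cwl
    in:
      file1: input_files
    out: [output]            # assumed output id of wc-tool.cwl
```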

### PEP-formatted sample metadata

Our sample data is stored in a [sample table](simple_demo/file_list.csv) with two samples, each with an input file in the [data](simple_demo/data) subdirectory. This sample table along with the [config file](simple_demo/project_config.yaml) together make up a standard PEP (see [pep.databio.org](http://pep.databio.org) for formal spec).
Our sample data is stored in a [sample table](simple_demo/file_list.csv) with two rows, one per sample. Each row points to a corresponding input file in the [data](simple_demo/data) subdirectory. This sample table along with the [project config file](simple_demo/project_config.yaml) together make up a standard PEP (see [pep.databio.org](http://pep.databio.org) for formal spec).
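
For orientation, a PEP like this boils down to two small files. The simple demo's actual file contents are not reproduced in this diff, but the shape is roughly the following (column names and values are illustrative, not copied from the repo):

```yaml
# Sketch of simple_demo/project_config.yaml (the real file may include more settings)
pep_version: 2.0.0
sample_table: file_list.csv
# file_list.csv, shown here as a comment, has one row per sample, e.g.:
#   sample_name,file
#   sample1,data/demo1.txt
#   sample2,data/demo2.txt
```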

We'd like to run our CWL workflow/tool on each of these samples, which means running it once per row in the sample table. We can accomplish this with [looper](http://looper.databio.org), which is an arbitrary command runner for PEP-formatted sample data. From a CWL perspective, looper is a *tabular scatterer* -- it will scatter a CWL workflow across each row in a sample table independently.
We'd like to run our CWL workflow/tool on each of these samples, which means running it once per row in the sample table. We can accomplish this with [looper](http://looper.databio.org). From a CWL perspective, looper is a *tabular scatterer* -- it will scatter a CWL workflow across each row in a sample table independently.

### Using looper

Looper uses a [pipeline interface](simple_demo/cwl_interface.yaml) to describe how to run `cwl-runner`. In this interface we've simply specified a `command_template:`, which looks like the above CWL command: `cwl-runner {pipeline.path} {sample.yaml_file}`. This command template uses two variables to construct the command: the `{pipeline.path}` refers to `wc-tool.cwl`, pointed to in the `path` attribute in the pipeline interface file. Looper also automatically creates a `yaml` file representing each sample, and the path is accessed with `{sample.yaml_file}`.
Looper uses a [pipeline interface](simple_demo/cwl_interface.yaml) to describe how to run `cwl-runner`. In this interface we've simply specified a `command_template:`, which looks like the CWL command above: `cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}`. Here `{sample.sample_yaml_cwl}` is a variable that looper populates with the path to a `yaml` file it creates for each sample; that file is produced by the `looper.write_sample_yaml_cwl` item listed in the `pre_submit` section of the pipeline interface.
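
The simple demo's interface file is not shown in this diff, but based on the bwa interface added later in this PR, it would look something like this sketch (the `pipeline_name` and the exact path handling are assumptions):

```yaml
# Sketch of simple_demo/cwl_interface.yaml, modeled on bwa_cwl_interface.yaml below.
pipeline_name: wc_demo
pipeline_type: sample
pre_submit:
  python_functions:
    - looper.write_sample_yaml_cwl   # writes the per-sample YAML consumed by cwl-runner
command_template: >
  cwl-runner {looper.piface_dir}/wc-tool.cwl {sample.sample_yaml_cwl}
```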

To run these commands, invoke `looper run`, passing the project configuration file, like this:

@@ -30,9 +38,8 @@

```
looper run project_config.yaml
```

This will run the `cwl-runner wc-tool.cwl ...` command on *each row in the sample table*. While there is also a built-in CWL approach to scatter workflows, there are a few nice things about the looper approach:

- you get all the benefits of PEP project formatting. PEPs are a completely independent specification for describing sample metadata, complete with an [independent validation platform called eido](http://eido.databio.org). PEP also provides powerful portability features like *derived attributes*, and *implied attributes*, which make it easier for you to use a single sample table that works across multiple pipelines and computing environments. PEP also provides project-level features: in a project config file, you can use *imports* to define a hierarchy of project settings, and *amendments* to design projects with similar sub-projects (such as a re-run of a particular sample table with slightly different parameters; or an exact re-run on a separate sample table). Finally, because PEP is independent, and not tied to a specific pipeline framework, your sample annotations are likely to be reusable across other pipelines; for instance, Snakemake can natively read a PEP-formatted project, so someone could take your data as input directly into a Snakemake workflow as well.

- looper provides a CLI with lots of other nice features for job management, outlined below:
- **PEP framework benefits**. PEPs are a third-party specification for describing sample metadata, complete with a [validation platform called eido](http://eido.databio.org). PEP provides powerful portability features like [derived attributes](http://pep.databio.org/en/latest/specification/#sample-modifier-derive), and [implied attributes](http://pep.databio.org/en/latest/specification/#sample-modifier-imply), which adjust sample attributes on-the-fly, and [project config imports](http://pep.databio.org/en/latest/specification/#project-modifier-import) and [project config amendments](http://pep.databio.org/en/latest/specification/#project-modifier-amend) to re-use project components and define sub-projects. Finally, because PEP is not tied to a specific pipeline framework, your sample annotations are reusable across other pipelines; for instance, Snakemake can also natively read a PEP-formatted project.
- looper provides a CLI with lots of other nice features for job management, outlined in the [looper docs](http://looper.databio.org/en/latest/features/).

## Bioinformatics demo

136 changes: 136 additions & 0 deletions bioinformatics_demo/bwa-tool.cwl
@@ -0,0 +1,136 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: "quay.io/biocontainers/bwa:0.7.17--hed695b0_7"

inputs:
  InputFile1:
    type: File
    inputBinding:
      position: 201

  InputFile2:
    type: File
    inputBinding:
      position: 202

  Index:
    type: File
    inputBinding:
      position: 200
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa

  # Optional arguments

  Threads:
    type: int?
    inputBinding:
      prefix: "-t"

  MinSeedLen:
    type: int?
    inputBinding:
      prefix: "-k"

  BandWidth:
    type: int?
    inputBinding:
      prefix: "-w"

  ZDropoff:
    type: int?
    inputBinding:
      prefix: "-d"

  SeedSplitRatio:
    type: float?
    inputBinding:
      prefix: "-r"

  MaxOcc:
    type: int?
    inputBinding:
      prefix: "-c"

  MatchScore:
    type: int?
    inputBinding:
      prefix: "-A"

  MmPenalty:
    type: int?
    inputBinding:
      prefix: "-B"

  GapOpenPen:
    type: int?
    inputBinding:
      prefix: "-O"

  GapExtPen:
    type: int?
    inputBinding:
      prefix: "-E"

  ClipPen:
    type: int?
    inputBinding:
      prefix: "-L"

  UnpairPen:
    type: int?
    inputBinding:
      prefix: "-U"

  RgLine:
    type: string?
    inputBinding:
      prefix: "-R"

  VerboseLevel:
    type: int?
    inputBinding:
      prefix: "-v"

  isOutSecAlign:
    type: boolean?
    inputBinding:
      prefix: "-a"

  isMarkShortSplit:
    type: boolean?
    inputBinding:
      prefix: "-M"

  isUseHardClip:
    type: boolean?
    inputBinding:
      prefix: "-H"

  isMultiplexedPair:
    type: boolean?
    inputBinding:
      prefix: "-p"


baseCommand: [bwa, mem]

stdout: unsorted_reads.sam

outputs:
  reads_stdout:
    type: stdout

$namespaces:
  edam: http://edamontology.org/
$schemas:
  - http://edamontology.org/EDAM_1.18.owl
12 changes: 12 additions & 0 deletions bioinformatics_demo/bwa_cwl_interface.yaml
@@ -0,0 +1,12 @@
pipeline_name: bwa_alignment
pipeline_type: sample
input_schema: bwa_input_schema.yaml
var_templates:
  main: "{looper.piface_dir}/bwa-tool.cwl"
  refgenie_config: "$REFGENIE"
pre_submit:
  python_functions:
    - refgenconf.looper_refgenie_populate
    - looper.write_sample_yaml_cwl
command_template: >
  cwl-runner {pipeline.var_templates.main} {sample.sample_yaml_cwl}
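
The two `pre_submit` hooks do the heavy lifting here: `refgenconf.looper_refgenie_populate` rewrites `refgenie://` registry paths in sample attributes into local asset paths using the refgenie config pointed to by `$REFGENIE`, and `looper.write_sample_yaml_cwl` then writes each sample out as a CWL-style job YAML. The per-sample result would look roughly like the sketch below (paths and asset layout are illustrative assumptions, not taken from this repo):

```yaml
# Sketch of the per-sample YAML looper hands to cwl-runner for sample1.
# Attributes listed in the input schema's `files` section become CWL File objects;
# the Index path is a placeholder for the populated refgenie bwa_index asset.
sample_name: sample1
protocol: RNA-seq
organism: human
genome: t7
InputFile1:
  class: File
  path: data/sample1_1.fq.gz
InputFile2:
  class: File
  path: data/sample1_2.fq.gz
Index:
  class: File
  path: /genomes/alias/t7/bwa_index/default/t7.fa   # placeholder local path
```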
40 changes: 40 additions & 0 deletions bioinformatics_demo/bwa_input_schema.yaml
@@ -0,0 +1,40 @@
description: A PEP schema for NGS samples to be aligned with the bwa CWL pipeline
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        organism:
          type: string
          description: "Organism"
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        InputFile1:
          type: string
          description: "Fastq file for read 1"
        InputFile2:
          type: string
          description: "Fastq file for read 2 (for paired-end experiments)"
        Index:
          type: string
          description: "Path to the bwa index file folder"
      required:
        - sample_name
        - protocol
        - InputFile1
        - genome
      required_files:
        - InputFile1
      files:
        - InputFile1
        - InputFile2
        - Index
required:
  - samples
4 changes: 2 additions & 2 deletions bioinformatics_demo/demo_sample_table.csv
@@ -1,3 +1,3 @@
sample_name,protocol,organism,read1,read2
sample1,bowtie2,human,FQ1,FQ2
sample2,bowtie2,human,FQ1,FQ2
sample1,RNA-seq,human,FQ1,FQ2
sample2,RNA-seq,human,FQ1,FQ2
15 changes: 10 additions & 5 deletions bioinformatics_demo/pep_bio.yaml
@@ -2,17 +2,22 @@

pep_version: 2.0.0
sample_table: demo_sample_table.csv
sample_modifiers:
  append:
    pipeline_interfaces: bt2_cwl_interface.yaml
    Index: RG1
    pipeline_interfaces: bwa_cwl_interface.yaml
  duplicate:
    read1: InputFile1
    read2: InputFile2
  derive:
    attributes: [read1, read2]
    attributes: [read1, read2, Index, InputFile1, InputFile2]
    sources:
      FQ1: "bioinformatics_demo/data/{sample_name}_1.fq.gz"
      FQ2: "bioinformatics_demo/data/{sample_name}_2.fq.gz"
      FQ1: "data/{sample_name}_1.fq.gz"
      FQ2: "data/{sample_name}_2.fq.gz"
      RG1: "refgenie://{genome}/bwa_index"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        genome: t7

looper:
  output_dir: pipeline_results
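
Tracing `sample1` through these sample modifiers is a useful check of how the pieces fit together. The sketch below shows roughly how its attributes would resolve under the configuration in this PR, before the `pre_submit` hooks run (illustrative only; the exact attribute set peppy produces may differ):

```yaml
# Illustrative view of sample1 after the PEP sample modifiers are applied:
sample_name: sample1
protocol: RNA-seq
organism: human
genome: t7                        # from imply (organism is "human")
InputFile1: data/sample1_1.fq.gz  # duplicate of read1, derived via the FQ1 source
InputFile2: data/sample1_2.fq.gz  # duplicate of read2, derived via the FQ2 source
Index: refgenie://t7/bwa_index    # appended as RG1, derived via the refgenie:// source
```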