This repository was archived by the owner on Dec 1, 2023. It is now read-only.

Make use of refgenie populate plugin #1

Open · wants to merge 13 commits into `master`
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
*/pipeline_results/*
pipeline_results/*
21 changes: 14 additions & 7 deletions README.md
@@ -1,6 +1,12 @@
# pep-cwl

This repository explores how to run PEP-formatted samples through a CWL pipeline. There are two examples: the [simple demo](/simple_demo), which just runs `wc` on a few text files as input, and a [bioinformatics_demo](/bioinformatics_demo), which runs a basic `bowtie2` alignment on some sequencing reads.
## Motivation

One common task in bioinformatics is to run a bunch of samples independently through a workflow. Often, samples are listed in a CSV sample table with one row per sample. We'd like to be able to easily run a CWL workflow on each row of the sample table.

One sample table CSV standard is [PEP](http://pep.databio.org), which specifies structure for sample metadata. This repository demonstrates how to run a PEP metadata table through a CWL pipeline.

The [simple demo](/simple_demo) runs `wc` on a few text files as input. The [bioinformatics_demo](/bioinformatics_demo) runs a `bwa` alignment on some sequencing data.

## Simple demo

@@ -12,15 +18,17 @@

Here is a [CWL tool description](simple_demo/wc-tool.cwl) that runs `wc` to count lines, words, and characters in an input file:

```
cwl-runner wc-tool.cwl wc-job.yml
```
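
The job file supplies the tool's inputs. The actual `wc-job.yml` is not reproduced in this diff, but a minimal CWL job file for a single input might look like the sketch below (the input id `file1` and the data path are assumptions, not taken from this repo):

```yaml
# Hypothetical wc-job.yml: binds one input file to the wc tool.
# The input id "file1" is assumed; check wc-tool.cwl for the real name.
file1:
  class: File
  path: data/demo1.txt   # placeholder path under simple_demo/data
```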

This runs the tool for a single input. How can we run it across multiple samples listed in a CSV file? CWL has built-in scatter functionality, but we want a simpler way to drive the workflow directly from a CSV sample table.
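
For comparison, the built-in CWL approach would be a wrapper workflow that scatters the tool over an array of files, roughly as sketched below. This is not part of this repo, and the input/output ids of `wc-tool.cwl` are assumptions; note that you would still have to assemble the file array yourself, which is exactly the step a CSV sample table handles for you.

```yaml
# Sketch of CWL's native scatter (illustrative only; ids are assumed).
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}
inputs:
  input_files: File[]        # one entry per sample, assembled by hand
outputs:
  counts:
    type: File[]
    outputSource: count/output
steps:
  count:
    run: wc-tool.cwl
    scatter: file1           # assumed input id of wc-tool.cwl
    in:
      file1: input_files
    out: [output]            # assumed output id of wc-tool.cwl
```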

### PEP-formatted sample metadata

Our sample data is stored in a [sample table](simple_demo/file_list.csv) with two samples, each with an input file in the [data](simple_demo/data) subdirectory. This sample table along with the [config file](simple_demo/project_config.yaml) together make up a standard PEP (see [pep.databio.org](http://pep.databio.org) for formal spec).
Our sample data is stored in a [sample table](simple_demo/file_list.csv) with two rows, one per sample. Each row points to a corresponding input file in the [data](simple_demo/data) subdirectory. This sample table along with the [project config file](simple_demo/project_config.yaml) together make up a standard PEP (see [pep.databio.org](http://pep.databio.org) for formal spec).
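
For orientation, a PEP like this boils down to two small files. The simple demo's actual file contents are not reproduced in this diff, but the shape is roughly the following (column names and values are illustrative, not copied from the repo):

```yaml
# Sketch of simple_demo/project_config.yaml (the real file may include more settings)
pep_version: 2.0.0
sample_table: file_list.csv
# file_list.csv, shown here as a comment, has one row per sample, e.g.:
#   sample_name,file
#   sample1,data/demo1.txt
#   sample2,data/demo2.txt
```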

We'd like to run our CWL workflow/tool on each of these samples, which means running it once per row in the sample table. We can accomplish this with [looper](http://looper.databio.org), which is an arbitrary command runner for PEP-formatted sample data. From a CWL perspective, looper is a *tabular scatterer* -- it will scatter a CWL workflow across each row in a sample table independently.
We'd like to run our CWL workflow/tool on each of these samples, which means running it once per row in the sample table. We can accomplish this with [looper](http://looper.databio.org). From a CWL perspective, looper is a *tabular scatterer* -- it will scatter a CWL workflow across each row in a sample table independently.

### Using looper

Looper uses a [pipeline interface](simple_demo/cwl_interface.yaml) to describe how to run `cwl-runner`. In this interface we've simply specified a `command_template:`, which looks like the above CWL command: `cwl-runner {pipeline.path} {sample.yaml_file}`. This command template uses two variables to construct the command: the `{pipeline.path}` refers to `wc-tool.cwl`, pointed to in the `path` attribute in the pipeline interface file. Looper also automatically creates a `yaml` file representing each sample, and the path is accessed with `{sample.yaml_file}`.
Looper uses a [pipeline interface](simple_demo/cwl_interface.yaml) to describe how to run `cwl-runner`. In this interface we've simply specified a `command_template:`, which looks like the CWL command above: `cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}`. Here `{sample.sample_yaml_cwl}` is a variable that looper populates with the path to a `yaml` file it creates for each sample; that file is produced by the `looper.write_sample_yaml_cwl` item listed in the `pre_submit` section of the pipeline interface.
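
The simple demo's interface file is not shown in this diff, but based on the bwa interface added later in this PR, it would look something like this sketch (the `pipeline_name` and the exact path handling are assumptions):

```yaml
# Sketch of simple_demo/cwl_interface.yaml, modeled on bwa_cwl_interface.yaml below.
pipeline_name: wc_demo
pipeline_type: sample
pre_submit:
  python_functions:
    - looper.write_sample_yaml_cwl   # writes the per-sample YAML consumed by cwl-runner
command_template: >
  cwl-runner {looper.piface_dir}/wc-tool.cwl {sample.sample_yaml_cwl}
```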

To run these commands, invoke `looper run`, passing the project configuration file, like this:

@@ -30,9 +38,8 @@

```
looper run project_config.yaml
```

This will run the `cwl-runner wc-tool.cwl ...` command on *each row in the sample table*. While there is also a built-in CWL approach to scatter workflows, there are a few nice things about the looper approach:

- you get all the benefits of PEP project formatting. PEPs are a completely independent specification for describing sample metadata, complete with an [independent validation platform called eido](http://eido.databio.org). PEP also provides powerful portability features like *derived attributes*, and *implied attributes*, which make it easier for you to use a single sample table that works across multiple pipelines and computing environments. PEP also provides project-level features: in a project config file, you can use *imports* to define a hierarchy of project settings, and *amendments* to design projects with similar sub-projects (such as a re-run of a particular sample table with slightly different parameters; or an exact re-run on a separate sample table). Finally, because PEP is independent, and not tied to a specific pipeline framework, your sample annotations are likely to be reusable across other pipelines; for instance, Snakemake can natively read a PEP-formatted project, so someone could take your data as input directly into a Snakemake workflow as well.

- looper provides a CLI with lots of other nice features for job management, outlined below:
- **PEP framework benefits**. PEPs are a third-party specification for describing sample metadata, complete with a [validation platform called eido](http://eido.databio.org). PEP provides powerful portability features like [derived attributes](http://pep.databio.org/en/latest/specification/#sample-modifier-derive), and [implied attributes](http://pep.databio.org/en/latest/specification/#sample-modifier-imply), which adjust sample attributes on-the-fly, and [project config imports](http://pep.databio.org/en/latest/specification/#project-modifier-import) and [project config amendments](http://pep.databio.org/en/latest/specification/#project-modifier-amend) to re-use project components and define sub-projects. Finally, because PEP is not tied to a specific pipeline framework, your sample annotations are reusable across other pipelines; for instance, Snakemake can also natively read a PEP-formatted project.
- looper provides a CLI with lots of other nice features for job management, outlined in the [looper docs](http://looper.databio.org/en/latest/features/).

## Bioinformatics demo

136 changes: 136 additions & 0 deletions bioinformatics_demo/bwa-tool.cwl
@@ -0,0 +1,136 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool

requirements:
  DockerRequirement:
    dockerPull: "quay.io/biocontainers/bwa:0.7.17--hed695b0_7"

inputs:
  InputFile1:
    type: File
    inputBinding:
      position: 201

  InputFile2:
    type: File
    inputBinding:
      position: 202

  Index:
    type: File
    inputBinding:
      position: 200
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa

  # Optional arguments

  Threads:
    type: int?
    inputBinding:
      prefix: "-t"

  MinSeedLen:
    type: int?
    inputBinding:
      prefix: "-k"

  BandWidth:
    type: int?
    inputBinding:
      prefix: "-w"

  ZDropoff:
    type: int?
    inputBinding:
      prefix: "-d"

  SeedSplitRatio:
    type: float?
    inputBinding:
      prefix: "-r"

  MaxOcc:
    type: int?
    inputBinding:
      prefix: "-c"

  MatchScore:
    type: int?
    inputBinding:
      prefix: "-A"

  MmPenalty:
    type: int?
    inputBinding:
      prefix: "-B"

  GapOpenPen:
    type: int?
    inputBinding:
      prefix: "-O"

  GapExtPen:
    type: int?
    inputBinding:
      prefix: "-E"

  ClipPen:
    type: int?
    inputBinding:
      prefix: "-L"

  UnpairPen:
    type: int?
    inputBinding:
      prefix: "-U"

  RgLine:
    type: string?
    inputBinding:
      prefix: "-R"

  VerboseLevel:
    type: int?
    inputBinding:
      prefix: "-v"

  isOutSecAlign:
    type: boolean?
    inputBinding:
      prefix: "-a"

  isMarkShortSplit:
    type: boolean?
    inputBinding:
      prefix: "-M"

  isUseHardClip:
    type: boolean?
    inputBinding:
      prefix: "-H"

  isMultiplexedPair:
    type: boolean?
    inputBinding:
      prefix: "-p"


baseCommand: [bwa, mem]

stdout: unsorted_reads.sam

outputs:
  reads_stdout:
    type: stdout

$namespaces:
  edam: http://edamontology.org/
$schemas:
  - http://edamontology.org/EDAM_1.18.owl
12 changes: 12 additions & 0 deletions bioinformatics_demo/bwa_cwl_interface.yaml
@@ -0,0 +1,12 @@
pipeline_name: bwa_alignment
pipeline_type: sample
input_schema: bwa_input_schema.yaml
var_templates:
  main: "{looper.piface_dir}/bwa-tool.cwl"
  refgenie_config: "$REFGENIE"
pre_submit:
  python_functions:
    - refgenconf.looper_refgenie_populate
    - looper.write_sample_yaml_cwl
command_template: >
  cwl-runner {pipeline.var_templates.main} {sample.sample_yaml_cwl}
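
The two `pre_submit` hooks do the heavy lifting here: `refgenconf.looper_refgenie_populate` rewrites `refgenie://` registry paths in sample attributes into local asset paths using the refgenie config pointed to by `$REFGENIE`, and `looper.write_sample_yaml_cwl` then writes each sample out as a CWL-style job YAML. The per-sample result would look roughly like the sketch below (paths and asset layout are illustrative assumptions, not taken from this repo):

```yaml
# Sketch of the per-sample YAML looper hands to cwl-runner for sample1.
# Attributes listed in the input schema's `files` section become CWL File objects;
# the Index path is a placeholder for the populated refgenie bwa_index asset.
sample_name: sample1
protocol: RNA-seq
organism: human
genome: t7
InputFile1:
  class: File
  path: data/sample1_1.fq.gz
InputFile2:
  class: File
  path: data/sample1_2.fq.gz
Index:
  class: File
  path: /genomes/alias/t7/bwa_index/default/t7.fa   # placeholder local path
```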
40 changes: 40 additions & 0 deletions bioinformatics_demo/bwa_input_schema.yaml
@@ -0,0 +1,40 @@
description: A PEP schema for NGS samples to be aligned with the bwa CWL pipeline
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        organism:
          type: string
          description: "Organism"
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        InputFile1:
          type: string
          description: "Fastq file for read 1"
        InputFile2:
          type: string
          description: "Fastq file for read 2 (for paired-end experiments)"
        Index:
          type: string
          description: "Path to the bwa index file folder"
      required:
        - sample_name
        - protocol
        - InputFile1
        - genome
      required_files:
        - InputFile1
      files:
        - InputFile1
        - InputFile2
        - Index
required:
  - samples
4 changes: 2 additions & 2 deletions bioinformatics_demo/demo_sample_table.csv
@@ -1,3 +1,3 @@
sample_name,protocol,organism,read1,read2
sample1,bowtie2,human,FQ1,FQ2
sample2,bowtie2,human,FQ1,FQ2
sample1,RNA-seq,human,FQ1,FQ2
sample2,RNA-seq,human,FQ1,FQ2
15 changes: 10 additions & 5 deletions bioinformatics_demo/pep_bio.yaml
@@ -2,17 +2,22 @@

pep_version: 2.0.0
sample_table: demo_sample_table.csv
sample_modifiers:
  append:
    pipeline_interfaces: bt2_cwl_interface.yaml
    Index: RG1
    pipeline_interfaces: bwa_cwl_interface.yaml
  duplicate:
    read1: InputFile1
    read2: InputFile2
  derive:
    attributes: [read1, read2]
    attributes: [read1, read2, Index, InputFile1, InputFile2]
    sources:
      FQ1: "bioinformatics_demo/data/{sample_name}_1.fq.gz"
      FQ2: "bioinformatics_demo/data/{sample_name}_2.fq.gz"
      FQ1: "data/{sample_name}_1.fq.gz"
      FQ2: "data/{sample_name}_2.fq.gz"
      RG1: "refgenie://{genome}/bwa_index"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        genome: t7

looper:
  output_dir: pipeline_results
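
Tracing `sample1` through these sample modifiers is a useful check of how the pieces fit together. The sketch below shows roughly how its attributes would resolve under the configuration in this PR, before the `pre_submit` hooks run (illustrative only; the exact attribute set peppy produces may differ):

```yaml
# Illustrative view of sample1 after the PEP sample modifiers are applied:
sample_name: sample1
protocol: RNA-seq
organism: human
genome: t7                        # from imply (organism is "human")
InputFile1: data/sample1_1.fq.gz  # duplicate of read1, derived via the FQ1 source
InputFile2: data/sample1_2.fq.gz  # duplicate of read2, derived via the FQ2 source
Index: refgenie://t7/bwa_index    # appended as RG1, derived via the refgenie:// source
```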