PARCEL

A computational pipeline for analyzing sequencing reads generated from PARCEL experiments to identify regions of transcripts with RNA structural changes.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

Operating Systems

Supported Unix distributions

Ubuntu
CentOS
Red Hat Enterprise Linux (please use the CentOS packages and instructions)

Job scheduler

Univa Grid Engine
TORQUE Resource Manager

Tools or packages

perl >= 5.10
python >= 3.5.1 (for snakemake)
R >= 3.1.0
GNU parallel >= 20150222
GNU sort >= (GNU coreutils) 8.23
pigz >= 2.3.1
mawk >= 1.3.4
bedtools >= 2.25.0
snakemake >= 3.12.0
cutadapt >= 1.8.1
samtools >= 1.3.1
bowtie2 >= 2.2.4
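Before installing anything else, it can help to confirm which of the tools above are already on the PATH. The following is a minimal sketch, not part of the PARCEL scripts: the function name missing_tools is hypothetical, and the executable names (for example Rscript for R and parallel for GNU parallel) are assumptions about how the tools are usually installed. It only checks presence, not version numbers.

```shell
# Print the name of every tool that cannot be found on the PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example call (assumed executable names):
# missing_tools perl python3 Rscript parallel sort pigz mawk \
#   bedtools snakemake cutadapt samtools bowtie2
```

Any name printed by the example call still needs to be installed, or added to the PATH, before running the pipeline.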

Libraries or modules

Perl

IO::File
IO::Handle
List::Util
Math::Random

These modules can be installed with the following commands:

perl -MCPAN -e "install App::Cpan"
cpan -i IO::Handle IO::File Math::Random List::Util

R

argparse
adagio
data.table >= 1.10.0
edgeR
bedr >= 1.0.2

These packages can be installed with the following commands in R:

install.packages(c("argparse","adagio","bedr","data.table"));
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

Installing

Install snakemake

Install snakemake into a virtual environment

git clone https://bitbucket.org/snakemake/snakemake.git
cd snakemake
virtualenv -p python3 snakemake
source snakemake/bin/activate
python setup.py install

Download scripts and configuration files from GitHub and add the scripts directory to the PATH variable

git clone https://github.com/shenyang1981/PARCEL.git
cd PARCEL/; export PARCELSCRIPTS="${PWD}/scripts"; export PATH="${PARCELSCRIPTS}:$PATH"

You may want to add 'export PATH=${PARCELSCRIPTS}:$PATH' to your .bashrc file, together with the definition of PARCELSCRIPTS.

Prepare transcriptome and annotation file

The transcriptome file must be in FASTA format and indexed for Bowtie2. The following files are needed:

transcriptome.fas -- transcriptome file
transcriptome.size -- Length of each transcript in format: transcriptID{tab}Length
cdsinfo.txt -- The start and end position of CDS in transcript: transcriptID{tab}start{tab}end{tab}Length
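If transcriptome.size is not yet available, it can be derived from the FASTA file. The sketch below is one way to do this with awk and is not part of the PARCEL scripts; the function name fasta_lengths is hypothetical. It assumes the transcript ID is the first whitespace-delimited word of each header line, which is an assumption about how IDs should match the Bowtie2 index.

```shell
# Print "transcriptID<TAB>Length" for every record in a FASTA file.
# The ID is taken as the header text up to the first whitespace (assumption).
fasta_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # emit the previous record
      id = substr($1, 2)                # drop the leading ">"
      len = 0
      next
    }
    {
      gsub(/[ \t\r]/, "")               # ignore stray whitespace
      len += length($0)
    }
    END { if (id != "") print id "\t" len }
  ' "$1"
}

# Example call:
# fasta_lengths transcriptome.fas > transcriptome.size
```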

Put all files into one folder, for example "database/C.albican/", and build the Bowtie2 index from the transcriptome file.

cd database/C.albican/
bowtie2-build transcriptome.fas transcriptome

Prepare input files and sample information

sampleList.txt -- information about each sequenced library, including species (Species), library ID (LibID), condition or treatment (Condition), replicate (Replicates), sequencing batch (SeqBatch), experimental batch (ExperiementalBatch) and comparison batch (ComparisonBatch). Samples belonging to the same comparison batch are selected for pairwise comparison.

sampleList.txt is a tab-separated file in the following format:

Species	LibID	Condition	Replicates	SeqBatch	ExperiementalBatch	ComparisonBatch
Candida	V1_1	control	rep1	seq1	1	batch1
Candida	V1_2	control	rep2	seq1	1	batch1
Candida	V1_met_1	met	rep1	seq1	1	batch1
Candida	V1_met_2	met	rep2	seq1	1	batch1
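Because libraries are grouped by the ComparisonBatch column, it can be useful to list which conditions a given batch contains before choosing a control. The sketch below is an illustration only and is not part of the PARCEL scripts; the function name batch_conditions is hypothetical, and it assumes the column order shown in the example above (Condition in column 3, ComparisonBatch in column 7).

```shell
# List the distinct conditions found in one comparison batch,
# in order of first appearance. Line 1 of the file is the header.
batch_conditions() {
  awk -F'\t' -v batch="$2" \
    'NR > 1 && $7 == batch && !seen[$3]++ { print $3 }' "$1"
}

# Example call:
# batch_conditions input/sampleList.txt batch1
```

For the example table above, this call would list control and met.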

Note: LibID must be unique, because the corresponding sequence file must be named {LibID}.fastq.gz.

input reads files -- Reads are single-end. Each file must be named {LibID}.fastq.gz, where LibID is the same as in sampleList.txt. All reads files from the same sequencing batch must be placed in one folder named {SeqBatch}, as indicated in sampleList.txt. For example, the reads files "V1_1.fastq.gz", "V1_2.fastq.gz", "V1_met_1.fastq.gz" and "V1_met_2.fastq.gz" go into the folder "input/seq1/":

ls input/*
input/sampleList.txt

input/seq1:
V1_1.fastq.gz  V1_2.fastq.gz  V1_met_1.fastq.gz  V1_met_2.fastq.gz
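The naming rules above can be checked before starting a run. The following sketch is not part of the PARCEL scripts; the function name check_sample_list is hypothetical. It assumes the column order of the example sampleList.txt (LibID in column 2, SeqBatch in column 5) and that no field is empty. It reports duplicated LibIDs and reads files that are not found at {reads root}/{SeqBatch}/{LibID}.fastq.gz.

```shell
# Report duplicate LibIDs and missing reads files.
# $1: path to sampleList.txt, $2: root folder of the reads.
# Returns 0 when no problem is found, 1 otherwise.
check_sample_list() {
  sample_list=$1
  reads_root=${2%/}
  status=0
  # LibIDs that occur more than once (line 1 is the header).
  dups=$(awk -F'\t' 'NR > 1 && seen[$2]++ == 1 { print $2 }' "$sample_list")
  for id in $dups; do
    echo "duplicate LibID: $id"
    status=1
  done
  # Each library needs {reads_root}/{SeqBatch}/{LibID}.fastq.gz.
  while IFS="$(printf '\t')" read -r species lib cond rep seq rest; do
    [ "$lib" = "LibID" ] && continue   # skip the header line
    [ -z "$lib" ] && continue          # skip blank lines
    f="$reads_root/$seq/$lib.fastq.gz"
    if [ ! -f "$f" ]; then
      echo "missing reads file: $f"
      status=1
    fi
  done < "$sample_list"
  return $status
}

# Example call:
# check_sample_list input/sampleList.txt input/
```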

Generate config file

To generate a configuration file for snakemake, several variables need to be defined:

PARCELSCRIPTS: path to scripts used in pipeline
PARCELDB: path to folder where transcriptome files are
PARCELREADSROOT: path to root folder of sequenced reads
PARCELSAMPLEINFO: path to the sampleList.txt file
PARCELRESULTROOT: path to root folder of results
PARCELBATCH: batchID indicating which libraries should be selected
PARCELCONTROL: which condition should be used as control

The configuration file can be generated using the script generateConfigureFile.sh:

PARCELDB=database/C.albican/ PARCELREADSROOT=input/ PARCELSAMPLEINFO=input/sampleList.txt PARCELRESULTROOT=result/ PARCELBATCH=batch1 PARCELCONTROL=control generateConfigureFile.sh pipeline/config/conf.template.json > pipeline/config/conf.batch1.json
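Since the configuration depends on all seven variables listed above, a quick check that none of them is empty can catch mistakes early. The sketch below is an illustration only and is not part of the PARCEL scripts; the function name check_parcel_vars is hypothetical, and it assumes the variables are set in the current shell rather than passed as a prefix on the command line.

```shell
# Report which of the PARCEL variables are unset or empty.
# Returns 0 when all are set, 1 otherwise.
check_parcel_vars() {
  missing=""
  for name in PARCELSCRIPTS PARCELDB PARCELREADSROOT PARCELSAMPLEINFO \
              PARCELRESULTROOT PARCELBATCH PARCELCONTROL; do
    eval "value=\${$name:-}"           # indirect lookup of the variable
    [ -n "$value" ] || missing="$missing $name"
  done
  if [ -n "$missing" ]; then
    echo "unset variables:$missing"
    return 1
  fi
}

# Example call, after exporting the variables:
# check_parcel_vars && generateConfigureFile.sh pipeline/config/conf.template.json > pipeline/config/conf.batch1.json
```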

The generated configuration file is pipeline/config/conf.batch1.json.

The files should now be organized as follows:

.
├── database
│   └── C.albican
│       ├── cdsinfo.txt
│       ├── transcriptome.1.bt2
│       ├── transcriptome.2.bt2
│       ├── transcriptome.3.bt2
│       ├── transcriptome.4.bt2
│       ├── transcriptome.fas
│       ├── transcriptome.rev.1.bt2
│       ├── transcriptome.rev.2.bt2
│       └── transcriptome.size
├── input
│   ├── sampleList.txt
│   └── seq1
│       ├── V1_1.fastq.gz
│       ├── V1_2.fastq.gz
│       ├── V1_met_1.fastq.gz
│       └── V1_met_2.fastq.gz
├── LICENSE.md
├── pipeline
│   ├── config
│   │   ├── conf.batch1.json
│   │   └── conf.template.json
│   └── parcel.sk
├── README.md
└── scripts
    ├── BamToPosCount.sh
    ├── bedGraphTrack.pl
    ├── definedVariable.sh
    ├── differential_Regions.R
    ├── differential_Sites.R
    ├── extractCoverageInfo.R
    ├── filtered_Regions.R
    ├── filterInspection.R
    ├── generateConfigureFile.sh
    ├── mapReadsToTranscriptom.sh
    ├── mergeCoverage.R
    ├── parallel_cutadpt.sh
    ├── parse_bam_best_parallel_random.sh
    ├── parse_bam_best_random.pl
    ├── parseCutAptLog.pl
    ├── qualityCheck.R
    ├── reshapeTable.R
    ├── runsnake.sh
    ├── splitBam.mawk
    └── sumBowtieMapResult.pl

Running Pipeline

Local Mode

The pipeline can be run in local mode with the configuration file:

source ${pathtosnakemake}/snakemake/bin/activate
snakemake -s pipeline/parcel.sk --configfile pipeline/config/conf.batch1.json -j 32

Submit snakemake jobs to Cluster

Alternatively, submit the pipeline to the job scheduler:

runsnake.sh pipeline/parcel.sk conf.batch1.json testjob 24 24

Results

After the pipeline finishes, results are stored in the "result/" folder. The main output files are:

combined_met_output2_wfilters.txt -- Candidate regions.
combined_met_covinfo.xls -- Coverage information.

.
├── bedgraphs
│   └── Transcriptome
│       └── batch1
│           ├── control__V1_1__seq1__nor.bedgraph.gz
│           ├── control__V1_2__seq1__nor.bedgraph.gz
│           ├── met__V1_met_1__seq1__nor.bedgraph.gz
│           └── met__V1_met_2__seq1__nor.bedgraph.gz
├── coverageInfo
│   └── Transcriptome
│       └── seq1
│           ├── V1_1.cov.txt.gz
│           ├── V1_2.cov.txt.gz
│           ├── V1_met_1.cov.txt.gz
│           └── V1_met_2.cov.txt.gz
├── mappedResult
│   └── Transcriptome
│       ├── batch1
│       │   └── mapSummary.txt
│       └── seq1
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_1.trim.fastq.genome_mapping.log
│           ├── V1_1.trim.fastq.genome_mapping.summary
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_2.trim.fastq.genome_mapping.log
│           ├── V1_2.trim.fastq.genome_mapping.summary
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_1.trim.fastq.genome_mapping.log
│           ├── V1_met_1.trim.fastq.genome_mapping.summary
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_2.trim.fastq.genome_mapping.log
│           └── V1_met_2.trim.fastq.genome_mapping.summary
├── PARCEL
│   └── Transcriptome
│       └── batch1
│           ├── allcov.txt.gz
│           ├── allcov.wide.min2.txt.gz
│           ├── combined_met_covinfo.xls
│           ├── combined_met_output2_wfilters.Rdata
│           ├── combined_met_output2_wfilters.txt
│           ├── combined_v1all.Rdata
│           ├── covinfo_met.Rdata
│           ├── edgeR_met_sf.Rdata
│           ├── etTable_met.Rdata
│           └── fastq2_met_output10.Rdata
├── qualityCheck
│   └── batch1
│       ├── all.Rdata
│       └── processingSummary.xls
└── trimmedFastq
    ├── batch1
    │   └── trimSummary.txt
    └── seq1
        ├── read.trim.V1_1.log
        ├── read.trim.V1_1.log.sum
        ├── read.trim.V1_2.log
        ├── read.trim.V1_2.log.sum
        ├── read.trim.V1_met_1.log
        ├── read.trim.V1_met_1.log.sum
        ├── read.trim.V1_met_2.log
        ├── read.trim.V1_met_2.log.sum
        ├── V1_1.trim.fastq.gz
        ├── V1_2.trim.fastq.gz
        ├── V1_met_1.trim.fastq.gz
        └── V1_met_2.trim.fastq.gz
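To confirm that a run produced its main outputs, the two files named above can be checked directly. The sketch below is not part of the PARCEL scripts; the function name check_parcel_results is hypothetical. The path pattern {result root}/PARCEL/Transcriptome/{batch}/combined_{condition}_... is an assumption inferred from the example tree above, where the condition is met and the batch is batch1.

```shell
# Check that the candidate-region and coverage files exist and are non-empty.
# $1: result root, $2: comparison batch, $3: treatment condition.
# Returns 0 when both files are present, 1 otherwise.
check_parcel_results() {
  result_root=${1%/}
  batch=$2
  condition=$3
  dir="$result_root/PARCEL/Transcriptome/$batch"
  status=0
  for f in "combined_${condition}_output2_wfilters.txt" \
           "combined_${condition}_covinfo.xls"; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $dir/$f"
    else
      echo "missing or empty: $dir/$f"
      status=1
    fi
  done
  return $status
}

# Example call:
# check_parcel_results result/ batch1 met
```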

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

Miao Sun - Original Author and Development
Yang Shen - Snakemake support

Contact

Please contact us if you find bugs, have suggestions, need help etc. You can either use our mailing list or send us an email:

PARCEL is developed at the Genome Institute of Singapore.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.
