PARCEL

A computational pipeline for analyzing sequencing reads generated by PARCEL experiments to identify regions of transcripts with RNA structural changes.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

Operating Systems

Supported Unix distributions

  • Ubuntu
  • CentOS
  • Red Hat Enterprise Linux (please use the CentOS packages and instructions)

Job scheduler

  • Univa Grid Engine
  • TORQUE Resource Manager

Tools or packages

Libraries or modules

Perl
  • IO::File
  • IO::Handle
  • List::Util
  • Math::Random

These modules can be installed with the following commands:

perl -MCPAN -e "install App::Cpan"
cpan -i IO::Handle IO::File Math::Random List::Util
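
After installation, it may be worth confirming that the current perl can load each module. The sketch below is not part of PARCEL: check_perl_modules is a hypothetical helper that tries to load every module named on its command line and reports the ones that fail.

```shell
# Hypothetical helper (not shipped with PARCEL): report which Perl modules
# the current perl can load. Returns non-zero if any module is missing.
check_perl_modules() {
    ret=0
    for mod in "$@"; do
        # "perl -MName -e 1" loads the module and runs an empty program.
        if perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "ok       $mod"
        else
            echo "MISSING  $mod"
            ret=1
        fi
    done
    return "$ret"
}
```

For the modules listed above, it would be called as: check_perl_modules IO::File IO::Handle List::Util Math::Random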
R

The required R packages can be installed with the following commands in R:

install.packages(c("argparse","adagio","bedr","data.table"));
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

Installing

Install snakemake

Install snakemake into a virtual environment

git clone https://bitbucket.org/snakemake/snakemake.git
cd snakemake
virtualenv -p python3 snakemake
source snakemake/bin/activate
python setup.py install

Download the scripts and configuration files from GitHub and add the scripts directory to the PATH variable

git clone https://github.com/shenyang1981/PARCEL.git
cd PARCEL/; export PARCELSCRIPTS="${PWD}/scripts"; export PATH="${PARCELSCRIPTS}:$PATH"

To make this permanent, consider adding 'export PATH=${PARCELSCRIPTS}:$PATH' to your .bashrc file, together with a line that sets PARCELSCRIPTS.
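
One way to persist these settings is sketched below. persist_parcel_path is a hypothetical helper, not part of PARCEL; it appends the two export lines to a startup file and uses a marker comment (also an assumption) so that running it twice does not add them twice.

```shell
# Hypothetical helper (not shipped with PARCEL): append the PARCEL exports
# to a shell startup file, once.
#   $1 = absolute path to the PARCEL scripts directory
#   $2 = startup file to edit (defaults to ~/.bashrc)
persist_parcel_path() {
    scripts_dir="$1"
    rc_file="${2:-$HOME/.bashrc}"
    marker="# PARCEL scripts"
    # Skip if a previous run already added the block.
    if [ -f "$rc_file" ] && grep -qF "$marker" "$rc_file"; then
        return 0
    fi
    {
        echo "$marker"
        echo "export PARCELSCRIPTS=\"$scripts_dir\""
        echo 'export PATH="${PARCELSCRIPTS}:$PATH"'
    } >> "$rc_file"
}
```

From inside the cloned PARCEL directory, this could be called as: persist_parcel_path "${PWD}/scripts"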

Prepare transcriptome and annotation file

The transcriptome file must be in FASTA format and indexed for Bowtie2. The following files are required:

  • transcriptome.fas -- transcriptome file
  • transcriptome.size -- Length of each transcript in format: transcriptID{tab}Length
  • cdsinfo.txt -- The start and end position of CDS in transcript: transcriptID{tab}start{tab}end{tab}Length
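
If transcriptome.size does not exist yet, it can be derived from the FASTA file. The following is a minimal sketch, not a script from this repository: fasta_to_sizes is a hypothetical helper, and it assumes that the transcript ID is the header text up to the first whitespace, so that IDs match the reference names reported by the aligner.

```shell
# Hypothetical helper (not shipped with PARCEL): print "transcriptID<TAB>Length"
# for every record in a FASTA file. Sequences may span multiple lines.
fasta_to_sizes() {
    awk '
        /^>/ {
            # A new header: emit the previous record, then start a new one.
            if (id != "") print id "\t" len
            id = substr($1, 2)   # assumption: ID ends at the first whitespace
            len = 0
            next
        }
        {
            gsub(/[ \t\r]/, "")  # ignore stray whitespace and CR characters
            len += length($0)
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

Usage: fasta_to_sizes transcriptome.fas > transcriptome.size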

Put all files into one folder, for example "database/C.albican/", then build the Bowtie2 index from the transcriptome file.

cd database/C.albican/
bowtie2-build transcriptome.fas transcriptome
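
A quick check that the index was written can save a failed run later. The sketch below is an illustration, not part of PARCEL: check_bt2_index is a hypothetical helper that looks for the six .bt2 files that bowtie2-build produces for a small index. Very large references produce files with a .bt2l extension instead, which this sketch does not handle.

```shell
# Hypothetical helper (not shipped with PARCEL): verify that all six files of
# a small Bowtie2 index exist and are non-empty.
#   $1 = index prefix, e.g. database/C.albican/transcriptome
check_bt2_index() {
    prefix="$1"
    ret=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -s "${prefix}.${suffix}.bt2" ]; then
            echo "missing or empty: ${prefix}.${suffix}.bt2" >&2
            ret=1
        fi
    done
    return "$ret"
}
```

Usage: check_bt2_index database/C.albican/transcriptome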

Prepare input files and sample information

  • sampleList.txt -- information on each sequenced library: species (Species), library ID (LibID), condition or treatment (Condition), replicate (Replicates), sequencing batch (SeqBatch), experimental batch (ExperiementalBatch), and comparison batch (ComparisonBatch). Samples belonging to the same comparison batch are selected for pairwise comparison.

sampleList.txt has the following format:

Species LibID Condition Replicates SeqBatch ExperiementalBatch ComparisonBatch
Candida V1_1 control rep1 seq1 1 batch1
Candida V1_2 control rep2 seq1 1 batch1
Candida V1_met_1 met rep1 seq1 1 batch1
Candida V1_met_2 met rep2 seq1 1 batch1

Note: LibID must be unique, because the corresponding sequence file must be named {LibID}.fastq.gz.
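
Uniqueness of LibID can be checked before running anything. This is a sketch under two assumptions taken from the example above: the file has one header line, and LibID is the second whitespace-separated column. duplicate_libids is a hypothetical helper, not part of PARCEL.

```shell
# Hypothetical helper (not shipped with PARCEL): print every LibID that
# occurs more than once in sampleList.txt. No output means all are unique.
#   $1 = path to sampleList.txt
duplicate_libids() {
    # Skip the header (NR > 1) and blank lines, then count column 2.
    awk 'NR > 1 && NF > 0 { count[$2]++ }
         END { for (id in count) if (count[id] > 1) print id }' "$1" | sort
}
```

Usage: duplicate_libids input/sampleList.txt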

  • input read files -- Reads are single-end. Each file must be named {LibID}.fastq.gz, with LibID matching the entry in sampleList.txt. All read files from the same sequencing batch go into one folder named {SeqBatch}, as indicated in sampleList.txt. For example, the read files "V1_1.fastq.gz", "V1_2.fastq.gz", "V1_met_1.fastq.gz" and "V1_met_2.fastq.gz" can be put into the folder "input/seq1/":
ls input/*
input/sampleList.txt

input/seq1:
V1_1.fastq.gz  V1_2.fastq.gz  V1_met_1.fastq.gz  V1_met_2.fastq.gz
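
The naming rule {SeqBatch}/{LibID}.fastq.gz can be verified against sampleList.txt before the pipeline starts. The sketch below is an illustration only: missing_reads is a hypothetical helper, and it assumes a single header line with LibID in column 2 and SeqBatch in column 5, as in the example table.

```shell
# Hypothetical helper (not shipped with PARCEL): list the read files that
# sampleList.txt expects but that are not present on disk.
#   $1 = path to sampleList.txt
#   $2 = root folder of the reads (the value used for PARCELREADSROOT)
missing_reads() {
    sample_list="$1"
    reads_root="$2"
    awk 'NR > 1 && NF > 0 { print $2, $5 }' "$sample_list" |
    while read -r lib_id seq_batch; do
        f="${reads_root%/}/${seq_batch}/${lib_id}.fastq.gz"
        [ -f "$f" ] || echo "$f"
    done
}
```

Usage: missing_reads input/sampleList.txt input/ (no output means every expected file was found).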

Generate config file

To generate a configuration file for snakemake, several variables need to be defined:

  • PARCELSCRIPTS: path to scripts used in pipeline
  • PARCELDB: path to folder where transcriptome files are
  • PARCELREADSROOT: path to root folder of sequenced reads
  • PARCELSAMPLEINFO: path to the sampleList.txt file
  • PARCELRESULTROOT: path to root folder of results
  • PARCELBATCH: batchID indicating which libraries should be selected
  • PARCELCONTROL: which condition should be used as control

The configuration file can be generated with the script generateConfigureFile.sh:

PARCELDB=database/C.albican/ PARCELREADSROOT=input/ PARCELSAMPLEINFO=input/sampleList.txt PARCELRESULTROOT=result/ PARCELBATCH=batch1 PARCELCONTROL=control generateConfigureFile.sh pipeline/config/conf.template.json > pipeline/config/conf.batch1.json
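
When the variables are set in the shell session rather than inline on the command line, a small guard can catch an empty one before the template is filled in. This is a sketch, not part of PARCEL: check_parcel_env is a hypothetical helper that only tests the seven variable names listed above and does not inspect their values.

```shell
# Hypothetical helper (not shipped with PARCEL): report which of the
# variables needed for the configuration file are unset or empty.
check_parcel_env() {
    ret=0
    for name in PARCELSCRIPTS PARCELDB PARCELREADSROOT PARCELSAMPLEINFO \
                PARCELRESULTROOT PARCELBATCH PARCELCONTROL; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name" >&2
            ret=1
        fi
    done
    return "$ret"
}
```

Usage: check_parcel_env && generateConfigureFile.sh pipeline/config/conf.template.json > pipeline/config/conf.batch1.json (the variables must be exported for the script to see them).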

conf.batch1.json

The files should now be organized as follows:

.
├── database
│   └── C.albican
│       ├── cdsinfo.txt
│       ├── transcriptome.1.bt2
│       ├── transcriptome.2.bt2
│       ├── transcriptome.3.bt2
│       ├── transcriptome.4.bt2
│       ├── transcriptome.fas
│       ├── transcriptome.rev.1.bt2
│       ├── transcriptome.rev.2.bt2
│       └── transcriptome.size
├── input
│   ├── sampleList.txt
│   └── seq1
│       ├── V1_1.fastq.gz
│       ├── V1_2.fastq.gz
│       ├── V1_met_1.fastq.gz
│       └── V1_met_2.fastq.gz
├── LICENSE.md
├── pipeline
│   ├── config
│   │   ├── conf.batch1.json
│   │   └── conf.template.json
│   └── parcel.sk
├── README.md
└── scripts
    ├── BamToPosCount.sh
    ├── bedGraphTrack.pl
    ├── definedVariable.sh
    ├── differential_Regions.R
    ├── differential_Sites.R
    ├── extractCoverageInfo.R
    ├── filtered_Regions.R
    ├── filterInspection.R
    ├── generateConfigureFile.sh
    ├── mapReadsToTranscriptom.sh
    ├── mergeCoverage.R
    ├── parallel_cutadpt.sh
    ├── parse_bam_best_parallel_random.sh
    ├── parse_bam_best_random.pl
    ├── parseCutAptLog.pl
    ├── qualityCheck.R
    ├── reshapeTable.R
    ├── runsnake.sh
    ├── splitBam.mawk
    └── sumBowtieMapResult.pl

Running Pipeline

Local Mode

The pipeline can be simply run in local mode with the configuration file.

source ${pathtosnakemake}/snakemake/bin/activate
snakemake -s pipeline/parcel.sk --configfile pipeline/config/conf.batch1.json -j 32
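
Before a full run, a dry run shows which jobs snakemake would execute without running them. The sketch below wraps this in a hypothetical helper, parcel_dry_run, which is not part of PARCEL; the default paths are the ones used in this README, and -n (dry run) and -p (print shell commands) are standard snakemake options.

```shell
# Hypothetical helper (not shipped with PARCEL): list the jobs snakemake
# would run, without executing anything.
#   $1 = Snakefile   (default: pipeline/parcel.sk)
#   $2 = config file (default: pipeline/config/conf.batch1.json)
parcel_dry_run() {
    snakefile="${1:-pipeline/parcel.sk}"
    config="${2:-pipeline/config/conf.batch1.json}"
    # Fail early with a clear message if either file is missing.
    for f in "$snakefile" "$config"; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done
    snakemake -s "$snakefile" --configfile "$config" -n -p
}
```

Usage: parcel_dry_run (run from the PARCEL root folder with the virtual environment activated).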

Submit snakemake jobs to Cluster

Alternatively, submit the pipeline to the job scheduler:

runsnake.sh pipeline/parcel.sk conf.batch1.json testjob 24 24

Results

After the pipeline finishes, the results are stored in the "result/" folder. The two main output files are:

  • combined_met_output2_wfilters.txt -- Candidate regions.
  • combined_met_covinfo.xls -- Coverage information.
.
├── bedgraphs
│   └── Transcriptome
│       └── batch1
│           ├── control__V1_1__seq1__nor.bedgraph.gz
│           ├── control__V1_2__seq1__nor.bedgraph.gz
│           ├── met__V1_met_1__seq1__nor.bedgraph.gz
│           └── met__V1_met_2__seq1__nor.bedgraph.gz
├── coverageInfo
│   └── Transcriptome
│       └── seq1
│           ├── V1_1.cov.txt.gz
│           ├── V1_2.cov.txt.gz
│           ├── V1_met_1.cov.txt.gz
│           └── V1_met_2.cov.txt.gz
├── mappedResult
│   └── Transcriptome
│       ├── batch1
│       │   └── mapSummary.txt
│       └── seq1
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_1.trim.fastq.genome_mapping.log
│           ├── V1_1.trim.fastq.genome_mapping.summary
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_2.trim.fastq.genome_mapping.log
│           ├── V1_2.trim.fastq.genome_mapping.summary
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_1.trim.fastq.genome_mapping.log
│           ├── V1_met_1.trim.fastq.genome_mapping.summary
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_2.trim.fastq.genome_mapping.log
│           └── V1_met_2.trim.fastq.genome_mapping.summary
├── PARCEL
│   └── Transcriptome
│       └── batch1
│           ├── allcov.txt.gz
│           ├── allcov.wide.min2.txt.gz
│           ├── combined_met_covinfo.xls
│           ├── combined_met_output2_wfilters.Rdata
│           ├── combined_met_output2_wfilters.txt
│           ├── combined_v1all.Rdata
│           ├── covinfo_met.Rdata
│           ├── edgeR_met_sf.Rdata
│           ├── etTable_met.Rdata
│           └── fastq2_met_output10.Rdata
├── qualityCheck
│   └── batch1
│       ├── all.Rdata
│       └── processingSummary.xls
└── trimmedFastq
    ├── batch1
    │   └── trimSummary.txt
    └── seq1
        ├── read.trim.V1_1.log
        ├── read.trim.V1_1.log.sum
        ├── read.trim.V1_2.log
        ├── read.trim.V1_2.log.sum
        ├── read.trim.V1_met_1.log
        ├── read.trim.V1_met_1.log.sum
        ├── read.trim.V1_met_2.log
        ├── read.trim.V1_met_2.log.sum
        ├── V1_1.trim.fastq.gz
        ├── V1_2.trim.fastq.gz
        ├── V1_met_1.trim.fastq.gz
        └── V1_met_2.trim.fastq.gz
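
Whether a run produced its two main outputs can be checked with a short script. This is a sketch under an assumption inferred from the example tree above: the file names follow the pattern combined_{condition}_output2_wfilters.txt and combined_{condition}_covinfo.xls under PARCEL/Transcriptome/{batch}/. check_parcel_results is a hypothetical helper, not part of PARCEL.

```shell
# Hypothetical helper (not shipped with PARCEL): check that the candidate
# regions and coverage files exist and are non-empty.
#   $1 = result root (the value used for PARCELRESULTROOT)
#   $2 = comparison batch, e.g. batch1
#   $3 = treatment condition, e.g. met
check_parcel_results() {
    result_root="${1%/}"
    batch="$2"
    condition="$3"
    dir="${result_root}/PARCEL/Transcriptome/${batch}"
    ret=0
    # Assumption: output names embed the condition, as in the example tree.
    for f in "combined_${condition}_output2_wfilters.txt" \
             "combined_${condition}_covinfo.xls"; do
        if [ -s "${dir}/${f}" ]; then
            echo "found: ${dir}/${f}"
        else
            echo "missing or empty: ${dir}/${f}" >&2
            ret=1
        fi
    done
    return "$ret"
}
```

Usage: check_parcel_results result/ batch1 met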

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

  • Miao Sun - Original Author and Development
  • Yang Shen - Snakemake support

Contact

Please contact us if you find bugs, have suggestions, or need help. You can either use our mailing list or send us an email:

PARCEL is developed at the Genome Institute of Singapore.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
