A computational pipeline for analyzing sequencing reads generated from PARCEL experiments to identify genomic regions with RNA structural changes in transcripts.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
- Ubuntu
- CentOS
- Red Hat Enterprise Linux (please use the CentOS packages and instructions)
- Univa Grid Engine
- TORQUE Resource Manager
- perl >= 5.10
- python >= 3.5.1 (for snakemake)
- R >= 3.1.0
- GNU parallel >= 20150222
- GNU sort >= 8.23 (GNU coreutils)
- pigz >= 2.3.1
- mawk >= 1.3.4
- bedtools >= 2.25.0
- snakemake >= 3.12.0
- cutadapt >= 1.8.1
- samtools >= 1.3.1
- bowtie2 >= 2.2.4
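Before running the pipeline, it can help to confirm that the tools listed above are reachable on `PATH`. The following is a minimal sketch, not part of PARCEL itself. It assumes the executables carry their usual names (for example `parallel` for GNU parallel and `R` for R), and it checks only for presence, not for the minimum versions listed above.

```python
import shutil

# Executable names assumed to correspond to the tools listed above
# (hypothetical mapping; adjust if your installation uses other names).
REQUIRED_TOOLS = [
    "perl", "python3", "R", "parallel", "sort", "pigz", "mawk",
    "bedtools", "snakemake", "cutadapt", "samtools", "bowtie2",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools, in input order, that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.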
- IO::File
- IO::Handle
- List::Util
- Math::Random
These Perl modules can be installed with the following commands:
perl -MCPAN -e "install App::Cpan"
cpan -i IO::Handle IO::File Math::Random List::Util
- argparse
- adagio
- data.table >= 1.10.0
- edgeR
- bedr >= 1.0.2
These R packages can be installed with the following commands in R:
install.packages(c("argparse","adagio","bedr","data.table"));
source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
Install snakemake into a virtual environment:
git clone https://bitbucket.org/snakemake/snakemake.git
cd snakemake
virtualenv -p python3 snakemake
source snakemake/bin/activate
python setup.py install
Download the scripts and configuration files from GitHub and add the scripts directory to the PATH variable:
git clone https://github.com/shenyang1981/PARCEL.git
cd PARCEL/; export PARCELSCRIPTS="${PWD}/scripts"; export PATH="${PARCELSCRIPTS}:$PATH"
You may want to add 'export PATH=${PARCELSCRIPTS}:$PATH' to your .bashrc file, with PARCELSCRIPTS set to the absolute path of the scripts directory.
The transcriptome file must be in FASTA format and indexed for Bowtie2. The following files are required:
- transcriptome.fas -- transcriptome file
- transcriptome.size -- length of each transcript, in the format: transcriptID{tab}Length
- cdsinfo.txt -- start and end position of the CDS in each transcript, in the format: transcriptID{tab}start{tab}end{tab}Length
Put all files into one folder, for example "database/C.albican/", then build the Bowtie2 index from the transcriptome file:
cd database/C.albican/
bowtie2-build transcriptome.fas transcriptome
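The transcriptome.size file described above can be derived from transcriptome.fas. The sketch below is one possible way to do this and is not shipped with PARCEL. It assumes the transcript ID is the FASTA header text up to the first whitespace, which should match the reference names Bowtie2 reports; the function names are hypothetical.

```python
def fasta_lengths(lines):
    """Yield (transcript_id, length) for each record in FASTA-formatted lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the ID is the header text up to the first whitespace.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            # Sequence lines may be wrapped, so lengths are accumulated.
            length += len(line)
    if name is not None:
        yield name, length


def write_size_file(fasta_path, size_path):
    """Write transcriptID{tab}Length lines for every record in fasta_path."""
    with open(fasta_path) as fasta, open(size_path, "w") as out:
        for name, length in fasta_lengths(fasta):
            out.write("{}\t{}\n".format(name, length))
```

For example, `write_size_file("transcriptome.fas", "transcriptome.size")` run inside the database folder would produce the size file next to the FASTA file.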
- sampleList.txt -- information about each sequenced library: library ID (LibID), condition or treatment (Condition), replicate (Replicates), sequencing batch (SeqBatch), experimental batch (ExperiementalBatch), and comparison batch (ComparisonBatch). Samples belonging to the same comparison batch are selected for pairwise comparison.
sampleList.txt is formatted as follows:
Species | LibID | Condition | Replicates | SeqBatch | ExperiementalBatch | ComparisonBatch |
---|---|---|---|---|---|---|
Candida | V1_1 | control | rep1 | seq1 | 1 | batch1 |
Candida | V1_2 | control | rep2 | seq1 | 1 | batch1 |
Candida | V1_met_1 | met | rep1 | seq1 | 1 | batch1 |
Candida | V1_met_2 | met | rep2 | seq1 | 1 | batch1 |
**Note:** LibID must be unique, because the corresponding sequence file must be named {LibID}.fastq.gz.
- input reads files -- Reads are single-end. Each file must be named {LibID}.fastq.gz, where LibID is the same as in sampleList.txt. All reads files from the same sequencing batch must be placed in one folder named after the {SeqBatch} indicated in sampleList.txt. For example, the reads files "V1_1.fastq.gz", "V1_2.fastq.gz", "V1_met_1.fastq.gz" and "V1_met_2.fastq.gz" can be put into the folder "input/seq1/":
ls input/*
input/sampleList.txt
input/seq1:
V1_1.fastq.gz V1_2.fastq.gz V1_met_1.fastq.gz V1_met_2.fastq.gz
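The naming rules above ({SeqBatch} folder, {LibID}.fastq.gz file, unique LibID) can be checked before starting a run. This is a small sketch under stated assumptions rather than part of the pipeline: rows are represented as dictionaries with `LibID` and `SeqBatch` keys, and the helper names are hypothetical.

```python
import os
from collections import Counter


def expected_read_paths(rows, reads_root):
    """Map each LibID to the path where its reads file is expected."""
    counts = Counter(row["LibID"] for row in rows)
    duplicates = sorted(lib for lib, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError("duplicate LibID: " + ", ".join(duplicates))
    return {
        row["LibID"]: os.path.join(
            reads_root, row["SeqBatch"], row["LibID"] + ".fastq.gz"
        )
        for row in rows
    }


def missing_reads(rows, reads_root, exists=os.path.exists):
    """Return the sorted list of expected reads files that do not exist."""
    paths = expected_read_paths(rows, reads_root).values()
    return sorted(path for path in paths if not exists(path))


# Rows as they would appear after parsing sampleList.txt (only two shown).
rows = [
    {"LibID": "V1_1", "SeqBatch": "seq1"},
    {"LibID": "V1_met_1", "SeqBatch": "seq1"},
]
print(missing_reads(rows, "input"))
```

An empty list means every library in the sample list has a reads file in the expected location.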
To generate a configuration file for snakemake, several variables need to be defined:
- PARCELSCRIPTS: path to scripts used in pipeline
- PARCELDB: path to folder where transcriptome files are
- PARCELREADSROOT: path to root folder of sequenced reads
- PARCELSAMPLEINFO: path to the sampleList.txt file
- PARCELRESULTROOT: path to root folder of results
- PARCELBATCH: batchID indicating which libraries should be selected
- PARCELCONTROL: which condition should be used as control
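To illustrate what PARCELBATCH and PARCELCONTROL select, the sketch below groups the libraries of one comparison batch by condition and pairs each non-control condition with the control. It is an illustration only and not the pipeline's own selection code. It assumes sampleList.txt is tab-delimited with the column names shown in the table above and that every non-control condition is compared against the control; the function names are hypothetical.

```python
import csv
import io


def read_sample_list(handle):
    """Parse a tab-delimited sample list into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def comparisons(rows, batch, control):
    """Return {condition: (control LibIDs, condition LibIDs)} for one batch."""
    by_condition = {}
    for row in rows:
        if row["ComparisonBatch"] == batch:
            by_condition.setdefault(row["Condition"], []).append(row["LibID"])
    if control not in by_condition:
        raise ValueError("no control libraries found in batch " + batch)
    return {
        condition: (by_condition[control], libs)
        for condition, libs in by_condition.items()
        if condition != control
    }


# The example table from this README, written as tab-delimited text.
EXAMPLE = "\n".join("\t".join(fields) for fields in [
    ("Species", "LibID", "Condition", "Replicates", "SeqBatch",
     "ExperiementalBatch", "ComparisonBatch"),
    ("Candida", "V1_1", "control", "rep1", "seq1", "1", "batch1"),
    ("Candida", "V1_2", "control", "rep2", "seq1", "1", "batch1"),
    ("Candida", "V1_met_1", "met", "rep1", "seq1", "1", "batch1"),
    ("Candida", "V1_met_2", "met", "rep2", "seq1", "1", "batch1"),
])

rows = read_sample_list(io.StringIO(EXAMPLE))
print(comparisons(rows, "batch1", "control"))
```

With the example table, the only comparison in batch1 is "met" against "control", each with two replicate libraries.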
The configuration file can be generated with the script generateConfigureFile.sh:
PARCELDB=database/C.albican/ PARCELREADSROOT=input/ PARCELSAMPLEINFO=input/sampleList.txt PARCELRESULTROOT=result/ PARCELBATCH=batch1 PARCELCONTROL=control generateConfigureFile.sh pipeline/config/conf.template.json > pipeline/config/conf.batch1.json
conf.batch1.json
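Because generateConfigureFile.sh reads its settings from environment variables, a missing variable is easy to overlook. The following sketch wraps the command shown above and refuses to run when one of the listed variables is unset. It assumes generateConfigureFile.sh is on `PATH` and writes the configuration to standard output, as in the command above; the wrapper functions themselves are hypothetical.

```python
import os
import subprocess

# Variable names as listed in this README.
REQUIRED_VARIABLES = [
    "PARCELSCRIPTS", "PARCELDB", "PARCELREADSROOT", "PARCELSAMPLEINFO",
    "PARCELRESULTROOT", "PARCELBATCH", "PARCELCONTROL",
]


def unset_variables(environ=os.environ):
    """Return the required variables that are missing or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


def generate_config(template_path, output_path, settings):
    """Run generateConfigureFile.sh with settings merged into the environment."""
    env = dict(os.environ, **settings)
    missing = unset_variables(env)
    if missing:
        raise ValueError("unset variables: " + ", ".join(missing))
    with open(output_path, "w") as out:
        # Equivalent to: VAR=... generateConfigureFile.sh template > output
        subprocess.run(
            ["generateConfigureFile.sh", template_path],
            env=env, stdout=out, check=True,
        )
```

A call such as `generate_config("pipeline/config/conf.template.json", "pipeline/config/conf.batch1.json", {"PARCELBATCH": "batch1", "PARCELCONTROL": "control"})` would then rely on the remaining variables already being exported in the shell.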
The files should now be organized as follows:
.
├── database
│   └── C.albican
│       ├── cdsinfo.txt
│       ├── transcriptome.1.bt2
│       ├── transcriptome.2.bt2
│       ├── transcriptome.3.bt2
│       ├── transcriptome.4.bt2
│       ├── transcriptome.fas
│       ├── transcriptome.rev.1.bt2
│       ├── transcriptome.rev.2.bt2
│       └── transcriptome.size
├── input
│   ├── sampleList.txt
│   └── seq1
│       ├── V1_1.fastq.gz
│       ├── V1_2.fastq.gz
│       ├── V1_met_1.fastq.gz
│       └── V1_met_2.fastq.gz
├── LICENSE.md
├── pipeline
│   ├── config
│   │   ├── conf.batch1.json
│   │   └── conf.template.json
│   └── parcel.sk
├── README.md
└── scripts
    ├── BamToPosCount.sh
    ├── bedGraphTrack.pl
    ├── definedVariable.sh
    ├── differential_Regions.R
    ├── differential_Sites.R
    ├── extractCoverageInfo.R
    ├── filtered_Regions.R
    ├── filterInspection.R
    ├── generateConfigureFile.sh
    ├── mapReadsToTranscriptom.sh
    ├── mergeCoverage.R
    ├── parallel_cutadpt.sh
    ├── parse_bam_best_parallel_random.sh
    ├── parse_bam_best_random.pl
    ├── parseCutAptLog.pl
    ├── qualityCheck.R
    ├── reshapeTable.R
    ├── runsnake.sh
    ├── splitBam.mawk
    └── sumBowtieMapResult.pl
The pipeline can be run in local mode with the configuration file:
source ${pathtosnakemake}/snakemake/bin/activate
snakemake -s pipeline/parcel.sk --configfile pipeline/config/conf.batch1.json -j 32
Or run it by submitting it to the job scheduler:
runsnake.sh pipeline/parcel.sk conf.batch1.json testjob 24 24
After the pipeline finishes, the results are stored in the "result/" folder. The main output files are:
- combined_met_output2_wfilters.txt -- Candidate regions.
- combined_met_covinfo.xls -- Coverage information.
.
├── bedgraphs
│   └── Transcriptome
│       └── batch1
│           ├── control__V1_1__seq1__nor.bedgraph.gz
│           ├── control__V1_2__seq1__nor.bedgraph.gz
│           ├── met__V1_met_1__seq1__nor.bedgraph.gz
│           └── met__V1_met_2__seq1__nor.bedgraph.gz
├── coverageInfo
│   └── Transcriptome
│       └── seq1
│           ├── V1_1.cov.txt.gz
│           ├── V1_2.cov.txt.gz
│           ├── V1_met_1.cov.txt.gz
│           └── V1_met_2.cov.txt.gz
├── mappedResult
│   └── Transcriptome
│       ├── batch1
│       │   └── mapSummary.txt
│       └── seq1
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_1.trim.fastq.genome_mapping.log
│           ├── V1_1.trim.fastq.genome_mapping.summary
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_2.trim.fastq.genome_mapping.log
│           ├── V1_2.trim.fastq.genome_mapping.summary
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_1.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_1.trim.fastq.genome_mapping.log
│           ├── V1_met_1.trim.fastq.genome_mapping.summary
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam
│           ├── V1_met_2.trim.fastq.genome_mapping_best.sort.bam.bai
│           ├── V1_met_2.trim.fastq.genome_mapping.log
│           └── V1_met_2.trim.fastq.genome_mapping.summary
├── PARCEL
│   └── Transcriptome
│       └── batch1
│           ├── allcov.txt.gz
│           ├── allcov.wide.min2.txt.gz
│           ├── combined_met_covinfo.xls
│           ├── combined_met_output2_wfilters.Rdata
│           ├── combined_met_output2_wfilters.txt
│           ├── combined_v1all.Rdata
│           ├── covinfo_met.Rdata
│           ├── edgeR_met_sf.Rdata
│           ├── etTable_met.Rdata
│           └── fastq2_met_output10.Rdata
├── qualityCheck
│   └── batch1
│       ├── all.Rdata
│       └── processingSummary.xls
└── trimmedFastq
    ├── batch1
    │   └── trimSummary.txt
    └── seq1
        ├── read.trim.V1_1.log
        ├── read.trim.V1_1.log.sum
        ├── read.trim.V1_2.log
        ├── read.trim.V1_2.log.sum
        ├── read.trim.V1_met_1.log
        ├── read.trim.V1_met_1.log.sum
        ├── read.trim.V1_met_2.log
        ├── read.trim.V1_met_2.log.sum
        ├── V1_1.trim.fastq.gz
        ├── V1_2.trim.fastq.gz
        ├── V1_met_1.trim.fastq.gz
        └── V1_met_2.trim.fastq.gz
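The bedgraph file names in the listing above appear to follow the pattern {Condition}__{LibID}__{SeqBatch}__nor.bedgraph.gz. The sketch below splits such a name back into its fields, which can be convenient when loading tracks into a genome browser. The pattern is inferred from this example listing only, so treat it as an assumption; the function name is hypothetical, and fields containing a double underscore would not parse correctly.

```python
def parse_bedgraph_name(filename):
    """Split a normalized bedgraph file name into its three fields."""
    suffix = "__nor.bedgraph.gz"
    if not filename.endswith(suffix):
        raise ValueError("unexpected file name: " + filename)
    # Fields are separated by double underscores; single underscores
    # inside a field (as in V1_met_1) are left untouched.
    parts = filename[:-len(suffix)].split("__")
    if len(parts) != 3:
        raise ValueError("unexpected file name: " + filename)
    condition, lib_id, seq_batch = parts
    return {"Condition": condition, "LibID": lib_id, "SeqBatch": seq_batch}


print(parse_bedgraph_name("met__V1_met_1__seq1__nor.bedgraph.gz"))
```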
We use SemVer for versioning. For the versions available, see the tags on this repository.
- Miao Sun - Original Author and Development
- Yang Shen - Snakemake support
Please contact us if you find bugs, have suggestions, or need help. You can either use our mailing list or send us an email:
PARCEL is developed at the Genome Institute of Singapore.
This project is licensed under the MIT License - see the LICENSE.md file for details.