a snakemake workflow to run SPRINTER on single-cell whole-genome DNA sequencing data with an individual non-barcoded bam file per cell
This snakemake workflow was written to prepare the necessary input for SPRINTER, a tool to identify replicating cells from single-cell whole-genome DNA sequencing data. From the DLP+ pipeline we receive a non-barcoded individual bam file per cell from the experiment. This is the correct input format for scAbsolute. However, SPRINTER takes a single barcoded bam file including all cells as input. It is thus necessary to generate a barcode for each cell and merge the individual bam files into one file. This workflow takes the aligned bam files from the snakemake-illumina-alignment-scMulti pipeline as input.
- use aligned bam files as input
- generate a 16-character long, random, and unique barcode made up of
[A, C, T, G]for each cell - if changes have been made to the script, compile the C++ sript
mark_and_merge_bams_samplesheet.cppfor merging the single bam files - use compiled C++ script to barcode and merge bam files of individual cells into one bam file and change chromosome names to chr1, chr2 etc. to be compatible with CHISEL
sortandindexthe merged bam file with samtools- run
chisel_rdrcommand to calculate read depth ratio (RDR) values and prepare SPRINTER input .tsv file gzipthe .tsv file and run SPRINTER on the prepared .tsv.gz file
- Make sure you have installed snakemake
- Complete all work within your snakemake conda environment.
- Clone this repository using
git clone.
- Prepare your input data using the snakemake-illumina-alignment-scMulti pipeline which outputs an aligned bam file for each cell in the experiment. The bam files should be placed in
results/singlebams/{SLX}in a separate directory using the sample name. - The sample name (SLX) and path to the reference genome need to be adjusted in the configuration file
config/main.yaml. The profile is currently set to run on the rocm partition of the cluster, this can be adjusted inprofiles/single_cell_WGS/config.yaml. - The makebarc.py script generates random and unique barcodes from
[A, C, G, T]and a length currently set to 16 characters. This can be adjusted in the script. The output of the script is a .tsv file with the bam file and the matched newly generated barcode. - The
mark_and_merge_bams_samplesheettool takes the single bams, the barcode sheet, and a reference sheet with the old and new chromosome names as input. The reference sheet should be a .tsv file and should be placed inresults/ref_sheet. If any changes are made to the C++ script, it needs to be recompiled using thecompile_cpprule in the assigned conda environment. Otherwise the precompiled tool can be used in the same conda environment specified byenvs/merge.yaml. The tool adds the newly generated to the bam headers and changes the chromosome names according to the reference sheet (e.g. "1" -> "chr1"). It then merges the bam files into one bam file. Depending on the size of the dataset this step is going to take a couple of hours. - The new merged bam file is then sorted and indexed with the respetive
samtoolscommands. - The sorted and indexed bam file can be used as input to the CHISEL command
chisel_rdr. This command also takes the reference genome specified in the configuration file as input. It also takes a couple parameters which can be adjusted:
-bsets the size of the genomic bin (default: 50000), SPRINTER expects a bin size of 50kb, so this should not be changed-msets the minimum number of total reads to select cells (default: 100000)chisel_rdrcreates multiple directories, some of which are empty and can be deleted. The relevant output for the pipeline is therdr/rdr.tsvfile.
- The
run_sprinterrule then usesgzipon the rdr.tsv file and uses the resulting .tsv.gz file as input for SPRINTER, alongside the reference genome. SPRINTER also has a couple of parameters that can be tuned to improve the prediction.
minreadsis the minimum number of total sequencing reads that a cell needs to have in order to enter the SPRINTER analysis (default: 100000).rtreadsis the target number of reads used for replication-timing analyses (default: 200).cnreadsis the target number of reads used for the copy-number analyses (default: 1000), this can be adjusted to suit the size of the copy-number aberrations that you want to investigate.minnumcellsdefines the minimum expected number of cells to define clones (default: 15). The main output of the SPRINTER pipeline is thesprinter.output.tsv.gzfile, which contains the inferred clone and cell cycle phase for every analysed cell. Further parameters for tuning and more detailed information can be found on the SPRINTER repository.
The whole workflow can be executed by running the run_sprinterwf.sh bash script or directly using the following command in the activated snakemake conda environment:
snakemake -p --workflow-profile workflow/profiles/single_cell_WGS --use-conda --profile slurm --rerun-incomplete