Skip to content

Latest commit

 

History

History

code

2. 10X Genomics Whole Genome Sequencing Analysis

R script (titanCNA_v1.15.0_TenX.R) for running TitanCNA analysis on sequencing data with haplotype phasing such as 10X Genomics whole genome data

Input files

  1. GC-corrected, normalized read coverage using the HMMcopy suite
    a. Get molecule coverage by counting unique barcodes in non-overlapping bins. To do this, you will need the following installed:

    • bxtools
    • samtools To extract the molecule coverage at 10kb bins across chromosome 1 for both tumour and normal samples:
    samtools view -h -F 0x4 -q 60 tumour.bam 1 | bxtools tile - -b chr1.10kb.bed > tumour_bxTile/chr1.10kb.bxTile.bed  
    

    The chromosome bed files for 10kb bins is provided in TenX_scripts/data/10kb.bed.tar.gz. Just tar xvzf 10kb.bed.tar.gz to extract the files.

    b. Normalize the molecule coverage
    This analysis uses ichorCNA to correct molecular coverage by GC and mappability biases.
    See https://github.com/broadinstitute/ichorCNA/tree/master/scripts

    # from the command line
    >Rscript TenX_scripts/getMoleculeCoverage.R --help
    Usage: TenX_scripts/getMoleculeCoverage.R [options]
    
    
    Options:
            -t TUMORBXDIR, --tumorBXDir=TUMORBXDIR
                    Path to directory containing tumor bed files for each chromosome containing BX tags.
    
            -n NORMALBXDIR, --normalBXDir=NORMALBXDIR
                    Path to directory containing normal bed files for each chromosome containing BX tags.
    
            --minReadsPerBX=MINREADSPERBX
                    Minimum number of reads per barcode.
            
            --gcWig=GCWIG
                    Path to GC-content WIG file; Required
                    
            --libdir=LIBDIR
                    Script library path to include new or modified source R files (optional). Default: [NULL]
                
            (... plus other options)  
    
    

    Here is an example

    Rscript getMoleculeCoverage.R --id test  \
       --tumorBXDir tumour_bxTile/ --normalBXDir normal_bxTile/ --minReadsPerBX 2 \
       --chrs "c(1:22, \"X\")" --outDir ../ \
       --centromere TenX_scripts/data/GRCh37.p13_centromere_UCSC-gapTable.txt \
       --gcWig ichorCNA/inst/extdata/gc_hg38_1000kb.wig \
       --libdir TitanCNA/R/
    
  2. Tumour sample allelic read counts at heterozygous SNPs (identifed from the matched normal sample).
    a. Identify the phased heterozygous SNPs (excluding indels) from the LongRanger 2.1 analysis on the matched normal sample.
    Use this R script from the command line to process the phased variant file (*phased_variants.vcf.gz) and extracts heterozygous sites after filtering. The output VCF file suffix *_phasedHets.vcf

    # from the command line
    > Rscript TenX_scripts/getPhasedHETSitesFromLLRVCF.R --help
    Usage: Rscript getPhasedHETSitesFromLLRVCF.R [options]
    
    
    Options:
            -i INVCF, --inVCF=INVCF
                    LongRanger 2.1 phased variant result VCF file; typically has filename suffix "*phased_variants.vcf.gz". [Required]
    
            -q MINQUALITY, --minQuality=MINQUALITY
                    Heterozygous variants with QUAL greater than or equal to this value are considered. [Default: 100]
    
            -d MINDEPTH, --minDepth=MINDEPTH
                    Heterozygous variants with read depth greater than or equal to this value are considered. [Default: 10]
    
            -v MINVAF, --minVAF=MINVAF
                    Heterozygous variants with variant allele fraction or reference allele fraction greater than this value are considered. [Default: 0.25]
    
            -o OUTVCF, --outVCF=OUTVCF
                    Output VCF file with suffix "_phasedHets.vcf" [Required]
    
            -h, --help
                    Show this help message and exit
    

    b. Extract allelic read counts from the tumour sample
    Using this python script, extract the allele read counts at the heterozygous SNP sites identified in the previous step. This script is meant to be run per chromosome so that users can parallelize this step. Users will need to combine the chromosome files into a single tab-delimited file for input into the R script.

    # from the command line
    # run per chromosome
    # The `pysam` library will need to be installed in python.
    
    usage: python TenX_scripts/getTumourAlleleCountsAtHETSites.py $chr $phasedHetsVCF $bam $refFasta $baseQuality $mapQuality $vcfQuality > $tumCountsFile
    
    # $chr - Chromosome to analyze (e.g. 1)
    # $phasedHetsVCF - Output VCF file from getPhasedHETSitesFromLLRVCF.R
    # $bam - BAM file path of the tumour sample
    # $refFasta - File path to the reference FASTA file
    # $baseQuality - Minimum base quality to include in counts
    # $mapuality - Minimum mapping quality of reads to include in counts
    # $vcfQuality - Exclude HET sites with lower QUAL than this value
    # $tumCountsFile - Output tab-delimited file 
    

Running the R script

  1. Look at the usage of the R script
# from the command line
> Rscript titanCNA_v1.15.0_TenX.R --help
Usage: Rscript titanCNA_v1.15.0_TenX.R [options]


Options:
      --id=ID
              Sample ID

      --hetFile=HETFILE
              File containing allelic read counts at HET sites. (Required)
      
      (... other arguments similar to titanCNA.R)
      
      --alleleModel=ALLELEMODEL
              Emission density to use for allelic input data (binomial or Gaussian). [Default: binomial]
              
      --alphaR=ALPHAR
              Hyperparaemter on the Gaussian variance for allelic fraction data; used if --alleleModel="Gaussian". [Defaule: 5000]
      
      --haplotypeBinSize=HAPLOTYPEBINSIZE
              Bin size for summarizing phased haplotypes. [Default: 100000]

      --phaseSummarizeFun=PHASESUMMARIZEFUN
              Strategy to summarize specified haplotype bins: mean, sum, SNP. [Default: sum]

The specific arguments that are different compared to titanCNA.R are listed above. HETFILE should be the output of getTumourAlleleCountsAtHETSites.py.

  1. Example usage of R script
# normalized coverage file: test.cn.txt
# allelic read count file from getTumourAlleleCountsAtHETSites.py: test.het.txt
Rscript titanCNA.R --id test --hetFile test.het.txt --cnFile test.cn.txt \
  --numClusters 1 --numCores 1 --normal_0 0.5 --ploidy_0 2 \
  --chrs "c(1:22, \"X\")" --estimatePloidy TRUE --outDir ./ \
  --haplotypeBinSize 1e5 --phaseSummarizeFun sum --alleleModel Gaussian --alphaR 5000
  1. Running TitanCNA for multiple restarts and model selection
    Use the same approach as for Step 3 of Standard Whole Genome/Exome Sequencing Analysis.