populationgenomics/sgs-somatic-mutation

deCODE method to detect CH in TOB (deCODE pipeline)


What is the deCODE method?

The deCODE method is an approach for identifying individuals with clonal hematopoiesis (CH) without relying on candidate driver mutations. It was first described in the CHIP paper from the deCODE consortium.

"CH arises when a substantial proportion of mature blood cells is derived from a single dominant HSC lineage. Somatic mutations in candidate driver (CD) genes are thought to be responsible for at least some cases of CH."

Briefly, the method extracts singleton mutations, i.e. mutations observed only once in the cohort (WGS of 11,262 Icelanders), and imposes a variant allele fraction (VAF) restriction to identify mosaic somatic mutations. The rationale is that in such a large cohort, germline variants are most likely to be observed more than once across the samples.

Although TOB has a much smaller sample size, we can impose a population allele frequency (pop AF) restriction on singleton mutations and treat the remaining variants as somatic mutations, which can then be used to identify CH carriers.
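The singleton-plus-restriction idea can be sketched in plain Python. This is only an illustration and not the pipeline's code: the record layout, the function name `candidate_somatic_singletons`, and the thresholds `max_pop_af`, `min_vaf`, and `max_vaf` are hypothetical placeholders, since neither the VAF bounds nor the pop AF cutoff is stated in this README.

```python
from collections import defaultdict


def candidate_somatic_singletons(calls, pop_af, max_pop_af, min_vaf, max_vaf):
    """Return (variant_id, sample_id, vaf) for singleton variants passing the filters.

    calls:  iterable of (variant_id, sample_id, alt_reads, total_reads),
            one entry per non-reference genotype call.
    pop_af: dict mapping variant_id -> population allele frequency
            (variants missing from the dict are treated as AF 0.0).
    All thresholds are illustrative assumptions, not values from the pipeline.
    """
    # Group carriers by variant so we can count how many samples carry each one.
    carriers = defaultdict(list)
    for variant_id, sample_id, alt_reads, total_reads in calls:
        carriers[variant_id].append((sample_id, alt_reads, total_reads))

    kept = []
    for variant_id, observed in carriers.items():
        # Singleton: exactly one carrier in the cohort.
        if len(observed) != 1:
            continue
        sample_id, alt_reads, total_reads = observed[0]
        if total_reads == 0:
            continue  # VAF is undefined without reads
        # Pop AF restriction: drop variants that are common in the reference population.
        if pop_af.get(variant_id, 0.0) > max_pop_af:
            continue
        # VAF restriction: keep calls whose allele fraction lies in the chosen window.
        vaf = alt_reads / total_reads
        if min_vaf <= vaf <= max_vaf:
            kept.append((variant_id, sample_id, vaf))
    return kept
```

With a call list in which one variant is seen in two samples and another has a high pop AF, only the variants carried by a single sample, rare in the reference population, and inside the VAF window are returned.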

Details

First, perform QC on TOB's Hail MatrixTable data (following gnomAD's blog and the Genebass paper), then apply the deCODE-specific filters. After QC, identify singleton mutations and export them to a pVCF file for downstream analysis (i.e., identifying CH carriers).

  • Step 1 - read & densify mt data
    • Alignment was done with DragMap (?)
    • Cram -> ... -> gVCF (sample-level) -> hail Matrix Table (mt.v7)
    • Read & densify mt data
  • Step 2 - Sample-level QC
    • Restrict to samples with imputed sex equal to XX (female) or XY (male)
    • Exclude samples if call rate < 0.99 or mean coverage < 20X
    • Exclude related samples
    • Skip sample QC metric outlier filtering
    • Skip ancestry checks (all Europeans)
  • Step 3 - Variant-level QC
    • Restrict to bi-allelic variants
    • Variant filtering with GATK recommended hard filters (thresholds differ between SNVs and INDELs)
    • Exclude variants with inbreeding coefficient < -0.3 or low quality (GQ < 20, DP < 10)
      • Inbreeding coefficient is calculated using bi_allelic_site_inbreeding_expr() imported from gnomad.utils.annotations, adapted from cpg_workflows
  • Step 4 - deCODE specific filters
    • Identify singleton mutations (mutations that occurred only once in our cohort)
    • Exclude variants with DP < 16 or GQ < 90
    • Exclude variants in simple repeat regions (defined by combining the entire Simple Tandem Repeats by TRF track in UCSC hg38 with all homopolymer regions in hg38 of length 6 bp or more)
  • Step 5 - Annotations
    • VEP annotations
    • gnomAD allele frequencies
  • Step 6 - Export to a new Hail MatrixTable
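Some of the thresholds listed above can be written as small predicates. The following is a minimal sketch in plain Python, not the Hail code in `deCODE_hard_filters.py`: all function names are hypothetical, the repeat-region lookup assumes BED-style half-open intervals for a single chromosome, and the relatedness, GATK hard-filter, and inbreeding-coefficient steps are omitted.

```python
import bisect

VALID_SEX = {"XX", "XY"}


def sample_passes_qc(imputed_sex, call_rate, mean_coverage):
    """Step 2: keep XX/XY samples with call rate >= 0.99 and mean coverage >= 20X."""
    return imputed_sex in VALID_SEX and call_rate >= 0.99 and mean_coverage >= 20


def genotype_passes_decode_filter(dp, gq):
    """Step 4: exclude calls with DP < 16 or GQ < 90."""
    return dp >= 16 and gq >= 90


def build_repeat_index(intervals):
    """Sort and merge half-open (start, end) intervals for one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlapping or touching the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    starts = [start for start, _ in merged]
    return starts, merged


def in_repeat_region(pos0, index):
    """True if the 0-based position pos0 lies inside any merged interval."""
    starts, merged = index
    # Find the last interval starting at or before pos0, then check its end.
    i = bisect.bisect_right(starts, pos0) - 1
    return i >= 0 and pos0 < merged[i][1]
```

Merging the intervals first keeps each lookup to a single binary search, which matters when the repeat track holds many overlapping entries.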

How to run this script?

# Make sure you are logged in to GCP
gcloud auth application-default login

# Activate the environment for running analysis-runner
conda activate CPG               # Python 3.11
conda activate analysis-runner   # Python 3.10

Example 1 (test run on a single chromosome, chrM):

chr="M"
analysis-runner --dataset sgs-somatic-mtn \
    --access-level test \
    --output-dir "deCODE" \
    --description "Test deCODE pipeline" \
    python3 deCODE_hard_filters.py --input-mt mt/v7.mt --chrom chr${chr} \
            --regions-file gs://cpg-sgs-somatic-mtn-test-upload/Simple_Repeat_Regions_GRCh38_Excluded_Unmapped_Regions.bed \
            --vep-annotation tob_wgs_vep/v7_vep_108.2/vep108.2_GRCh38.ht \
            --gnomad-file gs://cpg-common-main/references/seqr/v0/combined_reference_data_grch38.ht \
            --output-mt deCODE_test_chr${chr}.mt

Example 2 (test run over all chromosomes):

for chr in {{1..22},{'X','Y','M'}}
do
analysis-runner --dataset sgs-somatic-mtn \
    --access-level test \
    --output-dir "deCODE_pipeline" \
    --description "Test deCODE pipeline" \
    python3 deCODE_hard_filters.py --input-mt mt/v7.mt --chrom chr${chr} \
            --regions-file gs://cpg-sgs-somatic-mtn-test-upload/Simple_Repeat_Regions_GRCh38_Excluded_Unmapped_Regions.bed \
            --vep-annotation tob_wgs_vep/v7_vep_108.2/vep108.2_GRCh38.ht \
            --gnomad-file gs://cpg-common-main/references/seqr/v0/combined_reference_data_grch38.ht \
            --output-mt deCODE_test_chr${chr}.mt
done

Example 3 (run over all chromosomes with standard access level):

for chr in {{1..22},{'X','Y','M'}}
do
analysis-runner --dataset sgs-somatic-mtn \
    --access-level standard \
    --output-dir "deCODE_pipeline" \
    --description "Submit deCODE pipeline through hail batch" \
    python3 deCODE_hard_filters.py --input-mt mt/v7.mt --chrom chr${chr} \
            --regions-file gs://cpg-sgs-somatic-mtn-test-upload/Simple_Repeat_Regions_GRCh38_Excluded_Unmapped_Regions.bed \
            --vep-annotation tob_wgs_vep/v7_vep_108.2/vep108.2_GRCh38.ht \
            --gnomad-file gs://cpg-common-main/references/seqr/v0/combined_reference_data_grch38.ht \
            --output-mt deCODE_chr${chr}.mt
done
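The per-chromosome loops above can also be generated from Python, which makes it easy to inspect the commands before submitting anything. This is a hedged sketch: `build_command` and `dry_run` are hypothetical helpers that are not part of this repository, and every flag and path is copied from the examples above rather than taken from the `analysis-runner` documentation.

```python
import shlex

# chr1..chr22 plus X, Y and M, matching the shell brace expansion above.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "M"]


def build_command(chrom, access_level="test", output_dir="deCODE_pipeline",
                  description="Test deCODE pipeline", output_prefix="deCODE_test"):
    """Return the analysis-runner argument list for one chromosome."""
    return [
        "analysis-runner",
        "--dataset", "sgs-somatic-mtn",
        "--access-level", access_level,
        "--output-dir", output_dir,
        "--description", description,
        "python3", "deCODE_hard_filters.py",
        "--input-mt", "mt/v7.mt",
        "--chrom", f"chr{chrom}",
        "--regions-file",
        "gs://cpg-sgs-somatic-mtn-test-upload/"
        "Simple_Repeat_Regions_GRCh38_Excluded_Unmapped_Regions.bed",
        "--vep-annotation", "tob_wgs_vep/v7_vep_108.2/vep108.2_GRCh38.ht",
        "--gnomad-file",
        "gs://cpg-common-main/references/seqr/v0/combined_reference_data_grch38.ht",
        "--output-mt", f"{output_prefix}_chr{chrom}.mt",
    ]


def dry_run(**kwargs):
    """Return one shell-quoted command line per chromosome without running anything."""
    return [shlex.join(build_command(chrom, **kwargs)) for chrom in CHROMOSOMES]
```

Printing the result of `dry_run()` shows the 25 command lines; passing `access_level="standard"` and `output_prefix="deCODE"` reproduces the settings used in Example 3, apart from its description string.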

About

Genomic Medicine Lab repository for somatic variant analysis
