# MIT License + +Copyright (c) 2020 CAMDAC + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. ++ +
contributing.Rmd
To contribute to CAMDAC, fork the repository and install the development dependencies with remotes::install_dev_deps('.').
After making your changes, run the test and build commands listed below, then submit a pull request with the changes on your fork.
+
+library(devtools)
+
+# Install dev dependencies
+devtools::install_dev_deps(".") # Run from the root of your forked repository
+
+# Update docs
+devtools::document()
+
+# Run tests
+devtools::test()
+
+# Build readme
+rmarkdown::render('README.Rmd', output_format='github_document', output_file='README.md')
+
+# Check package builds
+devtools::check()
+
+# Build documentation
+pkgdown::build_site(examples=FALSE, devel=TRUE, lazy=TRUE, preview=FALSE)
+pkgdown::preview_site() # To view. Or: python3 -m http.server --directory docs 8000
+
+# Commit changes on the docs/ folder before submitting
experimental.Rmd
This document describes experimental features of the CAMDAC package. These features are not yet fully tested and may change in future releases. The following features are currently under development for the WGBS pipeline only:
+The CAMDAC equation can be used to infer pure tumour DNA methylation rates, provided the following information is available per CpG:
+Here is an example for 5 CpGs from a single sample. Note: the normal copy number state is assumed diploid (2) in humans:
+
+
+# Set parameters
+bulk = c(0.3, 0.5, 0.2, 0.1, 0.9)
+normal = c(0.3, 0.9, 0.1, 0.7, 0.5)
+ploidy = c(2, 2, 1, 3, 4)
+purity = 0.8
+
+# Deconvolve methylation rates
+pure_meth = CAMDAC:::calculate_mt(bulk, normal, purity, ploidy)
+
+# Set clean rates based on threshold
+pure_meth_clean = dplyr::case_when(
+ pure_meth < 0 ~ 0,
+ pure_meth > 1 ~ 1,
+ TRUE ~ pure_meth
+)
After deconvolution, it may be useful to estimate the CpG coverage in the deconvolved tumour sample. Additionally, the highest density interval (HDI) of the methylation rate may be informative for quality control. These metrics can be calculated given additional information on bulk methylated and unmethylated read counts:
+
+
+# Optional: calculate effective coverage of the tumour
+# Requires coverage per CpG in the bulk sample
+bulk_coverage = c(10, 20, 5, 15, 30)
+pure_effective_coverage = CAMDAC:::calculate_mt_cov(bulk_coverage, purity, ploidy)
+
+# Optional: calculate the HDI of the pure tumour methylation rate
+bulk_methylated_count = c(3, 10, 1, 2, 27)
+bulk_unmethylated_count = c(7, 10, 4, 13, 3)
+normal_methylated_count = c(3, 9, 1, 5, 2)
+normal_unmethylated_count = c(7, 11, 3, 8, 3)
+
+# HDI function (fast)
+CAMDAC:::hdi_norm_approx(
+ bulk_methylated_count,
+ bulk_unmethylated_count,
+ normal_methylated_count,
+ normal_unmethylated_count,
+ purity,
+ ploidy
+)
+
+# HDI function (most accurate)
+CAMDAC:::vec_HDIofMCMC_mt(
+ bulk_methylated_count,
+ bulk_unmethylated_count,
+ normal_methylated_count,
+ normal_unmethylated_count,
+ purity,
+ ploidy,
+ credMass=0.99
+)
The germline sample is optional because, in the absence of patient-matched methylation data, you may already have an allele-specific CNA solution for your bulk tumor. For example, this could be derived from bulk WGS of the same sample.
+You can provide this data in a tab-delimited text file as shown below. Importantly:
| chrom | start | end | major_cn | minor_cn | purity | ploidy |
|---|---|---|---|---|---|---|
| chr1 | 1 | 400 | 2 | 1 | 0.67 | 3.5 |
| chr1 | 401 | 1000 | 1 | 1 | 0.67 | 3.5 |
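As a minimal sketch, a file in this layout could be written from base R as follows. The segment values simply mirror the table above, and the file name `example.cna.txt` is a hypothetical choice rather than a name CAMDAC requires.

```r
# Hypothetical allele-specific CNA segments, mirroring the table above
cna <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(1, 401),
  end = c(400, 1000),
  major_cn = c(2, 1),
  minor_cn = c(1, 1),
  purity = 0.67,
  ploidy = 3.5
)

# Write a tab-delimited file with a header, without quotes or row names
cna_path <- file.path(tempdir(), "example.cna.txt")
write.table(cna, cna_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm the column layout
cna_check <- read.delim(cna_path)
```

Purity and ploidy are sample-level values, so they are repeated on every segment row, as in the table.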
To run CAMDAC with this CNA solution, attach the file to the tumor CamSample() object:
+library(CAMDAC)
+
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+cna_file <- system.file("testdata", "test.cna.txt", package = "CAMDAC")
+
+# Set config
+config <- CamConfig(outdir="./results", bsseq="wgbs", lib="pe", build="hg38", n_cores=10)
+
+# Create tumor object and attach CNA solution
+tumor <- CamSample(id="T", sex="XY", bam=b_tumor)
+attach_output(tumor, config, "cna", cna_file)
+
+# Define normal object(s) for deconvolution or differential methylation
+normal <- CamSample(id="N", sex="XY", bam=b_normal)
+
+# Run pipeline with CNA solution
+pipeline(
+ tumor=tumor,
+ germline=NULL,
+ infiltrates=normal,
+ origin=normal,
+ config=config
+)
If no SNP file is present for the germline, CAMDAC will infer the copy number calls from the tumor sample alone. Here, the BAF is calculated by a threshold on the tumor BAF, and the LogR is calculated by taking the coverage relative to the median. These results are not as accurate as using a germline normal sample.
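The tumor-only idea described above can be illustrated with a short base R sketch. This is not CAMDAC's implementation: the counts are made up, and the BAF cut-offs of 0.1 and 0.9 are assumptions chosen purely for illustration.

```r
# Hypothetical per-SNP read counts from a tumour sample
alt_counts <- c(12, 0, 30, 14, 1)
total_counts <- c(30, 25, 31, 30, 28)

# Tumour B-allele frequency per SNP
baf <- alt_counts / total_counts

# Assumed cut-offs: treat SNPs with a BAF well inside (0, 1) as heterozygous
is_het <- baf > 0.1 & baf < 0.9

# LogR as coverage relative to the sample median, on a log2 scale
logr <- log2(total_counts / median(total_counts))
```

Because the tumor BAF is shifted by copy number changes and purity, any fixed threshold will misclassify some SNPs, which is one reason a germline normal gives more accurate results.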
+You may already know where heterozygous SNPs lie for your sample, obviating the need for a tumor BAF threshold. In addition, you may have a proxy of the normal coverage for your platform, which is an improvement over taking the tumor median. You can provide this information by attaching a SNPs file to the germline CamSample object. The file should contain:
| Field | Description |
|---|---|
| chrom | Chromosome name |
| POS | Position of SNP |
| BAF | (optional) B-allele frequency at this SNP |
| total_counts | (optional) Total number of reads at this SNP |
POS and total_counts are used to derive the BAF and the LogR respectively. We strongly recommend that total_counts is derived from a normal sample sequenced with the same bisulfite-sequencing assay as the tumor; samples from unmatched patients are acceptable.
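A minimal sketch of building such a file from per-SNP reference and alternate counts is shown below. The input counts are hypothetical, and the gzipped comma-separated format is an assumption based on the `.csv.gz` extension of the test file used in this section.

```r
# Hypothetical heterozygous SNP counts from a proxy normal sample
counts <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  POS = c(10500, 20750, 30100),
  ref_counts = c(14, 9, 20),
  alt_counts = c(16, 11, 20)
)

# Derive the optional fields described in the table above
snps <- data.frame(
  chrom = counts$chrom,
  POS = counts$POS,
  BAF = counts$alt_counts / (counts$ref_counts + counts$alt_counts),
  total_counts = counts$ref_counts + counts$alt_counts
)

# Write a gzip-compressed CSV (format assumed from the file extension)
snps_path <- file.path(tempdir(), "example.snps.csv.gz")
con <- gzfile(snps_path, "w")
write.csv(snps, con, row.names = FALSE)
close(con)

# Read it back to confirm the columns
snps_check <- read.csv(gzfile(snps_path))
```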
+CAMDAC may be run to the copy number calling stage using the external heterozygous SNP file:
+
+library(CAMDAC)
+
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+snps_file <- system.file("testdata", "test.to.norm_pos.csv.gz", package = "CAMDAC")
+
+# Set config
+config <- CamConfig(outdir="./results", bsseq="wgbs", lib="pe", build="hg38", n_cores=10)
+
+# Create tumor object
+tumor <- CamSample(id="T", sex="XY", bam=b_tumor)
+
+# Define germline object and attach the external SNPs file
+germline <- CamSample(id="G", sex="XY")
+attach_output(germline, config, "snps", snps_file)
+
+# Run pipeline up to the copy number calling stage
+pipeline(
+ tumor=tumor,
+ germline=germline,
+ infiltrates=NULL,
+ origin=NULL,
+ config=config
+)
After this, we recommend inspecting the CNA results. If all is well, the pipeline() function can be repeated with the infiltrates and origin CamSamples to complete deconvolution and differential methylation respectively.
+CAMDAC can be used to detect allele-specific methylation (ASM) by phasing CpGs to heterozygous SNPs and deconvolving bulk methylation rates per allele.
+This tutorial steps through the ASM analysis pipeline (WGBS only):
+Results from this pipeline are found in the results directory under ‘PATIENT/AlleleSpecific’ and ‘PATIENT/Methylation’. See output file headings below for files and their content.
+The asm_pipeline() function runs CAMDAC-ASM analysis by generating the allele-specific copy number solution and heterozygous SNP loci, followed by deconvolution and differential ASM analysis:
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+regions <- system.file("testdata", "test_wgbs_segments.bed", package = "CAMDAC") # speed up tests
+
+tumor <- CamSample(id = "T", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "N", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./results", ref = "./pipeline_files", bsseq = "wgbs", lib = "pe", cores = 10,
+ min_cov = 1, # For test data
+ regions = regions
+)
+
+asm_pipeline(
+ tumor = tumor,
+ germline = normal,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
To run the ASM pipeline without BAM files, CAMDAC requires:
- Each CamSample object has SNP loci
- The tumor CamSample object has an allele-specific CNA solution
- All CamSample objects have BAM files available for phasing
+CAMDAC-ASM requires a file of heterozygous SNP loci against which CpGs will be phased. This is a tab-delimited file with a header containing four fields:
| Field | Description |
|---|---|
| chrom | Chromosome name |
| pos | SNP loci position |
| ref | The reference allele (A/C/T/G) |
| alt | The alternate SNP allele (A/C/T/G) |
First, attach your SNP loci file to the tumor object with attach_output(), then run asm_pipeline():
+# Setup CAMDAC samples
+tumor <- CamSample(id = "tumor", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "normal", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./results", ref = "./pipeline_files", bsseq = "wgbs", lib = "pe", cores = 10,
+ min_cov = 1, # For test data
+ regions = regions
+) # For rapid testing
+
+# Add SNPs
+asm_snps_file <- system.file("testdata", "test_het_snps.tsv", package = "CAMDAC")
+attach_output(tumor, config, "asm_snps", asm_snps_file)
+attach_output(normal, config, "asm_snps", asm_snps_file)
Next, CAMDAC requires the allele-specific copy number solution from the tumor, attached as follows:
+
+cna_file <- system.file("testdata", "test_cna.tsv", package = "CAMDAC")
+attach_output(tumor, config, "cna", cna_file)
Finally, run the allele-specific methylation pipeline:
+
+asm_pipeline(
+ tumor = tumor,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
If you have already run the CAMDAC pipeline in tumor-normal mode, then the germline object’s SNP files will be used by default. The simplest run from BAM to ASM is shown below using matched normals for infiltrates and DMPs:
+
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+regions <- system.file("testdata", "test_wgbs_segments.bed", package = "CAMDAC") # speed up tests
+
+tumor <- CamSample(id = "T", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "N", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./test_results", bsseq = "wgbs", lib = "pe",
+ build = "hg38", n_cores = 10,
+ regions = regions,
+ min_cov = 1, # For test data
+ cna_caller = "ascat" # Battenberg always recommended, however ASCAT used here for rapid testing.
+)
+
+# Run the main CAMDAC pipeline to generate SNP files for ASM
+# Deconvolution skipped here for simplicity.
+pipeline(tumor, germline = normal, infiltrates = NULL, origin = NULL, config)
+
+# Run ASM pipeline
+asm_pipeline(
+ tumor = tumor,
+ germline = normal,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
** Allele-specific/ **
+vignettes("pipeline").
** Methylation/ **
+This feature is currently described for CAMDAC-WGBS only.
+CAMDAC supports the use of multiple DNA methylation BAM files as a source of the normal infiltrates or normal cell of origin.
+To create a panel, process your BAM files with the CAMDAC allele counter:
+library(CAMDAC)
+
+# Get BAM files
+b_normal1 = system.file("testdata", "normal.bam", package = "CAMDAC")
+b_normal2 = system.file("testdata", "normal.bam", package = "CAMDAC")
+b_normal3 = system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Run allele counter
+for(file in c(b_normal1, b_normal2, b_normal3)){
+ prefix = fs::path_ext_remove(file)
+ outfile = paste0(prefix, ".all.SNPs.CG.csv.gz")
+ data = cmain_count_alleles(file) # Use the loop variable; bam_file was undefined
+ data.table::fwrite(data, outfile)
+}
+The allele counts files can then be merged into a single file for the panel containing methylation data for deconvolution:
+
+panel_counts <- fs::dir_ls(".", glob="*.SNPs.CG.csv.gz")
+panel <- panel_meth_from_counts(panel_counts)
+data.table::fwrite(panel, "panel.m.csv.gz")
By default, panel counts are merged by summing the methylation read counts for each CpG site. You can customise the proportion of each sample that is used in the panel by specifying the ac_props argument in panel_meth_from_counts. To get the mean across each CpG site, simply pass equal proportions for each sample.
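The difference between summing counts and averaging rates can be seen in a small base R sketch. This only illustrates the arithmetic: the counts are hypothetical, and how `ac_props` is applied inside `panel_meth_from_counts` is not shown here.

```r
# Hypothetical methylated (M) and unmethylated (UM) counts:
# rows are CpG sites, columns are panel samples
M <- cbind(s1 = c(8, 1), s2 = c(40, 10), s3 = c(2, 5))
UM <- cbind(s1 = c(2, 9), s2 = c(10, 10), s3 = c(8, 5))

# Summing counts: deeply sequenced samples dominate the panel rate
m_sum <- rowSums(M) / rowSums(M + UM)

# Equal proportions: each sample's methylation rate contributes equally
rates <- M / (M + UM)
props <- rep(1 / ncol(rates), ncol(rates))
m_mean <- as.vector(rates %*% props)
```

At the first CpG the summed rate is 50/70 (about 0.71) because sample `s2` has five times the coverage of the others, whereas the equal-proportion mean of 0.8, 0.8 and 0.2 is 0.6.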
To run CAMDAC with your newly created panel, attach your panel to a CamSample object using the meth argument.
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Setup CAMDAC samples
+tumor <- CamSample(id="tumor", sex="XY", bam=b_tumor)
+normal <- CamSample(id="normal", sex="XY", bam=b_normal)
+config <- CamConfig(outdir="./results", ref="./pipeline_files", bsseq="wgbs", lib="pe", cores=10)
+
+# Setup panel sample
+panel <- CamSample(id="panel", sex="XY")
+panel_file <- system.file("testdata", "test_panel.m.csv.gz", package = "CAMDAC")
+attach_output(panel, config, "meth", panel_file)
+
+# Run CAMDAC with panel
+pipeline(
+ tumor=tumor,
+ germline=normal,
+ infiltrates=panel,
+ origin=panel,
+ config=config
+)
If you have not started from BAM files, you can create a panel using a matrix of beta values:
| sample1 | sample2 | sample3 |
|---|---|---|
| 0.5 | 0.6 | 0.7 |
| 0.4 | 0.5 | 0.6 |
Additionally, a data frame specifying the positions of each CpG site in the beta value matrix is required. Here, start and end refer to the C and G of the CpG site respectively:
| chrom | start | end |
|---|---|---|
| chr1 | 100 | 101 |
| chr1 | 200 | 201 |
The matrix and CpG locations can be passed directly to the panel_meth_from_beta() function, along with settings.
+# Load beta values and chromosome positions
+ex <- system.file("testdata", "test_panel_from_beta.csv", package = "CAMDAC")
+data <- data.table::fread(ex)
+mat = data[, 4:ncol(data)] # Beta value matrix with 3 samples
+
+# Create panel from beta values
+panel_beta <- panel_meth_from_beta(
+ mat = mat,
+ chrom = data$chrom,
+ start = data$start,
+ end = data$end,
+ cov = 100,
+ props = c(0.1, 0.8, 0.1), # Proportions of each sample in panel
+ min_samples = 1,
+ max_sd = 1
+)
As CAMDAC requires coverage at each CpG site to estimate uncertainty, the cov value is given to all CpG sites when building a panel from beta values. Additionally, if any beta values are missing from a sample, proportions are recalculated among the remaining samples as this is the only information available to build the panel for that site.
There are two experimental arguments that can be set to filter CpG sites from the panel:
+min_samples: The minimum number of samples that have to have a beta value for a CpG to be included in the panel. The idea here is if you have sparse data, you can skip sites where you aren’t confident in the panel. Set this to 1 to use any sample.
max_sd: Maximum standard deviation of beta values across samples a CpG must have to be included in the panel. The idea here is that when combining many bulk methylomes from the same tissue, sites with high variability reflect sample-specific differences and their averages are less reliable for use in a methylation panel.
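The behaviour described above (proportions recalculated when beta values are missing, followed by the two filters) can be sketched in base R. The function `panel_from_beta_sketch` is a hypothetical illustration and not the code behind `panel_meth_from_beta()`; in particular, keeping sites whose standard deviation is undefined (only one sample present) is an assumption.

```r
panel_from_beta_sketch <- function(mat, props, min_samples = 1, max_sd = 1) {
  mat <- as.matrix(mat)
  present <- !is.na(mat)

  # Zero the weight of missing samples, then rescale each row to sum to 1
  w <- sweep(present, 2, props, `*`)
  w <- w / rowSums(w)

  # Weighted mean of the available beta values per CpG
  vals <- mat
  vals[!present] <- 0
  beta <- rowSums(vals * w)

  # Filters: number of samples with data, and spread across samples
  n_samples <- rowSums(present)
  sd_beta <- apply(mat, 1, sd, na.rm = TRUE)
  keep <- n_samples >= min_samples & (is.na(sd_beta) | sd_beta <= max_sd)

  data.frame(beta = beta, n_samples = n_samples, sd = sd_beta, keep = keep)
}

# Three CpGs, three samples, with some missing values
betas <- rbind(
  c(0.5, 0.6, 0.7),
  c(0.4, NA, 0.6),
  c(NA, NA, 0.9)
)
panel_sketch <- panel_from_beta_sketch(
  betas, props = c(0.1, 0.8, 0.1), min_samples = 2, max_sd = 1
)
```

In the second row the dominant sample is missing, so the remaining two samples are reweighted to 0.5 each and the panel value is 0.5. The third row has a single sample and is dropped by `min_samples = 2`.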
CAMDAC produces several output files that visualise the copy number state. DNA methylation rates can be passed to external packages for visualisation. For a quick view of DMRs in R:
+
+library(data.table)
+library(ggplot2)
+library(CAMDAC)
+
+# Show DMPs around a region
+dmr <- data.table(dmr) # Object from CAMDAC output *annotated_DMRs.fst
+dmp <- data.table(dmp) # Object from CAMDAC *results_per_CpG.fst
+chrome <- dmr[1, ]$chrom
+starte <- dmr[1, ]$start
+ende <- dmr[1, ]$end
+offset <- 1000 # Offset 1 kb either side of region
+dm_regions <- dmp[chrom == as.character(chrome) & start >= (starte - offset) & end <= (ende + offset), ]
+
+# Plot pure tumour (m_t) and normal (m_n) methylation rates per CpG with ggplot
+tplt <- ggplot(dm_regions, aes(x = start)) +
+ geom_point(aes(y = m_t), color = "skyblue") +
+ geom_point(aes(y = m_n), color = "grey") +
+ geom_vline(aes(xintercept = start, color = DMP_t)) +
+ theme_classic() +
+ scale_color_manual(values = c("skyblue", "blue")) +
+ scale_y_continuous(limits = c(0, 1)) +
+ geom_vline(xintercept = c(starte, ende), color = "red", linetype = "dashed") +
+ labs(x = dm_regions$chrom[[1]])
+tplt
CAMDAC DMR Visualization
+Here, light blue dots are the pure tumour, while light grey dots are the normal. The red dashed lines mark the DMR boundaries, and the solid vertical lines are hypomethylated DMPs (blue) and hypermethylated DMPs (light blue).
introduction.Rmd
Solid tumours typically contain both cancer and admixed normal contaminating cells, which confounds the analysis of bulk cancer methylomes from bisulfite sequencing. To address these issues we present CAMDAC, a tool for Copy-number Aware Methylation Deconvolution Analysis of Cancer.
+In brief, we show that the bulk tumour methylation rate (\(m_b\)) can be expressed as a weighted sum of the methylation rates of the tumour cells and normal contaminants, accounting for tumour purity and copy number (Figure 1). We derive purity and copy number estimates directly from bulk tumour RRBS data, leveraging somatic copy number aberration calls from ASCAT or Battenberg. We use bulk tissue- and sex-matched normal samples as proxy for the normal tumour-infiltrating cells (\(m_{n,i}\)), and obtain \(m_b\) from the bulk tumour data itself. This provides all the necessary information to extract the pure tumour methylation rate (\(m_t\)).
+
Figure 1. CAMDAC principles and key variables. Adapted from Larose Cadieux et al., 2020.
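The weighted-sum relationship described above can be sketched in base R. The helper names `mix_sketch` and `deconvolve_sketch` are hypothetical and are not part of the CAMDAC API; the sketch assumes each population is weighted by its cell fraction multiplied by its copy number, with a diploid normal (`n_n = 2`).

```r
# Bulk rate as a copy-number-weighted mixture of tumour and normal rates
mix_sketch <- function(m_t, m_n, purity, n_t, n_n = 2) {
  tumour_weight <- purity * n_t
  normal_weight <- (1 - purity) * n_n
  (m_t * tumour_weight + m_n * normal_weight) / (tumour_weight + normal_weight)
}

# Solve the same relationship for the pure tumour rate
deconvolve_sketch <- function(m_b, m_n, purity, n_t, n_n = 2) {
  tumour_weight <- purity * n_t
  normal_weight <- (1 - purity) * n_n
  (m_b * (tumour_weight + normal_weight) - m_n * normal_weight) / tumour_weight
}

# Round trip on made-up values: mixing then deconvolving recovers m_t
m_t_true <- c(0.1, 0.9, 0.5)
m_n <- c(0.5, 0.5, 0.2)
n_t <- c(2, 3, 1)
m_b <- mix_sketch(m_t_true, m_n, purity = 0.8, n_t = n_t)
m_t_est <- deconvolve_sketch(m_b, m_n, purity = 0.8, n_t = n_t)
```

With real data, sampling noise in the bulk and normal rates can push the estimate slightly below 0 or above 1, which is why deconvolved rates are usually clipped to the [0, 1] range.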
+
In Larose Cadieux et al., 2020, we obtained bulk tumour RRBS data from surgically resected lung cancers and patient-matched tumour-adjacent normal lung samples. Normal samples may be used for copy number profiling, as a proxy for the normal tumour-infiltrating cells (\(m_{n,i}\)), and as a proxy for the tumour cell of origin (\(m_{n,o}\)). Here, \(m_{n,i}\) is needed for bulk tumour methylation rate deconvolution and \(m_{n,o}\) is required for differential methylation analyses (Figure 2). In non-small cell lung cancer, we demonstrate that patient-matched tumour-adjacent normal is a suitable proxy for all normals, i.e. \(m_{n,i} \approx m_{n,o}\) (Larose Cadieux et al., 2020).

Figure 2. Key input and output data for CAMDAC
+
If the patient-matched tumour-adjacent normal tissue is not available, a tissue- and sex-matched normal may provide a substitute for the tumour-infiltrating normal cells (Figure 2). If the tissue-matched normal is a poor representative of the cell of origin, a different proxy may be used for differential methylation analysis.
The purified tumour methylation rates allow for accurate differential methylation analysis, both between tumour and normal cells and, in the case of multi-region sequencing, between different tumour samples. The deconvoluted methylation profiles accurately inform inter- and intra-tumour sample relationships and could enable the timing of copy number gains and (epi)mutations in tumour evolution. This is explained in more detail in Larose Cadieux et al., 2020.
+At the time of writing, CAMDAC is compatible with human MspI-digested single-end directional reduced representation bisulfite sequencing (RRBS) data and whole genome bisulfite sequencing (WGBS) data. The input must be in binary alignment map (BAM) format. Bases should be quality and adapter trimmed and PCR duplicates should be removed. BAM files may be aligned to the hg19, hg38, GRCh37 or GRCh38 human reference genome builds.
output.Rmd
The CAMDAC pipeline returns a structured directory at the outdir from the CamConfig() object. The pipeline returns files unique to the RRBS and WGBS modules with the general structure:
└── <CamSample.patient_id>
+ ├── Allelecounts
+ │ ├── <CamSample.id>
+ ├── Copynumber
+ │ ├── <CamSample.id>
+ └── Methylation
+ └── <CamSample.id>
+The sections below describe each results file in more detail. Next, see vignette("questions") for frequently asked questions or vignette("experimental") for details on experimental CAMDAC features.
results/
+└── P
+ ├── Allelecounts
+ │ ├── N
+ │ │ └── P.N.SNPs.CpGs.all.sorted.RData
+ │ └── T
+ │ └── P.T.SNPs.CpGs.all.sorted.RData
+ ├── Copy_number
+ │ ├── N
+ │ │ ├── fragment_length_histogram.pdf
+ │ │ ├── msp1_fragments_RRBS.RData
+ │ │ ├── P_N_normal_SNP_data.pdf
+ │ │ ├── P.N.SNPs.RData
+ │ │ └── Rplots.pdf
+ │ └── T
+ │ ├── fragment_length_histogram.pdf
+ │ ├── msp1_fragments_RRBS.RData
+ │ ├── P_T_SNP_data.pdf
+ │ ├── P.T.ACF.and.ploidy.txt
+ │ ├── P.T.ascat.bc.RData
+ │ ├── P.T.ascat.frag.RData
+ │ ├── P.T.ascat.output.RData
+ │ ├── P.T.ASCATprofile.png
+ │ ├── P.T.ASPCF.png
+ │ ├── P.T.BAF.PCFed.txt
+ │ ├── P.T.germline.png
+ │ ├── P.T.LogR.PCFed.txt
+ │ ├── P.T.rawprofile.png
+ │ ├── P.T.SNPs.RData
+ │ ├── P.T.sunrise.png
+ │ ├── P.T.tumour.png
+ │ └── Rplots.pdf
+ └── Methylation
+ ├── N
+ │ ├── dt_normal_m.RData
+ │ └── P_N_methylation_rate_summary.pdf
+ └── T
+ ├── CAMDAC_DMPs.bed
+ ├── CAMDAC_purified_tumour.bed
+ ├── CAMDAC_results_per_CpG.RData
+ ├── P_T_DMP_stats.txt
+ ├── P_T_methylation_rate_summary.pdf
+ ├── purified_tumour.RData
+ └── tumour_versus_normal_methylomes.pdf
| File | Description |
|---|---|
| P.T.SNPs.CpGs.all.sorted.RData | Allele counts for a sample. Generated by processing BAM file |
| P.T.ascat.output.RData | ASCAT copy number results |
| P.T.ASCATprofile.png | ASCAT copy number profile |
| dt_normal_m.RData | Bulk normal DNA methylation data |
| purified_tumour.RData | CAMDAC-purified DNA methylation rates |
| CAMDAC_results_per_CpG.fst | CAMDAC deconvolution and differential methylation results |
CAMDAC outputs are written in the directory given by config$outdir in the format PATIENT/DATASET/SAMPLE/:
└── P
+ ├── Allelecounts
+ │ ├── N
+ │ │ └── P.N.SNPs.CpGs.all.sorted.csv.gz
+ │ └── T
+ │ └── P.T.SNPs.CpGs.all.sorted.csv.gz
+ ├── Copynumber
+ │ ├── N
+ │ │ └── P.N.SNPs.csv.gz
+ │ └── T
+ │ ├── ascat
+ │ ├── battenberg
+ │ ├── P.T.cna.txt
+ │ ├── P.T.SNPs.csv.gz
+ │ └── P.T.tnSNP.csv.gz
+ └── Methylation
+ ├── N
+ │ └── P.N.m.csv.gz
+ └── T
+ ├── P.T.CAMDAC_annotated_DMRs.fst
+ ├── P.T.CAMDAC_results_per_CpG.fst
+ ├── P.T.m.csv.gz
+ └── P.T.pure.csv.gz
| File | Description |
|---|---|
| P.T.SNPs.CpGs.all.sorted.csv.gz | Allele counts for a sample. Generated by processing BAM file |
| P.T.SNPs.csv.gz | SNP counts for a sample. |
| P.T.cna.txt | CAMDAC CNA result |
| P.T.m.csv.gz | Bulk methylation data |
| P.T.m.pure.csv.gz | CAMDAC-deconvolved methylation data |
| P.T.CAMDAC_results_per_CpG.fst | CAMDAC differentially methylated cytosines |
| P.T.CAMDAC_annotated_DMRs.fst | CAMDAC differentially methylated regions |
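Given the PATIENT/DATASET/SAMPLE/ layout, the expected location of an output file can be assembled with `file.path()`. The helper `camdac_path_sketch` below is hypothetical and only encodes the naming pattern visible in the directory tree above; it is not a CAMDAC function.

```r
# Build <outdir>/<patient>/<dataset>/<sample>/<patient>.<sample>.<suffix>
camdac_path_sketch <- function(outdir, patient, dataset, sample, suffix) {
  file_name <- paste(patient, sample, suffix, sep = ".")
  file.path(outdir, patient, dataset, sample, file_name)
}

# Bulk methylation file for tumour sample T of patient P
bulk_path <- camdac_path_sketch("./results", "P", "Methylation", "T", "m.csv.gz")

# Load the file only if it exists and data.table is installed
if (file.exists(bulk_path) && requireNamespace("data.table", quietly = TRUE)) {
  bulk_meth <- data.table::fread(bulk_path)
}
```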
It is possible to manually override outputs for runs. See vignette("questions") for more details.
pipeline.Rmd
The entry-point to CAMDAC is the pipeline() function which expects a CamConfig() object and four CamSample() objects representing:
The same normal sample may be passed repeatedly for the germline, infiltrates or origin, depending on your experimental design. See ?pipeline for more details.
+library(CAMDAC)
+
+# Path to BAM files
+tumor_bam <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+normal_bam <- system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Select samples for basic tumor-normal analysis
+tumor <- CamSample(id = "T", sex = "XY", bam = tumor_bam)
+normal <- CamSample(id = "N", sex = "XY", bam = normal_bam)
+
+# Configure pipeline
+config <- CamConfig(
+ outdir = "./results", bsseq = "rrbs", lib = "pe",
+ build = "hg38", refs = "./refs", n_cores = 1, cna_caller = 'ascat'
+)
+
+# Run CAMDAC
+CAMDAC::pipeline(
+ tumor, germline = normal, infiltrates = normal, origin = normal, config
+)
Next, see vignette("output") for a detailed summary of CAMDAC results files.
questions.Rmd
Ideally, CAMDAC is run with a matched normal sample from which to derive heterozygous germline SNPs for copy number estimation. In the absence of matched normals, a panel of sex- and tissue-matched normal samples may be used by averaging DNA methylation rates from multiple patients. See vignette("experimental") for more information.
Please raise an issue on GitHub to request files for a new reference genome.
+If you call pipeline without a normal infiltrate or cell of origin, the pipeline skips deconvolution and differential methylation respectively. This may be useful to run a quick first pass to find and refit copy number solutions. When CAMDAC has found a solution and is rerun with the same tumor, config, and normal, supplying the infiltrates and origin arguments will continue the pipeline where it left off. The entire pipeline can be re-run by deleting the output directory or setting overwrite=TRUE in the CamConfig.
The simplest way is to call pipeline with overwrite=FALSE in your config, giving the right normal sample for your step. Additionally, your CamConfig must use the same output directory.
If, for any reason, you have changed the output directory structure from a previous run, you can initiate CAMDAC by manually passing outputs to CamSample objects. See vignette("output") for more information.
Finally, you can run the cmain_* functions used by pipeline() directly. For example, to run the deconvolution step, you can call cmain_deconvolve_methylation().
If you want to use an external purity and ploidy solution, simply pass a CNA file that has only the purity and ploidy fields. Additionally, set refit=TRUE in the CamConfig and CAMDAC will use this to refit the sample.
To analyse specific genomic regions, you may pass a BED file to CAMDAC config:
+
+CamConfig(outdir=".", ref="./pipeline_files", regions="regions.bed")
CAMDAC will merge any overlapping regions prior to analysis.
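To see what merging overlapping regions means in practice, here is a minimal base R sketch. The function `merge_regions_sketch` is hypothetical and CAMDAC's own implementation may differ; treating regions that merely touch (`start` equal to the previous `end`) as overlapping is an assumption.

```r
merge_regions_sketch <- function(regions) {
  # Sort so that overlapping regions become adjacent rows
  regions <- regions[order(regions$chrom, regions$start), ]
  merged <- list()
  for (i in seq_len(nrow(regions))) {
    cur <- regions[i, ]
    last <- length(merged)
    overlaps <- last > 0 &&
      merged[[last]]$chrom == cur$chrom &&
      cur$start <= merged[[last]]$end
    if (overlaps) {
      # Extend the previous region instead of starting a new one
      merged[[last]]$end <- max(merged[[last]]$end, cur$end)
    } else {
      merged[[last + 1]] <- cur
    }
  }
  out <- do.call(rbind, merged)
  rownames(out) <- NULL
  out
}

# Hypothetical BED-like regions with one overlap on chr1
regions <- data.frame(
  chrom = c("chr1", "chr2", "chr1", "chr1"),
  start = c(150, 100, 100, 400),
  end = c(300, 200, 200, 500)
)
merged_regions <- merge_regions_sketch(regions)
```

Here chr1:100-200 and chr1:150-300 collapse into chr1:100-300, while the other two regions are left unchanged.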
+If you have outputs from a previous run, you can manually assign them to a CAMDAC object. This overwrites the expected path for that output type, allowing the pipeline to run with this data instead of computing it. Use the attach_output function, passing one of the following output types:
- counts: CAMDAC allele counts *.SNPs.CpGs.all.sorted.csv.gz file
- snps: CAMDAC sample SNP counts *.SNPs.csv.gz file
- meth: CAMDAC bulk methylation *.m.csv.gz file
- cna: CAMDAC CNA *.cna.txt file
- pure: CAMDAC deconvolved methylation *.m.pure.csv.gz file
For example, to attach a previous counts file to a CAMDAC object:
+
+library(CAMDAC)
+tumor <- CamSample(id = "T", sex = "XY", bam = NULL)
+config <- CamConfig(outdir = tempdir(), build="hg38", bsseq="wgbs", lib="pe")
+counts_file <- system.file("testdata", "test.SNPs.CpGs.all.sorted.csv.gz", package = "CAMDAC")
+tumor <- attach_output(tumor, config, "counts", counts_file)
The CAMDAC pipeline can now access the file in the expected location at config$outdir.
setup.Rmd
From the R console, install CAMDAC from GitHub:
+
+install.packages("remotes")
+remotes::install_github("VanLoo-lab/CAMDAC")
CAMDAC requires custom annotation files for RRBS and WGBS analysis, available at the Zenodo repository: (10565423). An R convenience function is provided to download these files:
+
+CAMDAC::download_pipeline_files(bsseq = "rrbs", directory = "./refs")
+CAMDAC::download_pipeline_files(bsseq = "wgbs", directory = "./refs")
Now, you’re ready to run CAMDAC! Next, see vignette("pipeline").
CAMDAC searches for pipeline files in the following order:
+CamConfig())
We recommend that you set the environment variable CAMDAC_PIPELINE_FILES to the directory where you downloaded the files. This will allow CAMDAC to find the files automatically whenever you load R.
From a Unix terminal:
+echo "CAMDAC_PIPELINE_FILES=$(realpath R)" >> ~/.Renviron
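Since `~/.Renviron` is read when R starts, the variable is only visible in a new R session. A small sketch to check that it is set and points at an existing directory is shown below; the helper name `find_pipeline_files_sketch` is hypothetical and is not part of CAMDAC.

```r
find_pipeline_files_sketch <- function() {
  refs <- Sys.getenv("CAMDAC_PIPELINE_FILES", unset = NA)
  # Unset or empty variable: nothing to report
  if (is.na(refs) || !nzchar(refs)) {
    return(NA_character_)
  }
  # Variable set, but the directory is missing
  if (!dir.exists(refs)) {
    warning("CAMDAC_PIPELINE_FILES points to a missing directory: ", refs)
    return(NA_character_)
  }
  normalizePath(refs)
}

refs_dir <- find_pipeline_files_sketch()
if (is.na(refs_dir)) {
  message("CAMDAC_PIPELINE_FILES is not set to a valid directory in this session")
}
```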
+
CAMDAC-RRBS
+CAMDAC WGBS
+java: To run CAMDAC on WGBS data, we leverage Battenberg which requires the java command-line utility. Download Java from https://openjdk.org/.
technical.Rmd
In this section, we provide a high-level summary of the CAMDAC pipeline, which covers six key steps:
For a full outline and validation of CAMDAC, please see Larose Cadieux et al. (2020) bioRxiv.
+Take a hypothetical female patient with primary tumour sample ID “T1” and normal-adjacent sample ID “N1”. First, CAMDAC takes the sequencing alignment files from each sample via the CamSample() function. Users should provide the full path and file name of the RRBS or WGBS binary alignment map (.bam) file for each input sample, and use the CamConfig() object to indicate whether they are aligned to hg19, hg38, GRCh37 or GRCh38. Bases should be quality and adapter trimmed and PCR duplicates should be removed. Please ensure that the BAM file is sorted and indexed.
CAMDAC employs an allele counter module to count SNP and CpG (methylation) alleles for downstream analysis. SNP counts are performed at 1000 Genomes SNP positions, and CpG alleles are counted using dinucleotides. To speed up the computation, we leverage reference RRBS and WGBS genome files listing all genomic regions supported by the respective platforms.
+By default, the read mapping quality filter in CamConfig() is set to mq>=0. Mapping quality scores from bisulfite sequencing aligners may be biased against the alternate allele for reads with polymorphisms. Please review the mapping quality distribution of your data to determine if it is appropriate to increase this setting.
If the function is successful, a single output file with the suffix “SNPs.CpGs” is written. This file carries compiled SNP and methylation information with the following columns:
+
Figure. Formatted SNP and methylation information
+Each row is either a CG locus (and CCGG for RRBS) and/or a 1000g SNP position. These can be distinguished by the width column. While polymorphic CG/CCGG have the same width as their non-polymorphic counterpart, they are easily identified by looking at the POS, ref, alt and other SNP-informative columns.
+For each SNP locus, the 1000 Genomes genomic coordinate and the reference and alternate alleles are listed under the POS, ref and alt columns. The total_counts column is the sum of alt_counts and ref_counts, which includes all informative strand-specific allele counts. For example, at \(C>T\) SNPs, only the reverse strand distinguishes the (un)methylated reference allele from the alternate allele, so all forward read counts are excluded from the total_counts column but included in total_depth. The SNP type column is only added to the patient-matched normal, which is used to assign SNP genotypes as either Homozygous or Heterozygous based on internal B-allele frequency (BAF) cut-offs.
+M, UM, total_counts_m, and m are the counts methylated, counts unmethylated, the total counts (un)methylated and the methylation rate, respectively. Methylation rates are calculated per CG allele, meaning that at polymorphic CpGs, only the CG-forming allele counts are considered. CAMDAC methylation rates are therefore polymorphism-independent.
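The relationship between these columns can be written out in a few lines of base R. The counts are hypothetical, and returning `NA` for sites without any informative reads is an assumption made for this sketch rather than documented CAMDAC behaviour.

```r
# Hypothetical methylated (M) and unmethylated (UM) read counts per CpG
M <- c(3, 10, 0)
UM <- c(7, 10, 0)

# Total informative counts and methylation rate per CpG
total_counts_m <- M + UM
m <- ifelse(total_counts_m > 0, M / total_counts_m, NA_real_)
```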
+For CCGG loci found in RRBS, the CCGG column indicates the number of fragments with a 5’ end at this CCGG locus. This number may be 0 at polymorphic CCGG loci homozygous for the CCGG-destroying allele. Furthermore, for RRBS, MspI fragment boundaries are determined from the aligned reads and the MspI fragment size distribution is visualised for quality assessment in the file fragment_length_histogram.pdf. You should observe 3 distinct peaks in the fragment length distribution. This is characteristic of human RRBS libraries and originates from MspI-containing micro-satellite repeats of distinct lengths. The MspI fragment boundaries and their GC content are saved as an .RData object and used downstream in RRBS copy number profiling.
+
Figure. MspI fragment size distribution
+B-allele frequencies at heterozygous SNPs are leveraged to calculate pure tumour copy number aberrations using either ASCAT.m for RRBS or Battenberg.m for WGBS. These tools are inspired by ASCAT (Van Loo et al., 2010) and Battenberg (Nik-Zainal et al., 2012). If successful, CAMDAC writes copy number output to the “Copy_number” directory.
+A SNPs file lists the heterozygous SNPs selected for copy number analysis. Each row is a 1000g SNP position that meets the minimum coverage in the germline sample, set by the min_normal argument. The total_counts column is the total informative read counts. For example, at C\(>\)T SNPs, only the reverse strand distinguishes the unmethylated reference from the alternate allele, so forward read counts do not contribute to total_counts or to the B-allele frequency (BAF) calculation. rBAF is randomly assigned BAF or 1-BAF to remove biases against the alternate allele in downstream tumour copy number profiling. All read counts, however, contribute to total_depth, which is used for the LogR calculation, a measure of total coverage. Genotyping is performed and assignments are stored under type.
For the RRBS pipeline, we provide an experimental feature to visualise the magnitude of biases against alternate (B) alleles. The number of homozygous relative to heterozygous SNPs is depicted, and any biases in coverage against the latter can be evaluated. Because RRBS is biased towards CpG-rich genomic regions, a typical RRBS sample should show a high ratio of C\(>\)T SNPs. We note that C\(>\)T and A\(>\)G germline heterozygous SNPs will have roughly half the coverage of the other 4 types of SNPs.
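The relationship between the allele counts, BAF and rBAF can be sketched in a few lines of base R. This is an illustration under assumed inputs, not the CAMDAC implementation: the counts are hypothetical and the coin-flip rule for rBAF is one simple way to randomise the B allele.

```r
set.seed(1)  # rBAF is random, so fix the seed for reproducibility

# Hypothetical informative allele counts at three heterozygous SNPs
snps <- data.frame(
  ref_counts = c(12, 20, 9),
  alt_counts = c(8, 20, 21)
)
snps$total_counts <- snps$ref_counts + snps$alt_counts
snps$BAF <- snps$alt_counts / snps$total_counts

# Randomly report BAF or 1 - BAF at each SNP
flip <- runif(nrow(snps)) < 0.5
snps$rBAF <- ifelse(flip, 1 - snps$BAF, snps$BAF)

print(snps)
```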
+
Figure Normal SNP data QC
+In addition to the above-mentioned columns, we also adjust for biases in the tumour LogR. The LogR is a normalised measure of tumour coverage used by ASCAT.m and Battenberg.m for copy number profiling together with the BAF. The covariates used for LogR correction are:
+Next, the standard ASCAT or Battenberg outputs are generated. All files have the dot-separated patient and sample IDs as prefix. In addition, we plot the BAF and LogR. In the BAF profiles, heterozygous SNPs are highlighted in red. The BAF and LogR tracks are then segmented by the respective tools. The segmentation is analysed to determine the optimal tumour purity and ploidy solution via a grid search (see sunrise plot). Raw and rounded allele-specific copy number segments are provided as output png images.
+Finally, the purity, ploidy, number of heterozygous and homozygous 1000g SNP positions and median tumour and normal SNP depth are saved for each tumour sample. For RRBS, summary SNP data is plotted and saved as a pdf with filename "*_SNP_data.pdf*" and may help you troubleshoot your data.
+
Figure. Tumour SNP data summary
+As part of the allele counting step, CAMDAC calculates bulk DNA methylation rates for each input sample. For the patient- and tissue-matched normal sample “N1”, the methylation data columns have the suffix \(x = n\), since \(m_{n,i} \sim m_{n,o}\). Where \(m_{n,i} \neq m_{n,o}\), the suffix is set to \(x = n\_i\) for the normal infiltrates and \(x = n\_o\) for the normal cell of origin proxy sample. The uncertainty on \(m_{x}\) is computed as the 99% Highest Density Interval (HDI), whose lower and upper boundaries are stored under columns \(m_{x,low}\) and \(m_{x,high}\).
+
Figure. Normal methylation output.
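To make the HDI columns concrete, here is a minimal sketch of a 99% HDI for a single methylation rate. It assumes a uniform Beta(1, 1) prior on the rate, so that the posterior is Beta(M + 1, UM + 1); that prior, the helper name `beta_hdi` and the counts are assumptions of this example and may differ from what CAMDAC does internally.

```r
# Narrowest interval containing `cred_mass` of a Beta(shape1, shape2) density
beta_hdi <- function(shape1, shape2, cred_mass = 0.99) {
  width <- function(p_low) {
    qbeta(p_low + cred_mass, shape1, shape2) - qbeta(p_low, shape1, shape2)
  }
  # Search over the probability mass left in the lower tail
  p_low <- optimize(width, interval = c(0, 1 - cred_mass))$minimum
  c(low = qbeta(p_low, shape1, shape2),
    high = qbeta(p_low + cred_mass, shape1, shape2))
}

# Hypothetical counts: 18 methylated and 2 unmethylated reads.
# Assumption: uniform prior, giving a Beta(M + 1, UM + 1) posterior.
M <- 18
UM <- 2
hdi <- beta_hdi(M + 1, UM + 1)
print(hdi)
```

Lower coverage widens the interval, which is why the HDI bounds are useful for quality control alongside the point estimate.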
+In the normal sample methylation output directory, you will find a pdf with methylation data summary and QC (RRBS only). We expect DNA methylation rates to sit near 0 and 1. CAMDAC calculates DNA methylation rates in a polymorphism-independent manner, meaning that the CG-destroying allele at a heterozygous CpG does not contribute to its methylation rate. The minimum coverage threshold applied to CpG sites is based on the CpG allele read depth, so any heterozygous SNPs present at the CG location may be removed due to insufficient coverage.
+
Figure. Normal methylation rate QC.
+At this stage, CAMDAC has obtained methylation rates for both the normal infiltrates and bulk tumour, as well as tumour copy number and purity estimates. The DNA methylation profile of the normal-adjacent samples may be used as a proxy for the methylation rate of tumour-infiltrating normal cells (\(m_{n,i}\)). We have all the necessary information to obtain CAMDAC pure tumour methylation rates, \(m_t\).
+In the Methylation/ output directory, CpG copy number and purified tumour methylation data are written to output CSV files. Header fields include:
+CAMDAC-deconvoluted methylation rates can take any value between 0 and 1, while the range of bulk tumour methylation rates is driven by tumour DNA content. In the bulk tumour profiles, bi-allelic tumour-normal differentially methylated positions appear at intermediate methylation values, while after purification they form a peak near 0 or 1 for hypo- and hypermethylated positions, respectively.

Figure. Tumour versus normal methylation rates from before and after CAMDAC.
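The dependence of the bulk signal on tumour DNA content can be illustrated with a small forward simulation. The sketch assumes the bulk rate is a DNA-content-weighted average of the tumour and normal rates, with a diploid normal; `bulk_rate` is a hypothetical helper written for this example, not a CAMDAC function.

```r
# Expected bulk methylation rate for a mixture of tumour and normal cells.
# purity: tumour cell fraction; n_t: tumour copy number; n_n: normal copy number.
bulk_rate <- function(m_t, m_n, purity, n_t, n_n = 2) {
  tumour_dna <- purity * n_t
  normal_dna <- (1 - purity) * n_n
  (m_t * tumour_dna + m_n * normal_dna) / (tumour_dna + normal_dna)
}

# A locus fully methylated in the tumour (m_t = 1) and unmethylated in normal (m_n = 0)
purities <- c(0.2, 0.5, 0.8)
m_b <- bulk_rate(m_t = 1, m_n = 0, purity = purities, n_t = 2)
print(m_b)
```

With a diploid tumour, the bulk rate at such a locus simply tracks purity, so the same bi-allelic change appears at different intermediate values in samples of different purity.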
+For tumour-normal differential methylation analysis, CAMDAC expects a DNA methylation profile representing the tumour cell of origin (\(m_{n,o}\)). In this hypothetical example, we set the normal sample N1 as the cell of origin. Leveraging CAMDAC purified methylomes, we then obtain differentially methylated positions and regions.
+Differential DNA methylation is detected with a minimum tumour-normal methylation rate difference (effect size, where \(\delta\beta\) >= 0.2) and a probability threshold, representing the probability that the tumour and normal beta distributions do not overlap. Both variables are used for calling differentially methylated positions (DMPs).
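The two-criterion call can be sketched with a Monte Carlo comparison of two Beta distributions. This is a deliberate simplification: it treats both rates as Beta posteriors from raw counts under a uniform prior, whereas the distribution of a deconvolved tumour rate depends on bulk counts, normal counts, purity and copy number. The counts and the 0.99 probability cut-off are hypothetical; only the 0.2 effect size comes from the text above.

```r
set.seed(42)

# Posterior draws under a uniform prior (assumption), from hypothetical counts
n_draws <- 10000
tumour <- rbeta(n_draws, 45 + 1, 5 + 1)   # 45 methylated, 5 unmethylated reads
normal <- rbeta(n_draws, 5 + 1, 45 + 1)   # 5 methylated, 45 unmethylated reads

# Effect size: difference between the point estimates
effect_size <- 45 / 50 - 5 / 50

# Probability that the tumour rate exceeds the normal rate
prob_hyper <- mean(tumour > normal)

# Hypothetical probability cut-off; 0.2 is the effect size quoted above
prob_threshold <- 0.99
is_dmp <- abs(effect_size) >= 0.2 &&
  max(prob_hyper, 1 - prob_hyper) >= prob_threshold
print(is_dmp)
```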
+Next, CAMDAC builds on DMP calls to call DMRs. To identify differentially methylated regions (DMRs), we group CpGs into bins and look for clusters with at least 5 DMPs (min_DMP_counts_in_DMR=5), 4 of which must be consecutive (min_consec_DMP_in_DMR=4). After completion, this function generates a pure tumor methylation file (CAMDAC_results_per_CpG.RData for RRBS or pure.csv.gz for WGBS) in the CAMDAC methylation output directory. This R object is a combination of all CAMDAC results per CpG with DMP information included:
+The ratio of hyper- to hypomethylated DMRs varies across genomic regions and is reflected by the tumour-normal methylation rate difference.
+
Figure. DMR summary data.
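The DMR rule described above (at least 5 DMPs in a bin, 4 of which are consecutive) can be sketched with run-length encoding. `bin_is_dmr` and its argument names are hypothetical and only mirror the min_DMP_counts_in_DMR and min_consec_DMP_in_DMR settings; the sketch also ignores the direction of change (hyper- versus hypomethylation).

```r
# TRUE when a bin has at least `min_dmps` DMPs, with at least `min_consec`
# of them at consecutive CpGs. `is_dmp` must be ordered by genomic position.
bin_is_dmr <- function(is_dmp, min_dmps = 5, min_consec = 4) {
  runs <- rle(is_dmp)
  longest_run <- max(c(0, runs$lengths[runs$values]))
  sum(is_dmp) >= min_dmps && longest_run >= min_consec
}

bin_a <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)         # 5 DMPs, run of 4
bin_b <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)  # 5 DMPs, longest run 2

print(bin_is_dmr(bin_a))
print(bin_is_dmr(bin_b))
```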
+CAMDAC outputs will be stored at the user-defined project outdir variable given to the configuration (CamConfig()). A patient folder is created at this path with directory name set to patient_id. This will contain 3 subdirectories: Allelecounts, Copy_number and Methylation, with further sub-directories created for each of a given patient’s samples.
With CAMDAC differential methylation calls in hand, users may choose to look for recurrently aberrated loci across their cohort. Note that tumour-tumour DMPs can be easily identified by looking for overlap between the 99% HDIs for CAMDAC pure tumour methylation rates between samples (99% HDI \(\subseteq\) [m_t_low,m_t_high]).
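A comparison of HDIs between two samples reduces to an interval overlap test. The sketch below uses hypothetical m_t_low and m_t_high values; treating non-overlapping intervals as candidate tumour-tumour differences is an assumption of this example rather than a documented CAMDAC rule.

```r
# TRUE where two closed intervals [low1, high1] and [low2, high2] overlap
hdi_overlap <- function(low1, high1, low2, high2) {
  low1 <= high2 & low2 <= high1
}

# Hypothetical m_t_low / m_t_high values for two samples at three CpGs
s1 <- data.frame(m_t_low = c(0.05, 0.40, 0.80), m_t_high = c(0.20, 0.60, 0.98))
s2 <- data.frame(m_t_low = c(0.10, 0.75, 0.85), m_t_high = c(0.30, 0.95, 0.99))

overlaps <- hdi_overlap(s1$m_t_low, s1$m_t_high, s2$m_t_low, s2$m_t_high)

# Candidate tumour-tumour differences: CpGs whose intervals do not overlap
candidate_dmp <- !overlaps
print(candidate_dmp)
```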
+Clustering analyses can also easily be performed by the user using well-established R packages such as ‘pvclust’ for hierarchical clustering with bootstrap and ‘umap’ (uniform manifold approximation and projection) for non-linear dimensionality reduction. Clustering of pure tumour methylation rates at promoter DMRs across large cohorts by ‘umap’ may reveal histology and/or sex-driven clusters, as described in non-small cell lung cancer (Larose Cadieux et al., 2020).
+For multi-region data, sample tree reconstruction by neighbour joining, leveraging CAMDAC pure tumour methylation rates at DMPs hypermethylated in at least one sample and subset to loci confidently unmethylated in the normal cell of origin (m_n_high<0.2), can reveal inter-sample relationships, as demonstrated in non-small cell lung cancer (Larose Cadieux et al., 2020).
+When running gene-set enrichment analysis (GSEA) on CAMDAC DMR calls, gene sets should be limited to those genes with promoters covered by RRBS. It may be desirable to subset DMR calls to hypermethylated promoter-associated CpG Islands given that methylation at these loci is most correlated with expression.
+Users may leverage normal and deconvoluted tumour methylation rates, together with tumour-normal DMP calls, to separate clonal mono- and bi-allelic from subclonal bi-allelic methylation changes and shed light on tumour evolutionary histories (Larose Cadieux et al., 2020). The allele-specific CAMDAC module will be made available in future releases.
+pkgdown::build_articles(override=list(destination='docs/html'))
+pkgdown::build_site(examples=FALSE, devel=TRUE, lazy=TRUE, preview=FALSE, override=list(destination='docs/html'))Currently running in Docker. Usied
+# Start server
+colima start --cpu 4 --memory 13
+
+# Build docker image
+docker build -t camdac .
+docker buildx build --platform linux/amd64 -t nmensah5/camdac:latest .
+
+# Run and enter image interactive mode
+docker run -it -v "$(pwd):/app" camdac:latest bash
+docker run -it -v "$(pwd):/app" --entrypoint=/bin/bash 4de139ba6ced
+
+# Within the container, start R and load CAMDAC files
+R
+devtools::load_all()
+docker buildx build --platform linux/amd64 -t nmensah5/camdac:latest .
+docker run -it -v "$(pwd):/app" nmensah5/camdac-env:latest bash
experimental.Rmd
+This document describes experimental features of the CAMDAC package. These features are not yet fully tested and may change in future releases. The following features are currently under development for the WGBS pipeline only:
+The CAMDAC equation can be used to infer pure tumour DNA methylation rates, provided the following information is available per CpG:
+Here is an example for 5 CpGs from a single sample. Note: the normal copy number state is assumed diploid (2) in humans:
+
+
+# Set parameters
+bulk = c(0.3, 0.5, 0.2, 0.1, 0.9)
+normal = c(0.3, 0.9, 0.1, 0.7, 0.5)
+ploidy = c(2, 2, 1, 3, 4)
+purity = 0.8
+
+# Deconvolve methylation rates
+pure_meth = CAMDAC:::calculate_mt(bulk, normal, purity, ploidy)
+
+# Set clean rates based on threshold
+pure_meth_clean = dplyr::case_when(
+ pure_meth < 0 ~ 0,
+ pure_meth > 1 ~ 1,
+ TRUE ~ pure_meth
+)
+After deconvolution, it may be useful to estimate the CpG coverage in the deconvolved tumour sample. Additionally, the highest density interval (HDI) of the methylation rate may be informative for quality control. These metrics can be calculated given additional information on bulk methylated and unmethylated read counts:
+
+
+# Optional: calculate effective coverage of the tumour
+# # Requires coverage per CpG in the bulk sample
+bulk_coverage = c(10, 20, 5, 15, 30)
+pure_effective_coverage = CAMDAC:::calculate_mt_cov(bulk_coverage, purity, ploidy)
+
+# Optional: calculate the HDI of the pure tumour methylation rate
+bulk_methylated_count = c(3, 10, 1, 2, 27)
+bulk_unmethylated_count = c(7, 10, 4, 13, 3)
+normal_methylated_count = c(3, 9, 1, 5, 2)
+normal_unmethylated_count = c(7, 11, 3, 8, 3)
+
+# HDI function (fast)
+CAMDAC:::hdi_norm_approx(
+ bulk_methylated_count,
+ bulk_unmethylated_count,
+ normal_methylated_count,
+ normal_unmethylated_count,
+ purity,
+ ploidy
+)
+
+# HDI function (most accurate)
+CAMDAC:::vec_HDIofMCMC_mt(
+ bulk_methylated_count,
+ bulk_unmethylated_count,
+ normal_methylated_count,
+ normal_unmethylated_count,
+ purity,
+ ploidy,
+ credMass=0.99
+)
+The germline sample is optional: in the absence of patient-matched methylation data, you may already have an allele-specific CNA solution for your bulk tumor. For example, this could be derived from bulk WGS of the same sample.
+You can provide this data in a tab-delimited text file as shown below. Importantly:
| chrom | start | end | major_cn | minor_cn | purity | ploidy |
|---|---|---|---|---|---|---|
| chr1 | 1 | 400 | 2 | 1 | 0.67 | 3.5 |
| chr1 | 401 | 1000 | 1 | 1 | 0.67 | 3.5 |
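A file with this layout can be produced from an R data frame with base functions. The sketch below reuses the two example segments from the table; the output path and object names are hypothetical, and the only assumption about the format is the tab-delimited layout with a header row described above.

```r
# Example segments using the column names from the table above
cna <- data.frame(
  chrom = c("chr1", "chr1"),
  start = c(1, 401),
  end = c(400, 1000),
  major_cn = c(2, 1),
  minor_cn = c(1, 1),
  purity = 0.67,
  ploidy = 3.5
)

# Write as tab-delimited text with a header and no row names or quotes
cna_path <- file.path(tempdir(), "example.cna.txt")
write.table(cna, cna_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read back to confirm the layout
cna_check <- read.delim(cna_path)
print(cna_check)
```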
To run CAMDAC with this CNA solution, attach the file to the tumor CamSample() object:
+library(CAMDAC)
+
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+cna_file <- system.file("testdata", "test.cna.txt", package = "CAMDAC")
+
+# Set config
+config <- CamConfig(outdir="./results", bsseq="wgbs", lib="pe", build="hg38", n_cores=10)
+
+# Create tumor object and attach CNA solution
+tumor <- CamSample(id="T", sex="XY", bam=b_tumor)
+attach_output(tumor, config, "cna", cna_file)
+
+# Define normal object(s) for deconvolution or differential methylation
+normal <- CamSample(id="N", sex="XY", bam=b_normal)
+
+# Run pipeline with CNA solution
+pipeline(
+ tumor=tumor,
+ germline=NULL,
+ infiltrates=normal,
+ origin=normal,
+ config=config
+)
+If no SNP file is present for the germline, CAMDAC will infer the copy number calls from the tumor sample alone. Here, the BAF is calculated by a threshold on the tumor BAF, and the LogR is calculated by taking the coverage relative to the median. These results are not as accurate as using a germline normal sample.
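The tumour-only fallback described above can be sketched as follows. The BAF cut-offs of 0.1 and 0.9 and all input values are assumptions chosen for illustration, and the log2 scaling of the LogR is a common convention rather than something stated here; CAMDAC's own thresholds and normalisation may differ.

```r
# Hypothetical tumour SNP data: B-allele frequency and total read depth
tumour_snps <- data.frame(
  BAF = c(0.02, 0.35, 0.50, 0.97, 0.62),
  total_depth = c(30, 45, 60, 28, 37)
)

# Assumed cut-offs for calling a SNP heterozygous from the tumour BAF alone
baf_min <- 0.1
baf_max <- 0.9
tumour_snps$is_het <- tumour_snps$BAF > baf_min & tumour_snps$BAF < baf_max

# LogR as log2 coverage relative to the sample median
tumour_snps$LogR <- log2(tumour_snps$total_depth / median(tumour_snps$total_depth))

print(tumour_snps)
```

The weakness of this approach is visible in the sketch: a heterozygous SNP in a region of loss of heterozygosity can have a tumour BAF near 0 or 1 and be missed by the threshold.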
+You may already know where heterozygous SNPs lie for your sample, obviating the need for a tumor BAF threshold. In addition, you may have a proxy of the normal coverage for your platform, which is an improvement over taking the tumor median. You can provide this information by attaching a SNPs file to the germline CamSample object. The file should contain:
| Field | Description |
|---|---|
| chrom | Chromosome name |
| POS | Position of SNP |
| BAF | (optional) B-allele frequency at this SNP |
| total_counts | (optional) Total number of reads at this SNP |
POS and total_counts are used to derive the BAF and the LogR respectively. We strongly recommend that total_counts is derived from a normal sample sequenced with the same bisulfite-sequencing assay as the tumor; samples from unmatched patients are acceptable.
+CAMDAC may be run to the copy number calling stage using the external heterozygous SNP file:
+
+library(CAMDAC)
+
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+snps_file <- system.file("testdata", "test.to.norm_pos.csv.gz", package = "CAMDAC")
+
+# Set config
+config <- CamConfig(outdir="./results", bsseq="wgbs", lib="pe", build="hg38", n_cores=10)
+
+# Create tumor object
+tumor <- CamSample(id="T", sex="XY", bam=b_tumor)
+
+# Define germline object and attach the external SNPs file
+germline <- CamSample(id="G", sex="XY")
+attach_output(germline, config, "snps", snps_file)
+
+# Run pipeline up to copy number calling
+pipeline(
+ tumor=tumor,
+ germline=germline,
+ infiltrates=NULL,
+ origin=NULL,
+ config=config
+)
+After this, we recommend inspecting the CNA results. If all is well, the pipeline() function can be repeated with the infiltrates and origin CamSamples to complete deconvolution and differential methylation respectively.
+CAMDAC can be used to detect allele-specific methylation (ASM) by phasing CpGs to heterozygous SNPs and deconvolving bulk methylation rates per allele.
+This tutorial steps through the ASM analysis pipeline (WGBS only):
+Results from this pipeline are found in the results directory under ‘PATIENT/AlleleSpecific’ and ‘PATIENT/Methylation’. See output file headings below for files and their content.
+The asm_pipeline() function runs CAMDAC-ASM analysis by generating the allele-specific copy number solution and heterozygous SNP loci, followed by deconvolution and differential ASM analysis:
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+regions <- system.file("testdata", "test_wgbs_segments.bed", package = "CAMDAC") # speed up tests
+
+tumor <- CamSample(id = "T", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "N", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./results", ref = "./pipeline_files", bsseq = "wgbs", lib = "pe", cores = 10,
+ min_cov = 1, # For test data
+ regions = regions
+)
+
+asm_pipeline(
+ tumor = tumor,
+ germline = normal,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
+To run the ASM pipeline without BAM files, CAMDAC requires:
+- Each CamSample object has SNP loci
+- The tumor CamSample object has an allele-specific CNA solution
+- All CamSample objects have BAM files available for phasing
+CAMDAC-ASM requires a file of heterozygous SNP loci against which CpGs will be phased. This is a tab-delimited file with a header containing four fields:
| Field | Description |
|---|---|
| chrom | Chromosome name |
| pos | SNP locus position |
| ref | The reference allele (A/C/T/G) |
| alt | The alternate SNP allele (A/C/T/G) |
First, attach your SNP loci file to the tumor object with attach_output(), then run asm_pipeline():
+# Setup CAMDAC samples
+tumor <- CamSample(id = "tumor", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "normal", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./results", ref = "./pipeline_files", bsseq = "wgbs", lib = "pe", cores = 10,
+ min_cov = 1, # For test data
+ regions = regions # For rapid testing
+)
+
+# Add SNPs
+asm_snps_file <- system.file("testdata", "test_het_snps.tsv", package = "CAMDAC")
+attach_output(tumor, config, "asm_snps", asm_snps_file)
+attach_output(normal, config, "asm_snps", asm_snps_file)
+Next, CAMDAC requires the allele-specific copy number solution from the tumor, attached as follows:
+
+cna_file <- system.file("testdata", "test_cna.tsv", package = "CAMDAC")
+attach_output(tumor, config, "cna", cna_file)
+Finally, run the allele-specific methylation pipeline:
+
+asm_pipeline(
+ tumor = tumor,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
+If you have already run the CAMDAC pipeline in tumor-normal mode, then the germline object’s SNP files will be used by default. The simplest run from BAM to ASM is shown below using matched normals for infiltrates and DMPs:
+
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+regions <- system.file("testdata", "test_wgbs_segments.bed", package = "CAMDAC") # speed up tests
+
+tumor <- CamSample(id = "T", sex = "XY", bam = b_tumor)
+normal <- CamSample(id = "N", sex = "XY", bam = b_normal)
+config <- CamConfig(
+ outdir = "./test_results", bsseq = "wgbs", lib = "pe",
+ build = "hg38", n_cores = 10,
+ regions = regions,
+ min_cov = 1, # For test data
+ cna_caller = "ascat" # Battenberg always recommended, however ASCAT used here for rapid testing.
+)
+
+# Run the main CAMDAC pipeline to generate SNP files for ASM
+# Deconvolution skipped here for simplicity.
+pipeline(tumor, germline = normal, infiltrates = NULL, origin = NULL, config)
+
+# Run ASM pipeline
+asm_pipeline(
+ tumor = tumor,
+ germline = normal,
+ infiltrates = normal,
+ origin = normal,
+ config = config
+)
+**Allele-specific/**
+vignettes("pipeline").
+**Methylation/**
+This feature is currently described for CAMDAC-WGBS only.
+CAMDAC supports the use of multiple DNA methylation BAM files as a source of the normal infiltrates or normal cell of origin.
+To create a panel, process your BAM files with the CAMDAC allele counter:
+library(CAMDAC)
+
+# Get BAM files
+b_normal1 = system.file("testdata", "normal.bam", package = "CAMDAC")
+b_normal2 = system.file("testdata", "normal.bam", package = "CAMDAC")
+b_normal3 = system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Run allele counter on each BAM file
+for(file in c(b_normal1, b_normal2, b_normal3)){
+ prefix = fs::path_ext_remove(file)
+ outfile = paste0(prefix, ".all.SNPs.CG.csv.gz")
+ data = cmain_count_alleles(file)
+ data.table::fwrite(data, outfile)
+}
+The allele counts files can then be merged into a single file for the panel containing methylation data for deconvolution:
+
+panel_counts <- fs::dir_ls(".", glob="*.SNPs.CG.csv.gz")
+panel <- panel_meth_from_counts(panel_counts)
+data.table::fwrite(panel, "panel.m.csv.gz")
+By default, panel counts are merged by summing the methylation read counts for each CpG site. You can customise the proportion of each sample that is used in the panel by specifying the ac_props argument in panel_meth_from_counts. To get the mean across each CpG site, simply pass equal proportions for each sample.
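The difference between summing counts and weighting samples by proportion can be seen in a small base R sketch. It does not call panel_meth_from_counts; the count matrices are hypothetical, and interpreting the proportions as weights on each sample's methylation rate is an assumption of this example.

```r
# Hypothetical methylated (M) and unmethylated (UM) counts:
# rows are CpGs, columns are three normal samples
M <- matrix(c(8, 1, 9, 2, 7, 0), nrow = 2,
            dimnames = list(NULL, c("n1", "n2", "n3")))
UM <- matrix(c(2, 9, 1, 8, 3, 10), nrow = 2,
             dimnames = list(NULL, c("n1", "n2", "n3")))

# Count-summing merge: pool reads across samples, then take the rate
m_summed <- rowSums(M) / (rowSums(M) + rowSums(UM))

# Proportion-weighted alternative: weight each sample's own rate
props <- c(n1 = 1, n2 = 1, n3 = 1) / 3
rates <- M / (M + UM)
m_weighted <- as.vector(rates %*% props)

print(m_summed)
print(m_weighted)
```

Here every sample has the same coverage, so both approaches agree. With unequal coverage, pooled counts favour deeply sequenced samples, while equal proportions give each sample the same influence.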
To run CAMDAC with your newly created panel, attach your panel to a CamSample object using the meth argument.
+# Load test data
+b_tumor <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+b_normal <- system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Setup CAMDAC samples
+tumor <- CamSample(id="tumor", sex="XY", bam=b_tumor)
+normal <- CamSample(id="normal", sex="XY", bam=b_normal)
+config <- CamConfig(outdir="./results", ref="./pipeline_files", bsseq="wgbs", lib="pe", cores=10)
+
+# Setup panel sample
+panel <- CamSample(id="panel", sex="XY")
+panel_file <- system.file("testdata", "test_panel.m.csv.gz", package = "CAMDAC")
+attach_output(panel, config, "meth", panel_file)
+
+# Run CAMDAC with panel
+pipeline(
+ tumor=tumor,
+ germline=normal,
+ infiltrates=panel,
+ origin=panel,
+ config=config
+)
+If you have not started from BAM files, you can create a panel using a matrix of beta values:
| sample1 | sample2 | sample3 |
|---|---|---|
| 0.5 | 0.6 | 0.7 |
| 0.4 | 0.5 | 0.6 |
Additionally, a data frame specifying the positions of each CpG site in the beta value matrix is required. Here, start and end refer to the C and G of the CpG site respectively:
| chrom | start | end |
|---|---|---|
| chr1 | 100 | 101 |
| chr1 | 200 | 201 |
The matrix and CpG locations can be passed directly to the panel_meth_from_beta() function, along with settings.
+# Load beta values and chromosome positions
+ex <- system.file("testdata", "test_panel_from_beta.csv", package = "CAMDAC")
+data <- data.table::fread(ex)
+mat = data[, 4:ncol(data)] # Beta value matrix with 3 samples
+
+# Create panel from beta values
+panel_beta <- panel_meth_from_beta(
+ mat = mat,
+ chrom = data$chrom,
+ start = data$start,
+ end = data$end,
+ cov = 100,
+ props = c(0.1, 0.8, 0.1), # Proportions of each sample in panel
+ min_samples = 1,
+ max_sd = 1
+)
+As CAMDAC requires coverage at each CpG site to estimate uncertainty, the cov value is given to all CpG sites when building a panel from beta values. Additionally, if any beta values are missing from a sample, proportions are recalculated among the remaining samples as this is the only information available to build the panel for that site.
There are two experimental arguments that can be set to filter CpG sites from the panel:
+- min_samples: The minimum number of samples that must have a beta value for a CpG to be included in the panel. With sparse data, this lets you skip sites where you aren’t confident in the panel. Set this to 1 to use any sample.
- max_sd: The maximum standard deviation of beta values across samples that a CpG may have and still be included in the panel. When combining many bulk methylomes from the same tissue, sites with high variability reflect sample-specific differences, and their averages are less reliable for use in a methylation panel.
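The recalculation of proportions at sites with missing values, together with the two filters, can be sketched in base R. This illustrates the behaviour described above and is not the panel_meth_from_beta() implementation: the matrix is hypothetical, max_sd is set to 0.3 so that the filter has a visible effect, and passing single-sample sites through the max_sd filter is an assumption.

```r
# Hypothetical beta matrix: rows are CpGs, columns are samples
mat <- matrix(
  c(0.5, 0.6, 0.7,
    0.4, NA,  0.6,
    0.1, 0.9, NA,
    NA,  NA,  NA),
  ncol = 3, byrow = TRUE
)
props <- c(0.1, 0.8, 0.1)
min_samples <- 1
max_sd <- 0.3

# Zero the weight of missing values, then renormalise weights per CpG
w <- matrix(props, nrow = nrow(mat), ncol = ncol(mat), byrow = TRUE)
w[is.na(mat)] <- 0
panel_m <- rowSums(w * mat, na.rm = TRUE) / rowSums(w)

# Filters: enough samples observed, and limited spread across samples
n_obs <- rowSums(!is.na(mat))
sds <- apply(mat, 1, sd, na.rm = TRUE)
keep <- n_obs >= min_samples & (is.na(sds) | sds <= max_sd)
panel_m[!keep] <- NA

print(panel_m)
```

The third CpG is dropped because its two observed values disagree strongly, and the fourth because no sample has a value.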
CAMDAC produces several output files that visualise the copy number state. DNA methylation rates can be passed to external packages for visualisation. For a quick view of DMRs in R:
+
+library(data.table)
+library(ggplot2)
+library(CAMDAC)
+
+# Show DMPs around a region
+dmr <- data.table(dmr) # Object from CAMDAC output *annotated_DMRs.fst
+dmp <- data.table(dmp) # Object from CAMDAC *results_per_CpG.fst
+chrome <- dmr[1, ]$chrom
+starte <- dmr[1, ]$start
+ende <- dmr[1, ]$end
+offset <- 1000 # Offset 1kB either side of region
+dm_regions <- dmp[chrom == as.character(chrome) & start >= (starte - offset) & end <= (ende + offset), ]
+
+# Using ggplot, plot the m_t and m_n values across the region
+tplt <- ggplot(dm_regions, aes(x = start)) +
+ geom_point(aes(y = m_t), color = "skyblue") +
+ geom_point(aes(y = m_n), color = "grey") +
+ geom_vline(aes(xintercept = start, color = DMP_t)) +
+ theme_classic() +
+ scale_color_manual(values = c("skyblue", "blue")) +
+ scale_y_continuous(limits = c(0, 1)) +
+ geom_vline(xintercept = c(starte, ende), color = "red", linetype = "dashed") +
+ labs(x = dm_regions$chrom[[1]])
+tplt
CAMDAC DMR Visualization
+Here, light blue dots are the pure tumour, while light grey dots are the normal. The red dashed lines mark the DMR boundaries, and the vertical lines are hypomethylated DMPs (blue) and hypermethylated DMPs (light blue).
+introduction.Rmd
+Solid tumours typically contain both cancer and admixed normal contaminating cells, which confounds the analysis of bulk cancer methylomes from bisulfite sequencing. To address these issues, we present CAMDAC, a tool for Copy-number Aware Methylation Deconvolution Analysis of Cancer.
+In brief, we show that the bulk tumour methylation rate (\(m_b\)) can be expressed as a weighted sum of the methylation rates of the tumour cells and normal contaminants, accounting for tumour purity and copy number (Figure 1). We derive purity and copy number estimates directly from bulk tumour RRBS data, leveraging somatic copy number aberration calls from ASCAT or Battenberg. We use bulk tissue- and sex-matched normal samples as a proxy for the normal tumour-infiltrating cells (\(m_{n,i}\)), and obtain \(m_b\) from the bulk tumour data itself. This provides all the necessary information to extract the pure tumour methylation rate (\(m_t\)).
+
Figure 1. CAMDAC principles and key variables. Adapted from Larose Cadieux et al., 2020.
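The weighted-sum relationship can be rearranged to recover the tumour rate from the bulk rate. The sketch below assumes the weights are the DNA contributions of each compartment (purity times tumour copy number for the tumour, and one minus purity times 2 for a diploid normal). `deconvolve_rate` is a hypothetical helper written for illustration and is not part of the CAMDAC API.

```r
# Rearranged mixture model: solve the bulk rate for the tumour rate.
# Assumes a diploid normal (n_n = 2); purity and n_t would come from
# the copy number step.
deconvolve_rate <- function(m_b, m_n, purity, n_t, n_n = 2) {
  tumour_dna <- purity * n_t
  normal_dna <- (1 - purity) * n_n
  (m_b * (tumour_dna + normal_dna) - m_n * normal_dna) / tumour_dna
}

# Bulk rate 0.5, normal rate 0.1, purity 0.5, tumour copy number 2
m_t <- deconvolve_rate(m_b = 0.5, m_n = 0.1, purity = 0.5, n_t = 2)
print(m_t)
```

Because the bulk and normal rates are noisy estimates, the result can fall slightly outside the interval from 0 to 1, in which case it would need to be clipped.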
+
In Larose Cadieux et al., 2020, we obtained bulk tumour RRBS data from surgically resected lung cancers and patient-matched tumour-adjacent normal lung samples. Normal samples may be used for copy number profiling, as a proxy for the normal tumour-infiltrating cells (\(m_{n,i}\)), and as a proxy for the tumour cell of origin (\(m_{n,o}\)). Here, \(m_{n,i}\) is needed for bulk tumour methylation rate deconvolution and \(m_{n,o}\) is required for differential methylation analyses (Figure 2). In non-small cell lung cancer, we demonstrate that patient-matched tumour-adjacent normal is a suitable proxy for all normals, i.e. \(m_{n,i} \approx m_{n,o}\) (Larose Cadieux et al., 2020).

Figure 2. Key input and output data for CAMDAC
+
If the patient-matched tumour-adjacent normal tissue is not available, a tissue- and sex-matched normal may provide a substitute for the tumour-infiltrating normal cells (Figure 2). If the tissue-matched normal is a poor representative of the cell of origin, a different proxy may be used for differential methylation analysis.
The purified tumour methylation rates allow for accurate differential methylation analysis, both between tumour and normal cells and, in the case of multi-region sequencing, between different tumour samples. The deconvoluted methylation profiles accurately inform inter- and intra-tumour sample relationships and could enable the timing of copy number gains and (epi)mutations in tumour evolution. This is explained in more detail in Larose Cadieux et al., 2020.
+At the time of writing, CAMDAC is compatible with human MspI-digested single-end directional reduced representation bisulfite sequencing (RRBS) data and whole genome bisulfite sequencing (WGBS) data. The input must be in binary alignment map (BAM) format. Bases should be quality and adapter trimmed, and PCR duplicates should be removed. BAM files may be aligned to the hg19, hg38, GRCh37 or GRCh38 human reference genome builds.
output.Rmd
The CAMDAC pipeline returns a structured directory at the outdir from the CamConfig() object. The pipeline returns files unique to the RRBS and WGBS modules with the general structure:
└── <CamSample.patient_id>
+ ├── Allelecounts
+ │ ├── <CamSample.id>
+ ├── Copynumber
+ │ ├── <CamSample.id>
+ └── Methylation
+ └── <CamSample.id>
+The sections below describe each results file in more detail. Next, see vignette("questions") for frequently asked questions or vignette("experimental") for details on experimental CAMDAC features.
results/
+└── P
+ ├── Allelecounts
+ │ ├── N
+ │ │ └── P.N.SNPs.CpGs.all.sorted.RData
+ │ └── T
+ │ └── P.T.SNPs.CpGs.all.sorted.RData
+ ├── Copy_number
+ │ ├── N
+ │ │ ├── fragment_length_histogram.pdf
+ │ │ ├── msp1_fragments_RRBS.RData
+ │ │ ├── P_N_normal_SNP_data.pdf
+ │ │ ├── P.N.SNPs.RData
+ │ │ └── Rplots.pdf
+ │ └── T
+ │ ├── fragment_length_histogram.pdf
+ │ ├── msp1_fragments_RRBS.RData
+ │ ├── P_T_SNP_data.pdf
+ │ ├── P.T.ACF.and.ploidy.txt
+ │ ├── P.T.ascat.bc.RData
+ │ ├── P.T.ascat.frag.RData
+ │ ├── P.T.ascat.output.RData
+ │ ├── P.T.ASCATprofile.png
+ │ ├── P.T.ASPCF.png
+ │ ├── P.T.BAF.PCFed.txt
+ │ ├── P.T.germline.png
+ │ ├── P.T.LogR.PCFed.txt
+ │ ├── P.T.rawprofile.png
+ │ ├── P.T.SNPs.RData
+ │ ├── P.T.sunrise.png
+ │ ├── P.T.tumour.png
+ │ └── Rplots.pdf
+ └── Methylation
+ ├── N
+ │ ├── dt_normal_m.RData
+ │ └── P_N_methylation_rate_summary.pdf
+ └── T
+ ├── CAMDAC_DMPs.bed
+ ├── CAMDAC_purified_tumour.bed
+ ├── CAMDAC_results_per_CpG.RData
+ ├── P_T_DMP_stats.txt
+ ├── P_T_methylation_rate_summary.pdf
+ ├── purified_tumour.RData
+ └── tumour_versus_normal_methylomes.pdf
| File | Description |
|---|---|
| P.T.SNPs.CpGs.all.sorted.RData | Allele counts for a sample. Generated by processing BAM file |
| P.T.ascat.output.RData | ASCAT copy number results |
| P.T.ASCATprofile.png | ASCAT copy number profile |
| dt_normal_m.RData | Bulk normal DNA methylation data |
| purified_tumour.RData | CAMDAC-purified DNA methylation rates |
| CAMDAC_results_per_CpG.fst | CAMDAC deconvolution and differential methylation results |
CAMDAC outputs are written in the directory given by config$outdir in the format PATIENT/DATASET/SAMPLE/:
└── P
+ ├── Allelecounts
+ │ ├── N
+ │ │ └── P.N.SNPs.CpGs.all.sorted.csv.gz
+ │ └── T
+ │ └── P.T.SNPs.CpGs.all.sorted.csv.gz
+ ├── Copynumber
+ │ ├── N
+ │ │ └── P.N.SNPs.csv.gz
+ │ └── T
+ │ ├── ascat
+ │ ├── battenberg
+ │ ├── P.T.cna.txt
+ │ ├── P.T.SNPs.csv.gz
+ │ └── P.T.tnSNP.csv.gz
+ └── Methylation
+ ├── N
+ │ └── P.N.m.csv.gz
+ └── T
+ ├── P.T.CAMDAC_annotated_DMRs.fst
+ ├── P.T.CAMDAC_results_per_CpG.fst
+ ├── P.T.m.csv.gz
+ └── P.T.pure.csv.gz
| File | Description |
|---|---|
| P.T.SNPs.CpGs.all.sorted.csv.gz | Allele counts for a sample. Generated by processing BAM file |
| P.T.SNPs.csv.gz | SNP counts for a sample. |
| P.T.cna.txt | CAMDAC CNA result |
| P.T.m.csv.gz | Bulk methylation data |
| P.T.m.pure.csv.gz | CAMDAC-deconvolved methylation data |
| P1.T.CAMDAC_results_per_CpG.fst | CAMDAC differentially methylated cytosines |
| P1.T.CAMDAC_annotated_DMRs.fst | CAMDAC differentially methylated regions |
It is possible to manually override outputs for runs. See vignette("questions") for more details.
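The WGBS layout shown above can also be navigated from R. The following is a minimal sketch, assuming the directory names and file suffixes listed in the tree; `camdac_output_path()` is a hypothetical helper written for illustration and is not part of the CAMDAC API.

```r
# Hypothetical helper (not a CAMDAC function): build the expected path of a
# WGBS output file from the directory layout shown in the tree above.
camdac_output_path <- function(outdir, patient_id, sample_id,
                               type = c("counts", "snps", "meth", "cna", "pure")) {
  type <- match.arg(type)
  # Sub-directory and file suffix for each output type, as listed in the tree
  layout <- list(
    counts = c("Allelecounts", "SNPs.CpGs.all.sorted.csv.gz"),
    snps   = c("Copynumber",   "SNPs.csv.gz"),
    meth   = c("Methylation",  "m.csv.gz"),
    cna    = c("Copynumber",   "cna.txt"),
    pure   = c("Methylation",  "pure.csv.gz")
  )
  entry <- layout[[type]]
  fname <- paste(patient_id, sample_id, entry[2], sep = ".")
  file.path(outdir, patient_id, entry[1], sample_id, fname)
}

camdac_output_path("./results", "P", "T", "cna")
```

A helper like this may be convenient for checking with `file.exists()` which steps of a run have completed.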
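pipeline.Rmd
The entry point to CAMDAC is the pipeline() function, which expects a CamConfig() object and four CamSample() objects representing:
tumor: the tumor sample
germline: the germline sample, used for CNA calling
infiltrates: the infiltrating normal sample, used for deconvolution
origin: the cell of origin sample, used for differential methylation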
The same normal sample may be passed repeatedly for the germline, infiltrates or origin, depending on your experimental design. See ?pipeline for more details.
+library(CAMDAC)
+
+# Path to BAM files
+tumor_bam <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+normal_bam <- system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Select samples for basic tumor-normal analysis
+tumor <- CamSample(id = "T", sex = "XY", bam = tumor_bam)
+normal <- CamSample(id = "N", sex = "XY", bam = normal_bam)
+
+# Configure pipeline
+config <- CamConfig(
+ outdir = "./results", bsseq = "rrbs", lib = "pe",
+ build = "hg38", refs = "./refs", n_cores = 1, cna_caller = 'ascat'
+)
+
+# Run CAMDAC
+CAMDAC::pipeline(
+ tumor, germline = normal, infiltrates = normal, origin = normal, config
+)
Next, see vignette("output") for a detailed summary of CAMDAC results files.
questions.RmdIdeally, CAMDAC is run with a matched normal sample from which to derive heterozygous germline SNPs for copy number estimation. In the absence of matched normals, a panel of sex- and tissue-matched normal samples may be used by averaging DNA methylation rates from multiple patients. See vignette("experimental") for more information.
Please raise an issue on GitHub to request files for a new reference genome.
+When calling pipeline, if you do not give a normal infiltrate or cell of origin, the pipeline skips deconvolution and differential methylation, respectively. This may be useful to run a quick first pass to find and refit copy number solutions. When CAMDAC has found a solution and is rerun with the same tumor, config, and normal, passing the infiltrates and origin arguments will continue the pipeline where it left off. The entire pipeline can be re-run by deleting the output directory or setting overwrite=TRUE in the CamConfig.
The simplest way is to call pipeline with overwrite=FALSE in your config, giving the right normal sample for your step. Additionally, your CamConfig must use the same output directory.
If, for any reason, you have changed the output directory structure from a previous run, you can initiate CAMDAC by manually passing outputs to CamSample objects. See vignette("output") for more information.
Finally, you can run the cmain_* functions used by pipeline() directly. For example, to run the deconvolution step, you can call cmain_deconvolve_methylation().
If you want to use an external purity and ploidy solution, simply pass a CNA file that has only the purity and ploidy fields. Additionally, set refit=TRUE in the CamConfig and CAMDAC will use this to refit the sample.
To analyse specific genomic regions, you may pass a BED file to CAMDAC config:
+
+CamConfig(outdir=".", bsseq="wgbs", lib="pe", build="hg38", refs="./pipeline_files", regions="regions.bed")
CAMDAC will merge any overlapping regions prior to analysis.
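To illustrate what merging overlapping regions means, here is a minimal base R sketch. It is not the CAMDAC implementation; `merge_regions()` and the column names `chrom`, `start` and `end` are assumptions made for this example, and touching intervals are merged as well.

```r
# Hypothetical helper (not a CAMDAC function): merge overlapping or touching
# intervals per chromosome using base R only.
merge_regions <- function(bed) {
  bed$chrom <- as.character(bed$chrom)
  bed <- bed[order(bed$chrom, bed$start, bed$end), ]
  out <- list()
  for (chrom in unique(bed$chrom)) {
    x <- bed[bed$chrom == chrom, ]
    # A new merged region begins whenever a start exceeds the running maximum end
    running_end <- cummax(x$end)
    new_group <- c(TRUE, x$start[-1] > running_end[-nrow(x)])
    grp <- cumsum(new_group)
    out[[chrom]] <- data.frame(
      chrom = chrom,
      start = as.numeric(tapply(x$start, grp, min)),
      end   = as.numeric(tapply(x$end, grp, max))
    )
  }
  merged <- do.call(rbind, out)
  rownames(merged) <- NULL
  merged
}

regions <- data.frame(
  chrom = c("chr1", "chr1", "chr1", "chr2"),
  start = c(100, 150, 400, 10),
  end   = c(200, 300, 500, 20)
)
merge_regions(regions)
```

In this toy input, the first two chr1 intervals overlap and collapse into a single region, while the remaining intervals are returned unchanged.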
+If you have outputs from a previous run, you can manually assign them to a CAMDAC object. This overwrites the expected path for that output type, allowing the pipeline to run with this data instead of computing it. Use the attach_output function, passing one of five codes:
counts: CAMDAC allele counts *.SNPs.CpGs.all.sorted.csv.gz file
snps: CAMDAC sample SNP counts *.SNPs.csv.gz file
meth: CAMDAC bulk methylation *.m.csv.gz file
cna: CAMDAC CNA *.cna.txt file
pure: CAMDAC deconvolved methylation *.m.pure.csv.gz file
For example, to attach a previous counts file to a CAMDAC object:
+
+library(CAMDAC)
+tumor <- CamSample(id = "T", sex = "XY", bam = NULL)
+config <- CamConfig(outdir = tempdir(), build="hg38", bsseq="wgbs", lib="pe")
+counts_file <- system.file("testdata", "test.SNPs.CpGs.all.sorted.csv.gz", package = "CAMDAC")
+tumor <- attach_output(tumor, config, "counts", counts_file)
The CAMDAC pipeline can now access the file in the expected location at config$outdir.
setup.RmdFrom the R console, install CAMDAC from github:
+
+install.packages("remotes")
+remotes::install_github("VanLoo-lab/CAMDAC")
CAMDAC requires custom annotation files for RRBS and WGBS analysis, available at the Zenodo repository: (10565423). An R convenience function is provided to download these files:
+
+CAMDAC::download_pipeline_files(bsseq = "rrbs", directory = "./refs")
+CAMDAC::download_pipeline_files(bsseq = "wgbs", directory = "./refs")
Now, you’re ready to run CAMDAC! Next, see vignette("pipeline").
CAMDAC searches for pipeline files in the following order:
+CamConfig())We recommend that you set the environment variable CAMDAC_PIPELINE_FILES to the directory where you downloaded the files. This will allow CAMDAC to find the files automatically whenever you load R.
From a Unix terminal:
+++echo "CAMDAC_PIPELINE_FILES=$(realpath R)" >> ~/.Renviron
+
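The search order can also be pictured from within R. The sketch below is an illustration only: `resolve_pipeline_files()` is a hypothetical helper that mirrors the documented order (explicit `refs` argument, then the `CAMDAC_PIPELINE_FILES` environment variable, then the current working directory). It does not reproduce CAMDAC's internal lookup, which is described as searching the working directory recursively.

```r
# Hypothetical helper (not a CAMDAC function) mirroring the documented order.
resolve_pipeline_files <- function(refs = NULL) {
  if (!is.null(refs)) return(refs)                 # 1. explicit `refs` argument
  env <- Sys.getenv("CAMDAC_PIPELINE_FILES", unset = NA)
  if (!is.na(env) && nzchar(env)) return(env)      # 2. environment variable
  getwd()                                          # 3. current working directory
}

# Set the variable for the current R session only (base R)
Sys.setenv(CAMDAC_PIPELINE_FILES = "./refs")
resolve_pipeline_files()
```

Unlike the `.Renviron` entry, `Sys.setenv()` does not persist across R sessions.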
CAMDAC-RRBS
+CAMDAC WGBS
+java: To run CAMDAC on WGBS data, we leverage Battenberg, which requires the java command-line utility. Download Java from https://openjdk.org/.
technical.Rmd
In this section, we provide a high-level summary of the CAMDAC pipeline, which covers six key steps:
For a full outline and validation of CAMDAC, please see Larose Cadieux et al. (2020) bioRxiv.
+Take a hypothetical female patient with primary tumour sample ID “T1” and normal-adjacent sample ID “N1”. First, CAMDAC takes the sequencing alignment files from each sample. Using the CamSample() function, users should provide the full path and file name of the RRBS or WGBS binary alignment map (.bam) file for each input sample, and use the CamConfig() object to indicate whether reads were aligned to hg19, hg38, GRCH37 or GRCH38. Bases should be quality and adapter trimmed, and PCR duplicates should be removed. Please ensure that the BAM file is sorted and indexed.
CAMDAC employs an allele counter module to count SNP and CpG (methylation) alleles for downstream analysis. SNP counts are performed at 1000 Genomes SNP positions, and CpG alleles are counted using dinucleotides. To speed up the computation, we leverage reference RRBS and WGBS genome files listing all genomic regions supported by the respective platforms.
+The read mapping quality filter is controlled by the min_mapq argument of CamConfig(), which defaults to min_mapq = 1. Mapping quality scores from bisulfite sequencing aligners may be biased against the alternate allele for reads with polymorphisms. Please review the mapping quality distribution of your data to determine if it is appropriate to increase this setting.
If the function is successful, a single output file with the suffix “SNPs.CpGs” is written. This file carries compiled SNP and methylation information with the following columns:
+
Figure. Formatted SNP and methylation information
+Each row is either a CG locus (and CCGG for RRBS) and/or a 1000g SNP position. These can be distinguished by the width column. While polymorphic CG/CCGG have the same width as their non-polymorphic counterpart, they are easily identified by looking at the POS, ref, alt and other SNP-informative columns.
+For each SNP locus, the 1000 Genomes genomic coordinate and the reference and alternate alleles are listed under the POS, ref and alt columns. The total_counts is the sum of alt_counts and ref_counts, which includes all informative strand-specific allele counts. For example, at \(C>T\) SNPs, only the reverse strand distinguishes the (un)methylated reference from the alternate allele, so all forward read counts are excluded from the total_counts column but included in total_depth. The SNP type column is only added to the patient-matched normal, which is used to assign SNP genotypes as either Homozygous or Heterozygous based on internal B-allele frequency (BAF) cut-offs.
+M, UM, total_counts_m, and m are the counts methylated, counts unmethylated, the total counts (un)methylated and the methylation rate, respectively. Methylation rates are calculated per CG allele, meaning that at polymorphic CpGs, only the CG-forming allele counts are considered. CAMDAC methylation rates are therefore polymorphism-independent.
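As a small worked example of these columns, the sketch below derives total_counts_m and m from M and UM. The counts are toy values chosen for illustration, not CAMDAC output.

```r
# Toy methylated (M) and unmethylated (UM) counts for three CpG loci
cpg <- data.frame(M = c(30, 0, 12), UM = c(12, 25, 12))

# Total (un)methylated counts and methylation rate per CpG
cpg$total_counts_m <- cpg$M + cpg$UM
cpg$m <- cpg$M / cpg$total_counts_m
cpg
```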
+For CCGG loci found in RRBS, the CCGG column indicates the number of fragments with a 5’ end at this CCGG locus. This number may be 0 at polymorphic CCGG loci homozygous for the CCGG-destroying allele. Furthermore, for RRBS, MspI fragment boundaries are determined from the aligned reads, and the MspI fragment size distribution is visualised for quality assessment in the file fragment_length_histogram.pdf. You should observe 3 distinct peaks in the fragment length distribution. This is characteristic of human RRBS libraries and originates from MspI-containing micro-satellite repeats of distinct lengths. The MspI fragment boundaries and their GC content are saved as an .RData object and used downstream in RRBS copy number profiling.
+
Figure. MspI fragment size distribution
+B-allele frequencies at heterozygous SNPs are leveraged to calculate pure tumour copy number aberrations using either ASCAT.m for RRBS or Battenberg.m for WGBS. These tools are inspired by ASCAT (Van Loo et al., 2010) and Battenberg (Nik-Zainal et al., 2012). If successful, CAMDAC writes copy number output to the “Copy_number” directory.
+A SNPs file lists the heterozygous SNPs selected for copy number analysis, resulting in a table where each row is a 1000g SNP position whose coverage in the germline sample meets the minimum set by the min_normal argument. The total_counts column is the total informative read counts. For example, at C\(>\)T SNPs, only the reverse strand distinguishes the unmethylated reference from the alternate allele, so forward read counts do not contribute to total_counts or to the BAF (B-allele frequency) calculation. rBAF is randomly assigned BAF or 1-BAF to remove biases against the alternate allele in downstream tumour copy number profiling. All read counts, however, contribute to total_depth, which is used for the LogR calculation, a measure of total coverage. Genotyping is performed and assignments are stored under type.
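The BAF and rBAF columns described above can be sketched as follows. This is an illustration with toy counts, assuming BAF is the alternate-allele fraction of the informative counts and that the BAF or 1-BAF choice is made independently per SNP with equal probability; the exact CAMDAC implementation may differ.

```r
set.seed(1)  # make the random flip reproducible

# Toy informative allele counts at three heterozygous SNPs
snps <- data.frame(ref_counts = c(20, 8, 15), alt_counts = c(18, 12, 5))
snps$total_counts <- snps$ref_counts + snps$alt_counts
snps$BAF <- snps$alt_counts / snps$total_counts

# Randomly keep BAF or replace it by 1 - BAF, independently for each SNP
flip <- runif(nrow(snps)) < 0.5
snps$rBAF <- ifelse(flip, 1 - snps$BAF, snps$BAF)
snps
```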
For the RRBS pipeline, we provide an experimental feature to visualise the magnitude of biases against alternate (B) alleles. The numbers of homozygous and heterozygous SNPs are depicted, and any biases in coverage against the latter can be evaluated. Because RRBS is biased towards CpG-rich genomic regions, a typical RRBS sample should show a high proportion of C\(>\)T SNPs. We note that C\(>\)T and A\(>\)G germline heterozygous SNPs will have roughly half the coverage of the other 4 types of SNPs.
+
Figure. Normal SNP data QC
+In addition to the above-mentioned columns, we also adjust for biases in the tumour LogR. The LogR is a normalised measure of tumour coverage used by ASCAT.m and Battenberg.m for copy number profiling together with the BAF. The covariates used for LogR correction are:
+Next, the standard ASCAT or Battenberg outputs are generated. All files have the dot-separated patient and sample IDs as prefix. In addition, we plot the BAF and LogR. In the BAF profiles, heterozygous SNPs are highlighted in red. The BAF and LogR tracks are then segmented by the respective tools. The segmentation is then analysed to determine the optimal tumour purity and ploidy solution via a grid search (see sunrise plot). Raw and rounded allele-specific copy number segments are provided as output png images.
+Finally, the purity, ploidy, number of heterozygous and homozygous 1000g SNP positions and median tumour and normal SNP depth are saved for each tumour sample. For RRBS, summary SNP data is plotted and saved as a pdf with filename "*_SNP_data.pdf*" and may help you troubleshoot your data.
+
Figure. Tumour SNP data summary
+As part of the allele counting step, CAMDAC calculates bulk DNA methylation rates for each input sample. For the patient- and tissue-matched normal sample “N1”, the methylation data columns have the suffix \(x = n\), since \(m_{n,i} \sim m_{n,o}\). Where \(m_{n,i} \neq m_{n,o}\), the suffix is set to \(x = n\_i\) for the normal infiltrates and \(x = n\_o\) for the normal cell of origin proxy sample. The uncertainty on \(m_{x}\) is computed as the 99% Highest Density Interval (HDI), whose lower and upper boundaries are stored under columns \(m_{x,low}\) and \(m_{x,high}\).
+
Figure. Normal methylation output.
+In the normal sample methylation output directory, you will find a pdf with methylation data summary and QC (RRBS only). We expect DNA methylation rates to sit near 0 and 1. CAMDAC calculates DNA methylation rates in a polymorphism-independent manner, meaning that the CG-destroying allele at a heterozygous CpG does not contribute to its methylation rate. The minimum coverage threshold applied to CpG sites is based on the CpG allele read depth, so any heterozygous SNPs present at the CG location may be removed due to insufficient coverage.
+
Figure. Normal methylation rate QC.
+At this stage, CAMDAC has obtained methylation rates for both the normal infiltrates and bulk tumour, as well as tumour copy number and purity estimates. The DNA methylation profile of the normal-adjacent samples may be used as a proxy for the methylation rate of tumour-infiltrating normal cells (\(m_{n,i}\)). We have all the necessary information to obtain CAMDAC pure tumour methylation rates, \(m_t\).
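To make the deconvolution step concrete, the sketch below assumes that the bulk methylation rate is a purity- and copy-number-weighted mixture of the tumour and normal rates, \(m_b = \frac{p \cdot CN \cdot m_t + (1-p) \cdot CN_n \cdot m_{n,i}}{p \cdot CN + (1-p) \cdot CN_n}\), and solves it for \(m_t\). The function name `purify_methylation()` and the clamping to [0, 1] are assumptions of this example rather than the CAMDAC implementation; the symbols p, CN and CN_n follow the argument names used in the reference pages.

```r
# Hypothetical helper (not a CAMDAC function).
# m_b: bulk tumour methylation rate; m_n: normal infiltrate methylation rate
# p: tumour purity; CN: total tumour copy number; CN_n: total normal copy number
purify_methylation <- function(m_b, m_n, p, CN, CN_n = 2) {
  m_t <- (m_b * (p * CN + (1 - p) * CN_n) - m_n * (1 - p) * CN_n) / (p * CN)
  pmin(pmax(m_t, 0), 1)  # clamp to [0, 1]; an assumption of this sketch
}

# A CpG fully methylated in a diploid tumour at 50% purity, unmethylated in
# the normal, has a bulk rate of 0.5 under this mixture model
purify_methylation(m_b = 0.5, m_n = 0, p = 0.5, CN = 2)
```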
+In the Methylation/ output directory, CpG copy number and purified tumour methylation data are written to output CSV files. Header fields include:
+CAMDAC-deconvoluted methylation rates can take any value between 0 and 1, while the range of bulk tumour methylation rates is driven by tumour DNA content. In the bulk tumour profiles, bi-allelic tumour-normal differentially methylated positions appear at intermediate methylation values, whereas after purification they form a peak near 0 or 1 for hypo- and hypermethylated positions, respectively.

Figure. Tumour versus normal methylation rates from before and after CAMDAC.
+For tumour-normal differential methylation analysis, CAMDAC expects a DNA methylation profile representing the tumour cell of origin (\(m_{n,o}\)). In this hypothetical example, we set the normal sample N1 as the cell of origin. Leveraging CAMDAC purified methylomes, we then obtain differentially methylated positions and regions.
+Differential DNA methylation is detected with a minimum tumour-normal methylation rate difference (effect size, where \(\delta\beta\) >= 0.2) and a probability threshold, representing the probability that the tumour and normal beta distributions do not overlap. Both variables are used for calling differentially methylated positions (DMPs).
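The two criteria can be illustrated with a simple Monte Carlo sketch. It assumes that each methylation rate has a Beta posterior under a uniform Beta(1, 1) prior and approximates the separation of the two distributions by the probability that one rate exceeds the other. `dmp_sketch()` is a hypothetical helper working directly on counts; CAMDAC's own test operates on purified tumour rates and may be computed differently.

```r
# Hypothetical helper (not the CAMDAC implementation).
dmp_sketch <- function(M_t, UM_t, M_n, UM_n,
                       effect_size = 0.2, prob = 0.99, n_draws = 1e5) {
  # Posterior draws under a uniform Beta(1, 1) prior (assumption of this sketch)
  tumour_draws <- rbeta(n_draws, M_t + 1, UM_t + 1)
  normal_draws <- rbeta(n_draws, M_n + 1, UM_n + 1)
  p_hyper <- mean(tumour_draws > normal_draws)
  p_diff <- max(p_hyper, 1 - p_hyper)
  # Effect size from the point estimates of the two methylation rates
  delta <- M_t / (M_t + UM_t) - M_n / (M_n + UM_n)
  list(delta = delta, prob = p_diff,
       is_dmp = abs(delta) >= effect_size && p_diff >= prob)
}

set.seed(42)
dmp_sketch(40, 2, 3, 40)$is_dmp    # large, well-supported difference
dmp_sketch(20, 20, 21, 19)$is_dmp  # difference below the effect size threshold
```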
+Next, CAMDAC builds on DMP calls to call DMRs. To identify differentially methylated regions (DMRs), we group CpGs into bins and look for clusters with at least 5 DMPs (min_DMP_counts_in_DMR=5), 4 of which must be consecutive (min_consec_DMP_in_DMR=4). After completion, this function generates a pure tumor methylation file (CAMDAC_results_per_CpG.RData for RRBS or pure.csv.gz for WGBS) in the CAMDAC methylation output directory. This R object is a combination of all CAMDAC results per CpG with DMP information included:
+The ratio of hyper- to hypomethylated DMRs varies across genomic regions, as reflected by the tumour-normal methylation rate difference.
+
Figure. DMR summary data.
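The DMR rule described earlier in this section (at least 5 DMPs per bin, 4 of which are consecutive) can be sketched with run-length encoding. `is_dmr()` is a hypothetical helper that takes the DMP flags of the CpGs in one bin, ordered by genomic position and without missing values; it illustrates the rule and is not CAMDAC's DMR caller.

```r
# Hypothetical helper (not a CAMDAC function): apply the DMR rule to one bin.
# is_dmp: logical vector of DMP calls for the CpGs in the bin, in genomic order
is_dmr <- function(is_dmp, min_DMP_counts = 5, min_consec_DMP = 4) {
  runs <- rle(is_dmp)
  # Length of the longest run of consecutive DMPs (0 if the bin has none)
  longest_run <- if (any(runs$values)) max(runs$lengths[runs$values]) else 0
  sum(is_dmp) >= min_DMP_counts && longest_run >= min_consec_DMP
}

is_dmr(c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE))         # 5 DMPs, 4 consecutive
is_dmr(c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE))  # 5 DMPs, longest run of 2
```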
+CAMDAC outputs will be stored at the user-defined project outdir variable given to the configuration (CamConfig()). A patient folder is created at this path with directory name set to patient_id. This will contain 3 subdirectories: Allelecounts, Copy_number and Methylation, with further sub-directories created for each of a given patient’s samples.
With CAMDAC differential methylation calls in hand, users may choose to look for recurrently aberrant loci across their cohort. Note that tumour-tumour DMPs can be easily identified by looking for overlap between the 99% HDIs for CAMDAC pure tumour methylation rates between samples (99% HDI \(\subseteq\) [m_t_low,m_t_high]).
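A minimal sketch of such an HDI comparison is shown below. `hdi_overlap()` is a hypothetical helper, and the interval values are toy numbers standing in for the m_t_low and m_t_high columns of two samples; CpGs whose intervals do not overlap would be candidate tumour-tumour DMPs.

```r
# Hypothetical helper (not a CAMDAC function): TRUE where two closed
# intervals [low_a, high_a] and [low_b, high_b] overlap. Vectorised over CpGs.
hdi_overlap <- function(low_a, high_a, low_b, high_b) {
  low_a <= high_b & low_b <= high_a
}

# Toy 99% HDIs for three CpGs in two tumour samples
sample_a <- data.frame(m_t_low = c(0.05, 0.40, 0.80), m_t_high = c(0.20, 0.60, 0.95))
sample_b <- data.frame(m_t_low = c(0.70, 0.55, 0.10), m_t_high = c(0.90, 0.75, 0.30))

overlap <- hdi_overlap(sample_a$m_t_low, sample_a$m_t_high,
                       sample_b$m_t_low, sample_b$m_t_high)
candidate_dmp <- !overlap
candidate_dmp
```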
+Clustering analyses can also easily be performed using well-established R packages such as ‘pvclust’ for hierarchical clustering with bootstrap and ‘umap’ (uniform manifold approximation and projection) for non-linear dimensionality reduction. Clustering of pure tumour methylation rates at promoter DMRs across large cohorts by ‘umap’ may reveal histology and/or sex-driven clusters, as described in non-small cell lung cancer (Larose Cadieux et al., 2020).
+For multi-region data, sample tree reconstruction by neighbour joining, leveraging CAMDAC pure tumour methylation rates at DMPs hypermethylated in at least one sample and subset to loci confidently unmethylated in the normal cell of origin (m_n_high<0.2), can reveal inter-sample relationships, as demonstrated in non-small cell lung cancer (Larose Cadieux et al., 2020).
+When running gene-set enrichment analysis (GSEA) on CAMDAC DMR calls, gene sets should be limited to those genes with promoters covered by RRBS. It may be desirable to subset DMR calls to hypermethylated promoter-associated CpG Islands given that methylation at these loci is most correlated with expression.
+Users may leverage normal, deconvoluted tumour methylation rates and tumour-normal DMP calls to separate clonal mono- and bi-allelic from subclonal bi-allelic methylation changes to shed light on tumour evolutionary histories (Larose Cadieux et al., 2020). The allele-specific CAMDAC module will be made available in future releases.
+Copy-number Aware Methylation Deconvolution Analysis of Cancer (CAMDAC) is an R library for deconvolving bulk tumor DNA methylation (bisulfite) sequencing data (Larose Cadieux et al., 2022, bioRxiv).
+ +CAMDAC can be installed from an R console:
+
+install.packages("remotes")
+remotes::install_github("VanLoo-lab/CAMDAC")
Download reference datasets required to run CAMDAC for RRBS and/or WGBS analysis from the Zenodo repository: (10565423). An R helper function is provided for convenience:
+
+CAMDAC::download_pipeline_files(bsseq = "rrbs", directory = "./refs")
+CAMDAC::download_pipeline_files(bsseq = "wgbs", directory = "./refs")
Run the tumor-normal deconvolution pipeline with test data:
++[!NOTE]
+We provide downsampled BAM files for testing the pipeline. For representative results, please use your own BAM files.
+library(CAMDAC)
+
+tumor_bam <- system.file("testdata", "tumor.bam", package = "CAMDAC")
+normal_bam <- system.file("testdata", "normal.bam", package = "CAMDAC")
+
+# Select samples for basic tumor-normal analysis
+tumor <- CamSample(id = "T", sex = "XY", bam = tumor_bam)
+normal <- CamSample(id = "N", sex = "XY", bam = normal_bam)
+
+# Configure pipeline
+config <- CamConfig(
+ outdir = "./results", bsseq = "rrbs", lib = "pe",
+ build = "hg38", refs = "./refs", n_cores = 1, cna_caller='ascat'
+)
+
+# Run CAMDAC
+CAMDAC::pipeline(
+ tumor, germline = normal, infiltrates = normal, origin = normal, config
+)
For a more detailed walkthrough with test data, see vignette("pipeline").
To contribute to CAMDAC, fork the repository and install the development dependencies with remotes::install_dev_deps('.').
After making your changes, run the build and test commands listed in vignette("contributing").
Finally, submit a pull request with the changes on your fork.
+pipeline() function.CamConfig.RdSet CAMDAC configuration
+CamConfig(
+ outdir,
+ bsseq,
+ lib,
+ build,
+ n_cores = 1,
+ regions = NULL,
+ refs = NULL,
+ n_seg_split = 50,
+ min_mapq = 1,
+ min_cov = 1,
+ min_normal_cov = 10,
+ overwrite = FALSE,
+ cna_caller = "battenberg",
+ cna_settings = NULL
+)
outdir: A path to save CAMDAC results. The results folder structure follows the format PATIENT/DATASET/SAMPLE/.
bsseq: Bisulfite sequencing platform. Choose between "wgbs" or "rrbs".
lib: Bisulfite sequencing library. Choose "pe" for paired end, "se" for single end.
build: Reference genome build. Choose "hg38" or "hg19".
n_cores: Number of cores to process CAMDAC data in parallel wherever possible.
regions: A BED file with regions to restrict the analysis to.
refs: Path to CAMDAC reference files. If this is not given, CAMDAC searches the environment variable CAMDAC_PIPELINE_FILES. If this is not set, CAMDAC searches recursively in the current working directory.
min_mapq: Minimum mapping quality filter used in cmain_allele_counts().
min_cov: Minimum coverage filter for: DNA methylation, Normal SNP selection.
overwrite: Config to overwrite files if they already exist.
cna_caller: The CNA caller to use. "ascat" or "battenberg". Default is "battenberg".
cna_settings: A list of settings to pass to the CNA caller. rho, psi, java, beaglemaxmem
CamSample.RdBuild CAMDAC sample object
+CamSample(id, sex, bam = NULL, patient_id = "P")
id: Unique identifier for the sample
sex: The sex of the patient, "XX" or "XY". Required for CNA calling.
bam: Sample BAM file. If not given, CAMDAC expects files linked with attach_output.
patient_id: An identifier for the patient
HDIofICDF.RdHDI of ICDF
+HDIofICDF(ICDFname, credMass = 0.99, tol = 0.0001, ...)The inverse cumulative density function of the distribution.
The desired mass of the HDI region.
Tolerance parameter for optimisation. The lower the tolerance, the longer the optimisation, but the higher the accuracy. According to CAMDAC RRBS comments, tol=1e-4 gives values of the same accuracy as our max resolution. This function is adapted from Greg Snow's TeachingDemos package. E.g. to determine the HDI of a CpG with M=30 and UM=12, add 1 to each shape parameter so that the uniform beta(1,1) is updated with our counts: HDIofICDF(qbeta, shape1 = 30+1, shape2 = 12+1)
Highest density interval (HDI) limits in a vector.
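For readers who want to see the idea behind an ICDF-based HDI, the following is a self-contained sketch: it searches for the lower tail probability that minimises the width of an interval containing the requested mass. `hdi_from_icdf()` is a hypothetical stand-in written for this illustration using base R's `optimize()`; it is not the package's HDIofICDF() source.

```r
# Hypothetical helper (not the CAMDAC source): narrowest interval containing
# `cred_mass` of a unimodal distribution, given its inverse CDF.
hdi_from_icdf <- function(icdf, cred_mass = 0.99, tol = 1e-4, ...) {
  # Width of the interval that leaves `low_tail` probability below it
  width <- function(low_tail) icdf(cred_mass + low_tail, ...) - icdf(low_tail, ...)
  opt <- optimize(width, interval = c(0, 1 - cred_mass), tol = tol)
  c(icdf(opt$minimum, ...), icdf(cred_mass + opt$minimum, ...))
}

# CpG with M = 30 and UM = 12, updating a uniform beta(1, 1) with the counts
hdi <- hdi_from_icdf(qbeta, shape1 = 30 + 1, shape2 = 12 + 1)
hdi
```

By construction the returned interval contains exactly `cred_mass` of the distribution; `tol` only affects how close its width is to the minimum.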
+HDIofMCMC.RdHDI of MCMC
+HDIofMCMC(M_b, UM_b, M_n, UM_n, p, CN, CN_n, credMass = 0.99)counts methylated in the tumour
counts unmethylated in the tumour
counts methylated in the normal
counts unmethylated in the normal
tumour purity
total tumour copy number
total normal copy number
default is 0.99 +credMass is a scalar between 0 and 1, indicating the mass within the +credible interval that is to be estimated.
Value: HDIlim is a vector containing the limits of the HDI
+HDIofMCMC_mt.RdComputes highest density interval from a sample of representative values, +estimated as shortest credible interval for a unimodal distribution
+HDIofMCMC_mt(M_b, UM_b, M_n, UM_n, p, CN, credMass = 0.99)counts methylated in the tumour
counts unmethylated in the tumour
counts methylated in the normal
counts unmethylated in the normal
tumour purity
total tumour copy number
default is 0.99 +credMass is a scalar between 0 and 1, indicating the mass within the +credible interval that is to be estimated.
total normal copy number
Value: HDIlim is a vector containing the limits of the HDI
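The "shortest credible interval from a sample of representative values" idea can be sketched as below. `hdi_of_samples()` is a hypothetical helper operating on a plain numeric vector of draws; how CAMDAC generates those draws from the counts, purity and copy number is not shown here.

```r
# Hypothetical helper (not the CAMDAC source): shortest interval containing
# at least `cred_mass` of the sampled values (assumes a unimodal distribution).
hdi_of_samples <- function(samples, cred_mass = 0.99) {
  sorted <- sort(samples)
  n_inc <- ceiling(cred_mass * length(sorted))  # index offset spanned by the interval
  n_ci <- length(sorted) - n_inc                # number of candidate intervals
  stopifnot(n_ci >= 1)
  idx <- seq_len(n_ci)
  widths <- sorted[idx + n_inc] - sorted[idx]
  best <- which.min(widths)
  c(sorted[best], sorted[best + n_inc])
}

set.seed(123)
draws <- rbeta(1e4, 30 + 1, 12 + 1)  # toy draws standing in for simulated rates
hdi_of_samples(draws)
```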
+LogR_correction.RdCorrect logR for msp1 fragment size bias and GC content
+LogR_correction(
+ dt_sample,
+ dt_SNPs,
+ build,
+ chr_names,
+ min_normal,
+ fragments_file,
+ replic_timing_file_prefix,
+ n_cores
+)Allelecounts output as a data.table
Allelecounts output subset to QC'ed SNP positions
Character variable corresponding to the reference genome version used for alignment
Character variable with the seqlevels.
Numerical with the minimum normal coverage threshold
CAMDAC reference MspI fragments file
CAMDAC reference replication timing files path and file name prefix
Numerical value corresponding to the number of cores for parallel processing
annotate_copy_number.Rdannotate_copy_number returns the data.table dt_sample annotated with allele-specific copy numbers
annotate_copy_number(dt_sample, seg, rm_sex_chrom = FALSE)data.table object with each CpG and their coverage, counts methylated and methylation rate
ASCAT.m copy number segments object
Logical indicating if you would like to remove sex chrom from downstream analyses
A dataframe for each sample_id with the copy number calls added
+ascat.m.plotRawData.RdPlot tumour and germline BAF and LogR
+ascat.m.plotRawData(ASCATobj, raw_LogR, pch, cex, lim_logR)an ASCAT object (e.g. data structure from ascat.loadData)
vector with the LogR values before correction
type of data points in plot
size of data points in plot
y-axis limits on logR plot
Produces png files showing the logR and BAF values for tumour and germline samples
+ascat.m.plotSegmentedData.RdPlot segmented BAF LogR
+ascat.m.plotSegmentedData(ASCATobj, lim_logR = 2)an ASCAT object (e.g. data structure from ascat.loadData)
Produces png files showing the logR and BAF values for tumour and germline samples
+ascat.plotRawData.flags.RdPlot BAF LogR
+ascat.plotRawData.flags(ASCATobj, pch, cex, lim_logR)an ASCAT object (e.g. data structure from ascat.loadData)
type of data points in plot
size of data points in plot
y-axis limits on logR plot
Produces png files showing the logR and BAF values for tumour and germline samples
+asm_pipeline.RdRun allele-specific methylation analysis pipeline
+asm_pipeline(tumor, germline = NULL, infiltrates = NULL, origin = NULL, config)CamSample object for tumor sample.
CamSample object for germline sample. Used for CNA calling.
CamSample object for infiltrating normal sample. Used for deconvolution.
CamSample object for cell of origin sample. Used for differential methylation.
CamConfig object.
attach_output.RdManually assign output file to CAMDAC sample
+attach_output(sample, config, code, file)CamSample object
CamConfig object
Code for output file. See vignette("output") for descriptions.
Path to file to copy to expected location
bin_CpGs.Rdbin_CpGs returns the df with the annotation for each CpG
bin_CpGs(path, patient_id, sample_id, dt, anno_list, n_cores)Character string of the output directory
Character string containing the patient ID
Character string containing the sample ID.
data.table where each CG is a row with DMP info.
A data.table object containing annotated genomic bins including +genes, exons, introns, UTRs, CGI, CGI shores, CGI shelves, promoters or enhancers
number of cores for parallel processing
A dataframe for each sample_id with the copy number calls added
+calculate_m_t_hdi.RdCalculate HDI by simulation
+calculate_m_t_hdi(meth_c, n_cores, itersplit = 100000)call_dmps.RdCall differentially methylated positions
+call_dmps(
+ pmeth,
+ nmeth,
+ effect_size = 0.2,
+ prob = 0.99,
+ itersplit = 500000,
+ ncores = 5
+)call_dmr_routine.RdFunction to call DMRs on a camdac dmp dataset
+call_dmr_routine(
+ tmeth_dmps,
+ regions_annotations,
+ min_DMP_counts,
+ min_consec_DMP
+)camdac_to_battenberg_prepare_wgbs.Rdcamdac_to_battenberg_prepare_wgbs converts CAMDAC allele counter results to a format for processing.
camdac_to_battenberg_prepare_wgbs(
+ tumour_prefix,
+ normal_prefix,
+ camdac_tsnps,
+ outdir
+)CAMDAC tumour allele counts filepath. Expected *.gz
CAMDAC normal allele counts filepath. Expected *.gz
CAMDAC tumour-normal-snps object. Expected *.gz
allelecounter formatted-file output directory.
File handle for allele counter file generated
+cmain_bind_snps.RdCombine tumour-normal SNP file for CNA analysis (ASCAT or BATTENBERG)
+cmain_bind_snps(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_call_cna.RdConfig determines whether ASCAT or Battenberg is used
+cmain_call_cna(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_call_dmps.RdSingle-sample DMP calling on CAMDAC-deconvolved data
+cmain_call_dmps(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_call_dmrs.RdSingle-sample DMR calling on CAMDAC DMP data
+cmain_call_dmrs(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_count_alleles.RdCount alleles
+cmain_count_alleles(sample, config)A camdac sample object
A camdac allele object
cmain_deconvolve_methylation.RdDeconvolve methylation
+cmain_deconvolve_methylation(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_make_methylation_profile.RdPre-process methylation from allele counts for CAMDAC deconvolution
+cmain_make_methylation_profile(sample, config)A camdac sample object
A camdac config object
cmain_make_snps.RdFormat and save SNP file for CNA analysis (ASCAT or BATTENBERG)
+cmain_make_snps(sample, config)A camdac sample object
A camdac config object
cmain_run_ascat.RdExpects SNP profiles to have been created using cmain_make_snp_profiles
cmain_run_ascat(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_run_battenberg.RdExpects SNP profiles to have been created using cmain_make_snp_profiles
cmain_run_battenberg(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
collapse_cpg_to_dmr.RdSummarise CG stats per DMR
+collapse_cpg_to_dmr(dt)compute_tumour_methylome.Rdcompute_tumour_methylome returns the data.table dt annotated with
+CAMDAC pure tumour methylation rates
compute_tumour_methylome(dt, p, min_cov_t = 3, sex, build)data.table object with each CpG and their coverage, counts methylated, +methylation rate and copy number and matched normal methylation info
Numerical - Sample purity estimates
Numerical - Minimum tumour coverage
Character variable with the patient expressed as "XX" for female or "XY" for male.
Character variable corresponding to the reference genome used for alignment.
A dataframe for each sample_id with the tumour methylome added
+cwrap_asm_get_allele_counts.RdCount alleles for reads phased to SNPs in a BAM file
+cwrap_asm_get_allele_counts(
+ bam_file,
+ snps_gr,
+ loci_dt,
+ paired_end,
+ drop_ccgg,
+ min_mapq = min_mapq,
+ min_cov = min_cov
+)Path to BAM file
GRanges object with heterozygous SNP loci for phasing
Data table with CAMDAC CpG loci from reference files
Logical indicating if BAM is paired end
Logical indicating if CCGG should be dropped (i.e. rrbs mode)
Minimum mapping quality to consider a read
Minimum coverage to consider a read
A list with three slots: stats, qnames and asm_cg. stats describes counts of reads phased, +qnames determines which SNPs each read was phased to and asm_cg is the data table with read counts
+download_pipeline_files.RdCAMDAC pipeline files are required for analysis. This function downloads the files to +the output directory and unpacks them. By default, CAMDAC searches for the files in the +environment variable CAMDAC_PIPELINE_FILES. If this is missing, the current directory is used.
+download_pipeline_files(bsseq, directory = NULL, quiet = TRUE)Optional. Directory to download files to.
Sequencing assay. Either wgbs or rrbs.
format_methylation_dfformat_methylation_df.RdFormat methylation rates
+format_methylation_df
format_methylation_df(
+ dt,
+ sample_id,
+ normal_ids,
+ path_output,
+ n_cores,
+ suffix,
+ trim = FALSE
+)data.table containing the methylation information for each CpG
sample ID
sample ID of normal sample(s)
output directory
number of threads for HDI calculation
string containing the column names suffix for normal samples +This is to distinguish between the proxy supplied for the normal infiltrates +for use in deconvolution and the normal cell of origin for use in DMP/DMR calling
Logical value indicating whether regions with extremely high coverage should be trimmed
A GRanges object with all the CpG loci, their coverage, counts methylated and methylation rate
+format_outputformat_output.RdFormat output nucleotide counts
+format_output
format_output(
+ patient_id,
+ sample_id,
+ sex,
+ is_normal = FALSE,
+ path,
+ path_to_CAMDAC,
+ build
+)Character variable containing the patient ID number
Character variable with the sample ID
Character variable with the patient sex expressed as "XX" for female or "XY" for male.
Logical flag indicating whether the sample to be formatted is a normal (TRUE) or tumour (FALSE) sample
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions. +Do not alter the output directory structure while running CAMDAC.
Character variable containing the path to the CAMDAC directory +including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38". +is desired in addition to GRanges object in .RData file
Concatenated SNP and CpG information
+get_DMPs.Rdget_DMPs returns a df with annotated statistics for each CpG
get_DMPs(path, patient_id, sample_id, df, prob = 0.99, n_cores)Complete path to the CAMDAC methylation output directory +for this sample
Character string containing the patient number
Character variable with the tumour sample_id
A data.table with pure, bulk and normal methylation info
Numerical value representing the threshold for statistically +significant DMP (default is p=0.99)
Number of cores to do the statistical testing over
A data.table object with all the CpG loci, their coverage, counts +methylated and methylation rate
+get_DMRs.Rdannotate_DMRs returns the df with the annotation for each CpG
get_DMRs(
+ path,
+ patient_id,
+ sample_id,
+ dt,
+ anno_list,
+ min_DMP_counts,
+ min_consec_DMP,
+ n_cores,
+ bulk = FALSE
+)Character string of the output directory
Character string containing the patient ID
Character string containing the sample ID.
dataframe where each CG is a row with DMP info.
A data.table object containing annotated genomic bins including +genes, exons, introns, UTRs, CGI, CGI shores, CGI shelves, promoters or enhancers
Numerical - number of DMPs required in a DMR
Numerical - number of consecutive DMPs required in a DMR
number of cores for parallel processing
A dataframe for each sample_id with the copy number calls added
+get_allele_counts.RdCompile allele counts at SNPs and at CpGs for bisulfite sequencing data
get_allele_counts(
+ i,
+ patient_id,
+ sample_id,
+ sex,
+ bam_file,
+ mq = 0,
+ path,
+ path_to_CAMDAC,
+ build = NULL,
+ n_cores,
+ test = FALSE,
+ paired_end = TRUE
+)Integer loop index. The function must be run with all values from 1 to 25, each containing +1/25th of the RRBS covered genome.
Character variable containing the patient id
Character variable with the sample id
Character variable with the patient sex expressed as "XX" for female or "XY" for male.
Character variable with the full bam file name and path
Character variable or numeric containing the mapping quality threshold to be used. +For RRBS, set mq=0. Read mapping validity is based on read start site and nucleotides rather than mq.
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions. +Do not alter the output directory structure while running CAMDAC. +The output of this function will be a sub-directory of the path variable under +"./Allelecounts/sample_id/". Do not change the directory structure as subsequent functions will +look for files in this directory.
Character variable containing the CAMDAC installation path (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the number of cores for parallel processing
Logical value indicating whether this is a quick test run with data subsampling
One .fst file including methylation info at CpGs and BAF and depth of coverage at +SNPs for the ith subset of RRBS loci
+get_cluster_counts.RdCount CpGs within DMP annotations
+get_cluster_counts(dt)get_differential_methylation.Rdget_differential_methylation
get_differential_methylation(
+ patient_id,
+ sample_id,
+ sex,
+ normal_origin_proxy_id,
+ path,
+ path_to_CAMDAC,
+ build,
+ effect_size = 0.2,
+ prob = 0.99,
+ min_DMP_counts_in_DMR = 5,
+ min_consec_DMP_in_DMR = 4,
+ n_cores,
+ reseg = FALSE,
+ bulk = FALSE
+)Character variable containing the patient id number
Character variable with the tumour sample_id
Character variable with the patient sex expressed as "XX" for +female or "XY" for male.
Character variable with the sample ID +of the normal to be used as a proxy for the tumour cell of origin in +differential methylation analyses.
Character path variable pointing to the desired working +directory. This is where the output will be stored.
Character variable containing the path to the CAMDAC +directory including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome +used for alignment. CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical containing the minimum tumour-normal methylation +difference (default is 0.2)
Numerical value representing the threshold for statistically +significant DMP (default is p=0.99)
Numerical value representing the number of +DMPs required in a DMR
Numerical value representing the number of +consecutive DMPs required in a DMR
Numerical value corresponding to the number of cores +for parallel processing
Logical value should be set to FALSE. Multi-sample re-segmentation of +the copy number profiles will be available in future versions of CAMDAC.
Default is FALSE unless you want bulk DMP/DMR calls in addition +to CAMDAC pure tumour differential methylation analysis
+Note: +Annotations include: +CGI (including shores and shelves) +gene body (intragenic, 5'UTR, 3'UTR, intron, exon) +promoter (2 kb upstream and 500 bp downstream of any UCSC-annotated gene) +enhancer (VISTA and FANTOM5 annotation)
Biologically significant DMPs, DMRs
+get_msp1_fragments.Rdget msp1 fragments
+get_msp1_fragments(dt, build, path_to_CAMDAC, outfile)data.table object containing all covered CCGGs in the sample
Character, either "hg19", "hg38", "GRCH37","GRCH38"
Character string containing the path to the CAMDAC dir including +dir name e.g. "~/CAMDAC/"
character string with output filename
get_pure_tumour_methylation.Rdget_pure_tumour_methylation
get_pure_tumour_methylation(
+ patient_id,
+ sample_id,
+ sex,
+ normal_infiltrates_proxy_id,
+ path,
+ path_to_CAMDAC,
+ build,
+ n_cores,
+ reseg = FALSE
+)Character variable containing the patient id number
Character variable with the (control or tumour) sample_id
Character variable with the patient sex expressed as "XX" for +female or "XY" for male.
Sample ID of the matched normal control
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions.
Character variable containing the path to the CAMDAC +directory including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome +used for alignment. CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the number of cores +for parallel processing
Logical value should be set to FALSE. Multi-sample re-segmentation of +the copy number profiles will be available in future versions of CAMDAC.
+Note: +Annotations include: +CGI (including shores and shelves) +gene body (intragenic, 5'UTR, 3'UTR, intron, exon) +promoter (2 kb upstream and 500 bp downstream of any UCSC-annotated gene) +enhancer (VISTA and FANTOM5 annotation)
CAMDAC purified tumour methylation rates
+get_reference_files.RdGet CAMDAC reference files from config
+get_reference_files(config, type_folder, glob = NULL)helper_camdac_pileup.RdCache existing CAMDAC results into a sub-directory so that the current ones can be +overwritten by the refitting pipeline +Decided this is unnecessary as the initial results were so wrong. +Exported only for development
+helper_camdac_pileup(bam_file, seg, loci_dt)
+ All functions

| Function | Description |
|---|---|
| CamConfig() | Set CAMDAC configuration |
| CamSample() | Build CAMDAC sample object |
| attach_output() | Manually assign output file to CAMDAC sample |
| cmain_bind_snps() | Bind SNPs |
| cmain_call_cna() | Call CNA |
| cmain_call_dmps() | Call tumour-normal DMPs |
| cmain_call_dmrs() | Call tumour-normal DMRs |
| cmain_count_alleles() | Count alleles |
| cmain_deconvolve_methylation() | Deconvolve methylation |
| cmain_make_methylation_profile() | Make methylation |
| cmain_make_snps() | Make SNPs |
| cmain_run_ascat() | Run ASCAT.m |
| cmain_run_battenberg() | Run battenberg |
| download_pipeline_files() | Download CAMDAC pipeline files |
| get_reference_files() | Get CAMDAC reference files from config |
| load_cna_data() | Parse ASCAT and Battenberg output directories to load CNA data |
| load_panel_ac_files() | Load allele count files |
| panel_asm_from_counts() | Panel ASM from counts. Basic function to create an ASM methylation panel from allele count or ASM meth files. WARNING: In active development. |
| panel_meth_from_beta() | Make CAMDAC methylation panel from a matrix of beta values |
| panel_meth_from_counts() | Make CAMDAC methylation panel from allele counts. Methylation fractions are obtained by summing M and UM reads across samples |
| pipeline() | CAMDAC analysis pipeline |
| preprocess_asm() | Preprocess a list of CamSample objects for ASM analysis |
| preprocess_wgbs() | Preprocess a list of CamSample objects for analysis |

intervalWidth_r.RdCalculate intervalWidth_r
+intervalWidth_r(lowTailPr, ICDFname, credMass, ...)is R's name for the inverse cumulative density function +of the distribution.
is the desired mass of the HDI region.
is passed to R's optimize function, +the lower the tolerance, the longer the optimisation, but the higher the accuracy. +tol=1e-4 gives values of the same accuracy as our max resolution +Return value: +Highest density interval (HDI) limits in a vector. +Example of use: For determining HDI of a beta(30,12) distribution, type +HDIofICDF( qbeta , shape1 = 30+1 , shape2 = 12+1 ) +Notice that the parameters of the ICDFname must be explicitly named; +e.g., HDIofICDF( qbeta , 30+1 , 12+1 ) does not work. +Adapted and corrected from Greg Snow's TeachingDemos package. +Source fct outside of loop to speed up code
load_cna_data.RdSee "annotate_copy_number" func +A function required to load copy number for a tumour sample from camdac, either ascat or bb, +result should be: chrom, start, end, nA, nB, CN (total), seg_min and seg_max. +This should also include the purity and ploidy. As a separate list? +note that seg_min and seg_max are actually duplicates of the start and end columns, required to +keep track of the ascat segment positions after overlap +WARN: Dropping sex chromosomes is intended but not implemented. CN=0 (hom del) regions should also be dropped
+load_cna_data(tumour, config, data_type)load_panel_ac_files.RdLoad allele count files
+load_panel_ac_files(ac_files, cores = 5)Allele count files from CAMDAC
List of data tables for each allele counts file
+panel_asm_from_counts.RdPanel ASM from counts +Basic function to create an ASM methylation panel from allele count or ASM meth files +WARNING: In active development.
+panel_asm_from_counts(c1, c2)First ASM allele counts file to merge
Second ASM allele counts file to merge
panel_meth_from_beta.RdMake CAMDAC methylation panel from a matrix of beta values
+panel_meth_from_beta(
+ mat,
+ chrom,
+ start,
+ end,
+ cov,
+ props,
+ cores,
+ min_samples = 1,
+ max_sd = 1
+)Matrix of beta values. Rows are CpGs, columns are samples
Vector of chromosome names
Vector of CpG start positions
Vector of CpG end positions
Vector of coverage values to give each CpG site. If a matrix is provided, coverage is calculated as the sum of reads for each site.
Number of cores to use for calculating HDI
Minimum number of samples that must have a non-NA value for a CpG site to be included in panel
Maximum standard deviation of methylation for a site to be included in panel.
panel_meth_from_counts.RdMake CAMDAC methylation panel from allele counts +Methylation fractions are obtained by summing M and UM reads across samples
+panel_meth_from_counts(
+ ac_files,
+ ac_props = NULL,
+ min_coverage = 3,
+ min_samples = 1,
+ max_sd = 1,
+ drop_snps = FALSE,
+ cores = 5
+)Allele count files from CAMDAC
Proportions of each sample to use in panel. If NULL, samples are weighted by their +total number of reads, which equals the sum of M and UM counts. If samples are NA, then +proportions are redistributed.
Minimum coverage for a sample's site to be included in panel
Minimum number of samples with coverage for a site to be included in panel
Maximum standard deviation of methylation for a site to be included in panel
Boolean. If TRUE, drop per-sample CG-SNPs (BAF < 0.1 or BAF > 0.9) from panel
Number of cores to use for calculating HDI
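The pooling described for panel_meth_from_counts can be illustrated with a minimal base R sketch. It is not the package's implementation: the matrices `M` and `UM` and the `min_samples` value below are hypothetical, and the sketch only shows the arithmetic, namely that summing M and UM reads across samples is the same as averaging per-sample rates weighted by coverage, with NA samples dropping out of the weights.

```r
# Hypothetical counts: rows are CpG sites, columns are samples (not CAMDAC output).
M  <- matrix(c(8, 2, 5,
               6, 1, NA), nrow = 3,
             dimnames = list(paste0("cpg", 1:3), c("s1", "s2")))
UM <- matrix(c(2, 8, 5,
               4, 3, NA), nrow = 3,
             dimnames = dimnames(M))

cov <- M + UM

# Panel methylation fraction: pooled methylated reads over pooled coverage.
pooled <- rowSums(M, na.rm = TRUE) / rowSums(cov, na.rm = TRUE)

# Equivalent view: per-sample rates weighted by each sample's share of reads.
rates    <- M / cov
weights  <- cov / rowSums(cov, na.rm = TRUE)
weighted <- rowSums(rates * weights, na.rm = TRUE)

# Keep sites covered in at least `min_samples` samples (hypothetical threshold).
min_samples <- 2
keep <- rowSums(!is.na(cov)) >= min_samples
panel <- pooled[keep]
```

Here `cpg3` is covered in only one sample, so it is removed when `min_samples` is 2.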
pipeline.RdCAMDAC analysis pipeline
+pipeline(tumor, germline, infiltrates, origin, config)Tumor CamSample() object for deconvolution.
Patient-matched normal CamSample() object. May be NULL if tumor has CNA calls already.
Normal CamSample() as a proxy for infiltrating normal methylation.
Normal CamSample() representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
pipeline_rrbs.RdCall CAMDAC for a tumor and patient-matched normal sample
+pipeline_rrbs(tumor, germline, infiltrates, origin, config)Tumor CamSample object for deconvolution.
Patient-matched normal CamSample object. May be NULL if tumor has CNA calls already.
Normal CamSample as a proxy for infiltrating normal methylation.
Normal CamSample representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
pipeline_wgbs.RdRun CAMDAC WGBS analysis on a bulk tumor and patient-matched tissue-matched tumor-adjacent normal sample.
+pipeline_wgbs(
+ tumor,
+ germline = NULL,
+ infiltrates = NULL,
+ origin = NULL,
+ config
+)Tumor CamSample object for deconvolution.
Patient-matched normal CamSample object. May be NULL if tumor has CNA calls already.
Normal CamSample as a proxy for infiltrating normal methylation.
Normal CamSample representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
plot_2d_density.Rdplot_2d_density
+plot_2d_density(dt, path)Data table with methylation information per CpG
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions.
plot_BAF_and_LogR.RdPlot BAF and logR profiles with ggplot
+plot_BAF_and_LogR(dt, outfile, downsample = 100000)data.frame with methylation info
character string with output pdf filename +Saves a pdf w/ methylation rate distribution, biases at polymorphic and +non-polymorphic CG/CCGG and coverage distribution
plot_SNP_info.Rdplot_SNP_info plots SNP QC
plot_SNP_info(dt, outfile, min)data.table with SNP info
character string with output pdf filename
plot_methylation_info.RdCreates table grob in format that is most common for my usage.
+plot_methylation_info(df_sample, outfile)data.frame with methylation info
character string with output pdf filename
Data.table that the grob will be made out of
Title for display
Fontsize for title. Default is 14 (goes well with my_theme)
pdf w/ methylation rate distribution, biases at polymorphic and non-polymorphic CG/CCGG and coverage distribution
+plot_methylation_info returns the df_sample with annotated q-value for each CpG
plot_methylation_info_with_anno.RdPlot methylation information
+plot_methylation_info_with_anno(dt, path, bulk)Data table with methylation information per CpG
Character path variable pointing to the desired working directory.
Logical determining whether the bulk or purified tumour is to be plotted
plot_normal_SNP_info.RdPlot plots SNP QC
+plot_normal_SNP_info(dt, outfile, min)data.table with SNP info
character string with output pdf filename
preprocess_asm.RdPreprocess a list of CamSample objects for ASM analysis
+preprocess_asm(sample_list, config)List of CamSample objects.
CamConfig object.
preprocess_wgbs.RdPreprocess a list of CamSample objects for analysis
+preprocess_wgbs(sample_list, config)List of CamSample objects.
CamConfig object.
remove_low_cov_singletons.RdRemove low coverage singletons outliers
+remove_low_cov_singletons(dt_sample_SNPs, min)round2.RdRound numerical values to 'n' digits
+round2(x, digits)Numerical vector containing the numbers to round
Numerical value representing the number of decimal digits to retain
rounded numerical vector
+run_ASCAT.m.Rdrun_ASCAT.m
run_ASCAT.m(
+ patient_id,
+ sample_id,
+ sex,
+ patient_matched_normal_id = NULL,
+ path,
+ path_to_CAMDAC,
+ build,
+ min_normal = 10,
+ min_tumour = 1,
+ n_cores = 1,
+ reference_panel_coverage = NULL
+)Character variable containing the patient id number
Character variable with the (control or tumour) sample_id
Character variable with the patient sex expressed as "XX" for female +or "XY" for male. +This is important for copy number profiling. If sex is unknown, put "XY" for now, +then look at the allelic imbalance (BAF) on X in the germline outside pseudo- +autosomal regions. If there are little to no heterozygous SNPs, the sample is likely male.
Character variable with the sample ID of the matched normal control
Character path variable pointing to the desired working directory. +This is where the output will be stored +IMPORTANT: The function output directory will be in the path variable working +directory under "./Copy_number/sample_id/".
Character variable containing the path to the CAMDAC dir +including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the minimum counts for germline +SNPs to be included (default: 10)
Numerical value corresponding to the minimum counts in the tumour +sample for germline SNPs to be included (default: 1)
Numerical value corresponding to the number of cores for parallel processing
Path to the reference panel for the coverage.
Three text files with all the CpG loci and their SNP and/or CpG methylation info
+run_methylation_data_processing.RdFilter bulk tumour and normal methylation data, get methylation rate highest density interval (HDI)
+and plot raw methylation info
run_methylation_data_processing(
+ patient_id,
+ sample_id,
+ normal_infiltrates_proxy_id,
+ normal_origin_proxy_id,
+ path,
+ min_normal = 10,
+ min_tumour = 3,
+ n_cores,
+ reference_panel_normal_infiltrates = NULL,
+ reference_panel_normal_origin = NULL
+)Character variable containing the patient ID
Character variable with the (control or tumour) sample ID
Character variable with the sample ID of +the tissue-matched normal acting as proxy for the tumour infiltrating +normal cells. Ideally, this is a patient and tissue-matched tumour adjacent normal sample.
Character variable with the sample ID +of the normal to be used as a proxy for the tumour cell of origin in +differential methylation analyses.
Character path variable pointing to the desired working directory. +This is where the output will be stored.
Numerical value corresponding to the minimum counts threshold for +the normal CpGs to be included
Numerical value corresponding to the minimum counts threshold +for the tumour sample CpGs to be included
Numerical value corresponding to the number of cores for parallel processing
Default is NULL. Character string with the complete +path to a reference methylation profile for the tumour normal infiltrates as a .fst file.
Default is NULL. Character string with the complete +path to your reference methylation profile for the tumour cell of origin as a .fst file.
+If a patient-matched proxy for the normal infiltrates and/or the normal cell of origin is not +available, a reference panel may be constructed from different individuals and used as a substitute.
+The reference samples should be at the very least sex-matched.
+The reference should be saved as a .fst file with the following columns:
+CHR start end M_n UM_n m_n cov_n
+
where each row is a CpG or CCpGG with coordinates CHR:start-end +The start and end columns correspond to the 5'-C and 3'-G coordinate, respectively. +M_n is the number of reads supporting the methylated allele +UM_n is the number of reads supporting the unmethylated allele +m_n is the normal methylation rate (M_n / (M_n+UM_n)) +cov_n is the total CpG methylation informative reads counts (M_n+UM_n)
GRanges object in .RData file
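The reference panel columns described above can be assembled as in the following sketch. The coordinates, counts, chromosome naming and output file name are hypothetical; only the column names and the definitions of m_n and cov_n come from the documentation. Writing the .fst file needs the fst package, so that line is left commented and the rest runs in base R.

```r
# Hypothetical per-CpG counts pooled from sex-matched reference normals.
ref <- data.frame(
  CHR   = c("1", "1", "2"),          # chromosome naming style is an assumption
  start = c(10468L, 10470L, 20001L), # 5'-C coordinate
  end   = c(10469L, 10471L, 20002L), # 3'-G coordinate
  M_n   = c(18L, 0L, 7L),            # reads supporting the methylated allele
  UM_n  = c(2L, 25L, 7L),            # reads supporting the unmethylated allele
  stringsAsFactors = FALSE
)

# Derived columns, as defined in the documentation.
ref$cov_n <- ref$M_n + ref$UM_n
ref$m_n   <- ref$M_n / ref$cov_n

# Column order expected for the reference panel.
ref <- ref[, c("CHR", "start", "end", "M_n", "UM_n", "m_n", "cov_n")]

# Save as .fst (requires the fst package; the file name is illustrative).
# fst::write_fst(ref, "reference_panel_normal_origin.fst")
```

The resulting path would then be supplied through reference_panel_normal_infiltrates or reference_panel_normal_origin.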
+sort_genomic_dt.Rdsort_genomic_dt
+Sort a data table with genomic coordinates
+sort_genomic_dt(dt, with_chr = F)An object that is a data.table
A boolean to indicate whether the chrom field has UCSC (TRUE) or NCBI (FALSE) format
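As an illustration of what sorting by genomic coordinates involves, here is a minimal base R sketch. `sort_genomic_sketch` is a hypothetical helper, not CAMDAC's sort_genomic_dt(), and it assumes columns named `chrom` and `start`; the point is that chromosomes must be ranked in karyotype order rather than alphabetically, with or without the UCSC "chr" prefix.

```r
# Hypothetical helper; assumes columns `chrom` and `start`.
sort_genomic_sketch <- function(df, with_chr = FALSE) {
  chrom_levels <- c(1:22, "X", "Y")                       # NCBI-style names
  if (with_chr) chrom_levels <- paste0("chr", chrom_levels) # UCSC-style names
  chrom_rank <- match(as.character(df$chrom), chrom_levels)
  df[order(chrom_rank, df$start), , drop = FALSE]
}

df <- data.frame(
  chrom = c("X", "10", "2", "2"),
  start = c(5L, 100L, 300L, 20L),
  stringsAsFactors = FALSE
)
sorted <- sort_genomic_sketch(df)
```

A plain alphabetical sort would place "10" before "2", which is why the rank lookup is used.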
split_segments_gr.RdSplit genome into segments for allele counting
+split_segments_gr(segments_file, n_seg_split)An RDS file containing a GRanges object with each chromosome region to split
An integer to split each chromosome segment
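The idea of splitting a chromosome region into a fixed number of pieces for parallel allele counting can be sketched in base R. `split_region_sketch` is a hypothetical helper operating on plain coordinates; the package function works on a GRanges object read from an RDS file, and its exact splitting rule is not documented here, so treat this as one plausible scheme.

```r
# Hypothetical helper: split one [start, end] region into n contiguous chunks.
split_region_sketch <- function(start, end, n) {
  # n + 1 break points from the first base to one past the last base.
  breaks <- floor(seq(start, end + 1, length.out = n + 1))
  data.frame(start = breaks[-(n + 1)], end = breaks[-1] - 1)
}

chunks <- split_region_sketch(1, 1000, 4)
```

Each chunk starts one base after the previous chunk ends, so the chunks tile the region without gaps or overlaps as long as the region is at least n bases wide.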
Copy-number Aware Methylation Deconvolution Analysis of Cancer (CAMDAC) is an R library for deconvolving bulk tumor DNA methylation (bisulfite) sequencing data (Larose Cadieux et al., 2022, bioRxiv).
+ +A CAMDAC container is available on dockerhub for use with Docker, Singularity or Apptainer:
+docker pull nmensah5/camdac:latest
+echo "library(CAMDAC)" > commands.R
+docker run -v $(pwd):/app nmensah5/camdac:latest Rscript commands.RYou can install CAMDAC and its dependencies from an R console:
+
+install.packages("remotes")
+remotes::install_github("VanLoo-lab/CAMDAC")We provide pre-built reference datasets for hg38 and hg19. These files are required to run CAMDAC for either RRBS or WGBS analysis and are available from the Zenodo repository (10565423). An R getter function is provided for convenience:
+
+CAMDAC::download_pipeline_files(bsseq = "rrbs", directory = "./refs")
+CAMDAC::download_pipeline_files(bsseq = "wgbs", directory = "./refs")For WGBS analysis, CAMDAC requires the java command line utility to be available in the system PATH.
With reference files downloaded, run the tumor-normal deconvolution pipeline with test data:
++[!NOTE]
+We provide downsampled BAM files for testing the pipeline. For representative results, please use your own BAM files.
+library(CAMDAC)
+
+tumor_bam <- system.file("testdata", "tumour_beds_min.sorted.bam", package = "CAMDAC")
+normal_bam <- system.file("testdata", "normal_beds_min.sorted.bam", package = "CAMDAC")
+
+# Select samples for basic tumor-normal analysis
+tumor <- CamSample(id = "T", sex = "XY", bam = tumor_bam, patient_id="readme")
+normal <- CamSample(id = "N", sex = "XY", bam = normal_bam, patient_id="readme")
+
+# Configure pipeline
+config <- CamConfig(
+ outdir = "./validation/results/test_readme/", bsseq = "rrbs", lib = "pe",
+ build = "hg38", refs = "./refs", n_cores = 1, cna_caller='ascat',
+ min_cov=1, # Minimum tumour coverage at 1 for testing.
+ min_normal_cov=1, # Minimum normal coverage at 1 for testing.
+ min_mapq=1 # Minimum MAPQ at 1 for testing.
+)
+
+# Run CAMDAC
+CAMDAC::pipeline(
+ tumor, germline = normal, infiltrates = normal, origin = normal, config
+)For a more detailed walkthrough with test data, see vignette("pipeline").
To contribute to CAMDAC, fork the repository and install the development dependencies with remotes::install_dev_deps('.').
After making your changes, run the build and test commands listed in vignette("contributing").
Finally, submit a pull request with the changes on your fork.
+pipeline() function.CamConfig.RdSet CAMDAC configuration
+CamConfig(
+ outdir,
+ bsseq,
+ lib,
+ build,
+ n_cores = 1,
+ regions = NULL,
+ refs = NULL,
+ n_seg_split = 50,
+ min_mapq = 1,
+ min_cov = 1,
+ min_normal_cov = 10,
+ overwrite = FALSE,
+ cna_caller = "battenberg",
+ cna_settings = NULL
+)A path to save CAMDAC results. The results folder structure +follows the format PATIENT/DATASET/SAMPLE/.
Bisulfite sequencing platform. Choose between "wgbs" or "rrbs".
Bisulfite sequencing library. Choose "pe" for paired end, "se" for single end.
Reference genome build. Choose "hg38" or "hg19".
Number of cores to process CAMDAC data in parallel wherever possible.
A BED file with regions to restrict the analysis to
Path to CAMDAC reference files. If this is not given, CAMDAC searches the +environment variable CAMDAC_PIPELINE_FILES. If this is not set, CAMDAC searches recursively in the current +working directory.
Minimum mapping quality filter used in cmain_allele_counts().
Minimum coverage filter for: DNA methylation, Normal SNP selection.
Config to overwrite files if they already exist.
The CNA caller to use. "ascat" or "battenberg". Default is "battenberg"
A list of settings to pass to the CNA caller. rho, psi, java, beaglemaxmem
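The lookup order documented for refs (explicit argument, then the CAMDAC_PIPELINE_FILES environment variable, then the current working directory) can be sketched as follows. `resolve_refs_sketch` is a hypothetical helper written for illustration and is not part of CAMDAC; the example paths are made up.

```r
# Hypothetical helper illustrating the documented lookup order.
resolve_refs_sketch <- function(refs = NULL) {
  if (!is.null(refs)) return(refs)          # 1. explicit `refs` argument
  env <- Sys.getenv("CAMDAC_PIPELINE_FILES", unset = "")
  if (nzchar(env)) return(env)              # 2. environment variable
  getwd()                                   # 3. root for the working-directory search
}

Sys.setenv(CAMDAC_PIPELINE_FILES = "./refs")
from_env <- resolve_refs_sketch()
from_arg <- resolve_refs_sketch("/data/camdac_refs")

Sys.unsetenv("CAMDAC_PIPELINE_FILES")
from_cwd <- resolve_refs_sketch()
```

Setting the environment variable once (for example in .Renviron) avoids passing refs to every CamConfig() call.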
CamSample.RdBuild CAMDAC sample object
+CamSample(id, sex, bam = NULL, patient_id = "P")Unique identifier for the sample
The sex of the patient, "XX" or "XY". Required for CNA calling.
Sample BAM file. If not given, CAMDAC expects files linked with attach_output.
An identifier for the patient
HDIofICDF.RdHDI of ICDF
+HDIofICDF(ICDFname, credMass = 0.99, tol = 0.0001, ...)The inverse cumulative density function of the distribution.
The desired mass of the HDI region.
Tolerance parameter for optimisation. The lower the tolerance, the +longer the optimisation, but the higher the accuracy. +According to CAMDAC RRBS comments, tol=1e-4 gives values +of the same accuracy as our max resolution. +This function is adapted from Greg Snow's TeachingDemos package. +E.g. determine the HDI of a CpG with M=30 and UM=12. +Adding 1 to each shape parameter ensures the uniform beta(1,1) prior is updated with our counts: +HDIofICDF(qbeta, shape1 = 30+1, shape2 = 12+1)
Highest density interval (HDI) limits in a vector.
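The interval-width minimisation behind an HDI computed from an inverse CDF can be sketched in base R. `hdi_icdf_sketch` is a hypothetical name and the code is a generic version of the technique, not the package source: among all intervals containing credMass of the distribution, it searches for the lower-tail probability that gives the narrowest one.

```r
# Generic sketch; `hdi_icdf_sketch` is a hypothetical name.
hdi_icdf_sketch <- function(icdf, cred_mass = 0.99, tol = 1e-4, ...) {
  # Width of the interval that starts at lower-tail probability `low_p`
  # and contains `cred_mass` of the distribution.
  width <- function(low_p) icdf(cred_mass + low_p, ...) - icdf(low_p, ...)
  low_p <- optimize(width, c(0, 1 - cred_mass), tol = tol)$minimum
  c(icdf(low_p, ...), icdf(cred_mass + low_p, ...))
}

# CpG with M = 30 methylated and UM = 12 unmethylated reads, uniform prior.
hdi <- hdi_icdf_sketch(qbeta, shape1 = 30 + 1, shape2 = 12 + 1)

# Probability mass covered by the interval.
coverage <- pbeta(hdi[2], 31, 13) - pbeta(hdi[1], 31, 13)
```

As in the documented example, the distribution parameters are passed by name so that they reach the quantile function through `...`.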
+HDIofMCMC.RdHDI of MCMC
+HDIofMCMC(M_b, UM_b, M_n, UM_n, p, CN, CN_n, credMass = 0.99)counts methylated in the tumour
counts unmethylated in the tumour
counts methylated in the normal
counts unmethylated in the normal
tumour purity
total tumour copy number
total normal copy number
default is 0.99 +credMass is a scalar between 0 and 1, indicating the mass within the +credible interval that is to be estimated.
Value: HDIlim is a vector containing the limits of the HDI
+HDIofMCMC_mt.RdComputes highest density interval from a sample of representative values, +estimated as shortest credible interval for a unimodal distribution
+HDIofMCMC_mt(M_b, UM_b, M_n, UM_n, p, CN, credMass = 0.99)counts methylated in the tumour
counts unmethylated in the tumour
counts methylated in the normal
counts unmethylated in the normal
tumour purity
total tumour copy number
default is 0.99 +credMass is a scalar between 0 and 1, indicating the mass within the +credible interval that is to be estimated.
total normal copy number
Value: HDIlim is a vector containing the limits of the HDI
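Estimating an HDI as the shortest credible interval over a sample of representative values can be sketched as below. `hdi_samples_sketch` is a hypothetical helper that takes a vector of draws directly, whereas the documented functions take read counts, purity and copy number; the beta draws here are only a stand-in for whatever posterior sample is being summarised.

```r
# Hypothetical helper: shortest interval containing `cred_mass` of the samples.
hdi_samples_sketch <- function(samples, cred_mass = 0.99) {
  sorted <- sort(samples)
  n <- length(sorted)
  k <- ceiling(cred_mass * n)   # index offset spanning the credible mass
  n_ci <- n - k                 # number of candidate intervals
  stopifnot(n_ci >= 1)
  widths <- sorted[seq_len(n_ci) + k] - sorted[seq_len(n_ci)]
  i <- which.min(widths)        # narrowest candidate interval
  c(sorted[i], sorted[i + k])
}

set.seed(1)
# Stand-in draws for a CpG with 30 methylated and 12 unmethylated reads.
draws <- rbeta(10000, shape1 = 30 + 1, shape2 = 12 + 1)
hdi <- hdi_samples_sketch(draws)
```

The shortest-interval estimate matches the HDI only for unimodal distributions, as the function description notes.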
+LogR_correction.RdCorrect logR for msp1 fragment size bias and GC content
+LogR_correction(
+ dt_sample,
+ dt_SNPs,
+ build,
+ chr_names,
+ min_normal,
+ fragments_file,
+ replic_timing_file_prefix,
+ n_cores
+)Allelecounts output as a data.table
Allelecounts output subset to QC'ed SNP positions
Character variable corresponding to the reference genome version used for alignment
Character variable with the seqlevels.
Numerical with the minimum normal coverage threshold
CAMDAC reference MspI fragments file
CAMDAC reference replication timing files path and file name prefix
Numerical value corresponding to the number of cores for parallel processing
annotate_copy_number.Rdannotate_copy_number returns the data.table dt_sample annotated with allele-specific copy numbers
annotate_copy_number(dt_sample, seg, rm_sex_chrom = FALSE)data.table object with each CpG and their coverage, counts methylated and methylation rate
ASCAT.m copy number segments object
Logical indicating if you would like to remove sex chrom from downstream analyses
A dataframe for each sample_id with the copy number calls added
+ascat.m.plotRawData.RdPlot tumour and germline BAF and LogR
+ascat.m.plotRawData(ASCATobj, raw_LogR, pch, cex, lim_logR)an ASCAT object (e.g. data structure from ascat.loadData)
vector with the LogR values before correction
type of data points in plot
size of data points in plot
y-axis limits on logR plot
Produces png files showing the logR and BAF values for tumour and germline samples
+ascat.m.plotSegmentedData.RdPlot segmented BAF and LogR
+ascat.m.plotSegmentedData(ASCATobj, lim_logR = 2)an ASCAT object (e.g. data structure from ascat.loadData)
Produces png files showing the logR and BAF values for tumour and germline samples
+ascat.plotRawData.flags.RdPlot BAF LogR
+ascat.plotRawData.flags(ASCATobj, pch, cex, lim_logR)an ASCAT object (e.g. data structure from ascat.loadData)
type of data points in plot
size of data points in plot
y-axis limits on logR plot
Produces png files showing the logR and BAF values for tumour and germline samples
+asm_pipeline.RdRun allele-specific methylation analysis pipeline
+asm_pipeline(tumor, germline = NULL, infiltrates = NULL, origin = NULL, config)CamSample object for tumor sample.
CamSample object for germline sample. Used for CNA calling.
CamSample object for infiltrating normal sample. Used for deconvolution.
CamSample object for cell of origin sample. Used for differential methylation.
CamConfig object.
attach_output.RdManually assign output file to CAMDAC sample
+attach_output(sample, config, code, file)CamSample object
CamConfig object
Code for output file. See vignette("output") for descriptions.
Path to file to copy to expected location
bin_CpGs.Rdbin_CpGs returns the df with the annotation for each CpG
bin_CpGs(path, patient_id, sample_id, dt, anno_list, n_cores)Character string of the output directory
Character string containing the patient ID
Character string containing the sample ID.
data.table where each CG is a row with DMP info.
A data.table object containing annotated genomic bins including +genes, exons, introns, UTRs, CGI, CGI shores, CGI shelves, promoters or enhancers
number of cores for parallel processing
A dataframe for each sample_id with the copy number calls added
+calculate_m_t_hdi.RdCalculate HDI by simulation
+calculate_m_t_hdi(meth_c, n_cores, itersplit = 100000)call_dmps.RdCall differentially methylated positions
+call_dmps(
+ pmeth,
+ nmeth,
+ effect_size = 0.2,
+ prob = 0.99,
+ itersplit = 500000,
+ ncores = 5
+)call_dmr_routine.RdFunction to call DMRs on a camdac dmp dataset
+call_dmr_routine(
+ tmeth_dmps,
+ regions_annotations,
+ min_DMP_counts,
+ min_consec_DMP
+)camdac_to_battenberg_prepare_wgbs.Rdcamdac_to_battenberg_prepare_wgbs converts CAMDAC allele counter results to a format for processing.
camdac_to_battenberg_prepare_wgbs(
+ tumour_prefix,
+ normal_prefix,
+ camdac_tsnps,
+ outdir
+)CAMDAC tumour allele counts filepath. Expected *.gz
CAMDAC normal allele counts filepath. Expected *.gz
CAMDAC tumour-normal-snps object. Expected *.gz
allelecounter formatted-file output directory.
File handle for allele counter file generated
+cmain_bind_snps.RdCombine tumour-normal SNP file for CNA analysis (ASCAT or BATTENBERG)
+cmain_bind_snps(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_call_cna.RdConfig determines whether ASCAT or Battenberg is used
+cmain_call_cna(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_call_dmps.RdSingle-sample DMP calling on CAMDAC-deconvolved data
+cmain_call_dmps(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_call_dmrs.RdSingle-sample DMR calling on CAMDAC DMP data
+cmain_call_dmrs(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_count_alleles.RdCount alleles
+cmain_count_alleles(sample, config)A camdac sample object
A camdac allele object
cmain_deconvolve_methylation.RdDeconvolve methylation
+cmain_deconvolve_methylation(tumour, normal, config)A camdac sample object
A camdac sample object
A camdac config object
cmain_make_methylation_profile.RdPre-process methylation from allele counts for CAMDAC deconvolution
+cmain_make_methylation_profile(sample, config)A camdac sample object
A camdac config object
cmain_make_snps.RdFormat and save SNP file for CNA analysis (ASCAT or BATTENBERG)
+cmain_make_snps(sample, config)A camdac sample object
A camdac config object
cmain_run_ascat.RdExpects SNP profiles to have been created using cmain_make_snp_profiles
cmain_run_ascat(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
cmain_run_battenberg.RdExpects SNP profiles to have been created using cmain_make_snp_profiles
cmain_run_battenberg(tumour, config)A camdac sample object
A camdac config object
A camdac sample object
collapse_cpg_to_dmr.RdSummarise CG stats per DMR
+collapse_cpg_to_dmr(dt)compute_tumour_methylome.Rdcompute_tumour_methylome returns the data.table dt annotated with
+CAMDAC pure tumour methylation rates
compute_tumour_methylome(dt, p, min_cov_t = 3, sex, build)data.table object with each CpG and their coverage, counts methylated, +methylation rate and copy number and matched normal methylation info
Numerical - Sample purity estimates
Numerical - Minimum tumour coverage
Character variable with the patient expressed as "XX" for female or "XY" for male.
Character variable corresponding to the reference genome used for alignment.
A dataframe for each sample_id with the tumour methylome added
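The arguments above (bulk and normal methylation, copy number, purity) suggest a purity- and copy-number-weighted mixture. The following is a minimal sketch of inverting such a mixture, assuming the bulk rate is a DNA-weighted average of tumour and normal rates. The helper `pure_tumour_rate`, its argument names, and the clamping to [0, 1] are illustrative assumptions, not the implementation of compute_tumour_methylome.

```r
# Hypothetical helper, not part of CAMDAC.
# Assumed model: m_b = (p*n_t*m_t + (1-p)*n_n*m_n) / (p*n_t + (1-p)*n_n)
#   m_b : bulk tumour methylation rate
#   m_n : matched normal methylation rate
#   p   : tumour purity (0 < p <= 1)
#   n_t : tumour total copy number at the CpG
#   n_n : normal copy number (2 on autosomes)
pure_tumour_rate <- function(m_b, m_n, p, n_t, n_n = 2) {
  tumour_dna <- p * n_t
  normal_dna <- (1 - p) * n_n
  m_t <- (m_b * (tumour_dna + normal_dna) - m_n * normal_dna) / tumour_dna
  # Clamp to a valid rate; sampling noise can push the estimate outside [0, 1]
  pmin(pmax(m_t, 0), 1)
}

# Round trip: build a bulk rate from known components, then recover m_t
m_b <- (0.6 * 3 * 0.8 + 0.4 * 2 * 0.2) / (0.6 * 3 + 0.4 * 2)
m_t <- pure_tumour_rate(m_b, m_n = 0.2, p = 0.6, n_t = 3)
```

The round trip is only a consistency check of the algebra under the assumed model.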
+cwrap_asm_get_allele_counts.RdCount alleles for reads phased to SNPs in a BAM file
+cwrap_asm_get_allele_counts(
+ bam_file,
+ snps_gr,
+ loci_dt,
+ paired_end,
+ drop_ccgg,
+ min_mapq = min_mapq,
+ min_cov = min_cov
+)Path to BAM file
GRanges object with heterozygous SNP loci for phasing
Data table with CAMDAC CpG loci from reference files
Logical indicating if BAM is paired end
Logical indicating if CCGG should be dropped (i.e. rrbs mode)
Minimum mapping quality to consider a read
Minimum coverage to consider a read
A list with three slots: stats, qnames and asm_cg. stats describes counts of reads phased, +qnames determines which SNPs each read was phased to and asm_cg is the data table with read counts
+download_pipeline_files.RdCAMDAC pipeline files are required for analysis. This function downloads the files to +the output directory and unpacks them. By default, CAMDAC searches for the files in the +environment variable CAMDAC_PIPELINE_FILES. If this is missing, the current directory is used.
+download_pipeline_files(bsseq, directory = NULL, quiet = TRUE)Optional. Directory to download files to.
Sequencing assay. Either wgbs or rrbs.
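The lookup order described above (environment variable first, then the current directory) can be sketched in base R. The function name `resolve_pipeline_dir` is hypothetical, and giving an explicit `directory` argument precedence over the environment variable is an assumption.

```r
# Hypothetical helper, not part of CAMDAC: decide where pipeline files live.
resolve_pipeline_dir <- function(directory = NULL) {
  # Assumption: an explicitly supplied directory wins
  if (!is.null(directory)) {
    return(directory)
  }
  env_dir <- Sys.getenv("CAMDAC_PIPELINE_FILES", unset = "")
  # Fall back to the current directory when the variable is missing or empty
  if (nzchar(env_dir)) env_dir else getwd()
}

Sys.setenv(CAMDAC_PIPELINE_FILES = "/data/camdac_files")
from_env <- resolve_pipeline_dir()
Sys.unsetenv("CAMDAC_PIPELINE_FILES")
from_cwd <- resolve_pipeline_dir()
```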
format_methylation_df.RdFormat methylation rates
format_methylation_df(
+ dt,
+ sample_id,
+ normal_ids,
+ path_output,
+ n_cores,
+ suffix,
+ trim = FALSE
+)data.table containing the methylation information for each CpG
sample ID
sample ID of normal sample(s)
output directory
number of threads for HDI calculation
string containing the column names suffix for normal samples +This is to distinguish between the proxy supplied for the normal infiltrates +for use in deconvolution and the normal cell of origin for use in DMP/DMR calling
Logical value establishing whether regions with extremely high coverage should be trimmed
A GRanges object with all the CpG loci, their coverage, counts methylated and methylation rate
format_output.RdFormat output nucleotide counts
format_output(
+ patient_id,
+ sample_id,
+ sex,
+ is_normal = FALSE,
+ path,
+ path_to_CAMDAC,
+ build
+)Character variable containing the patient id number
Character variable with the sample ID
Character variable with the patient expressed as "XX" for female or "XY" for male.
Logical flag: TRUE if the sample to be formatted is a normal, FALSE (default) if it is a tumour
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions. +Do not alter the output directory structure while running CAMDAC.
Character variable containing the path to the CAMDAC directory, including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38". +is desired in addition to GRanges object in .RData file
Concatenated SNP and CpG information
+get_DMPs.Rdget_DMPs returns a df with annotated statistics for each CpG
get_DMPs(path, patient_id, sample_id, df, prob = 0.99, n_cores)Complete path to the CAMDAC methylation output directory for this sample
Character string containing the patient number
Character variable with the tumour sample_id
A data.table with pure, bulk and normal methylation info
Numerical value representing the threshold for statistically +significant DMP (default is p=0.99)
Number of cores to do the statistical testing over
A data.table object with all the CpG loci, their coverage, counts +methylated and methylation rate
+get_DMRs.Rdannotate_DMRs returns the df with the annotation for each CpG
get_DMRs(
+ path,
+ patient_id,
+ sample_id,
+ dt,
+ anno_list,
+ min_DMP_counts,
+ min_consec_DMP,
+ n_cores,
+ bulk = FALSE
+)Character string of the output directory
Character string containing the patient ID
Character string containing the sample ID.
dataframe where each CG is a row with DMP info.
A data.table object containing annotated genomic bins including +genes, exons, introns, UTRs, CGI, CGI shores, CGI shelves, promoters or enhancers
Numerical - number of DMPs required in a DMR
Numerical - number of consecutive DMPs required in a DMR
number of cores for parallel processing
A dataframe for each sample_id with the copy number calls added
get_allele_counts.RdCompile allele counts at SNPs and at CpGs for bisulfite sequencing data
get_allele_counts(
+ i,
+ patient_id,
+ sample_id,
+ sex,
+ bam_file,
+ mq = 0,
+ path,
+ path_to_CAMDAC,
+ build = NULL,
+ n_cores,
+ test = FALSE,
+ paired_end = TRUE,
+ segments_bed = NULL
+)Integer loop index. The function must be run with all values from 1 to 25, each containing +1/25th of the RRBS covered genome.
Character variable containing the patient id
Character variable with the sample id
Character variable with the patient sex expressed as "XX" for female or "XY" for male.
Character variable with the full bam file name and path
Character variable or numeric containing the mapping quality threshold to be used. For RRBS, set mq=0. Read mapping validity is based on read start site and nucleotides rather than mq.
Character path variable pointing to the desired working directory. This is where the output will be stored and should be constant for all CAMDAC functions. Do not alter the output directory structure while running CAMDAC. The output of this function will be written to a sub-directory of the path variable under "./Allelecounts/sample_id/". Do not change the directory structure as subsequent functions will look for files in this directory.
Character variable containing the CAMDAC installation path (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the number of cores for parallel processing
Logical value indicating whether this is a quick test run with data subsampling
One .fst file including methylation info at CpGs and BAF and depth of coverage at +SNPs for the ith subset of RRBS loci
+get_cluster_counts.RdCount CpGs within DMP annotations
+get_cluster_counts(dt)get_differential_methylation.Rdget_differential_methylation
get_differential_methylation(
+ patient_id,
+ sample_id,
+ sex,
+ normal_origin_proxy_id,
+ path,
+ path_to_CAMDAC,
+ build,
+ effect_size = 0.2,
+ prob = 0.99,
+ min_DMP_counts_in_DMR = 5,
+ min_consec_DMP_in_DMR = 4,
+ n_cores,
+ reseg = FALSE,
+ bulk = FALSE
+)Character variable containing the patient id number
Character variable with the tumour sample_id
Character variable with the patient expressed as "XX" for +female or "XY" for male.
Character variable with the sample ID of the normal to be used as a proxy for the tumour cell of origin in differential methylation analyses.
Character path variable pointing to the desired working +directory. This is where the output will be stored.
Character variable containing the path to the CAMDAC directory, including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical containing the minimum tumour-normal methylation difference (default is 0.2)
Numerical value representing the threshold for statistically +significant DMP (default is p=0.99)
Numerical value representing the number of +DMPs required in a DMR
Numerical value representing the number of +consecutive DMPs required in a DMR
Numerical value corresponding to the number of cores for parallel processing
Logical value should be set to FALSE. Multi-sample re-segmentation of +the copy number profiles will be available in future versions of CAMDAC.
Default is FALSE unless you want bulk DMP/DMR calls in addition +to CAMDAC pure tumour differential methylation analysis
Note: Annotations include: CGI (including shores and shelves); gene body (intragenic, 5'UTR, 3'UTR, intron, exon); promoter (2 kb upstream and 500 bp downstream of any UCSC annotated gene); enhancer (VISTA and FANTOM5 annotation)
Biologically significant DMPs, DMRs
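The two thresholds documented above, effect_size and prob, can be illustrated with a small Monte Carlo sketch. This is an assumption-laden illustration and not CAMDAC's DMP caller: it assumes read counts per CpG, uniform Beta(1, 1) priors, and independent tumour and normal posteriors. The names `dmp_posterior_prob` and `is_dmp` are hypothetical.

```r
set.seed(42)

# Hypothetical: posterior probability that tumour and normal rates differ,
# using Beta posteriors under a uniform prior (an assumption).
dmp_posterior_prob <- function(M_t, UM_t, M_n, UM_n, n_draws = 20000) {
  tumour <- rbeta(n_draws, M_t + 1, UM_t + 1)
  normal <- rbeta(n_draws, M_n + 1, UM_n + 1)
  delta <- tumour - normal
  # Probability mass on the dominant direction of change
  max(mean(delta > 0), mean(delta < 0))
}

# A site passes only if both the effect size and the probability
# thresholds are met (defaults taken from the documented signature).
is_dmp <- function(M_t, UM_t, M_n, UM_n, effect_size = 0.2, prob = 0.99) {
  m_t <- M_t / (M_t + UM_t)
  m_n <- M_n / (M_n + UM_n)
  p_diff <- dmp_posterior_prob(M_t, UM_t, M_n, UM_n)
  abs(m_t - m_n) >= effect_size && p_diff >= prob
}

strong_site <- is_dmp(45, 5, 5, 45)    # large, well-supported difference
weak_site   <- is_dmp(26, 24, 24, 26)  # difference below effect_size
```

Requiring both conditions keeps small but well-supported shifts, and large but poorly-supported ones, out of the call set.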
+get_msp1_fragments.Rdget msp1 fragments
+get_msp1_fragments(dt, build, path_to_CAMDAC, outfile)data.table object containing all covered CCGGs in the sample
Character. Either "hg19", "hg38", "GRCH37","GRCH38"
Character string containing the path to the CAMDAC dir, including dir name e.g. "~/CAMDAC/"
character string with output filename
get_pure_tumour_methylation.Rdget_pure_tumour_methylation
get_pure_tumour_methylation(
+ patient_id,
+ sample_id,
+ sex,
+ normal_infiltrates_proxy_id,
+ path,
+ path_to_CAMDAC,
+ build,
+ n_cores,
+ reseg = FALSE
+)Character variable containing the patient id number
Character variable with the (control or tumour) sample_id
Character variable with the patient expressed as "XX" for +female or "XY" for male.
Sample ID of the matched normal control
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions.
Character variable containing the path to the CAMDAC directory, including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the number of cores for parallel processing
Logical value should be set to FALSE. Multi-sample re-segmentation of +the copy number profiles will be available in future versions of CAMDAC.
Note: Annotations include: CGI (including shores and shelves); gene body (intragenic, 5'UTR, 3'UTR, intron, exon); promoter (2 kb upstream and 500 bp downstream of any UCSC annotated gene); enhancer (VISTA and FANTOM5 annotation)
CAMDAC purified tumour methylation rates
+get_reference_files.RdGet CAMDAC reference files from config
+get_reference_files(config, type_folder, glob = NULL)helper_camdac_pileup.RdCache existing CAMDAC results into a sub-directory so that the current ones can be +overwritten by the refitting pipeline +Decided this is unnecessary as the initial results were so wrong. +Exported only for development
+helper_camdac_pileup(bam_file, seg, loci_dt)
All functions

| Function | Description |
|---|---|
| | Set CAMDAC configuration |
| | Build CAMDAC sample object |
| | Manually assign output file to CAMDAC sample |
| | Bind SNPs |
| | Call CNA |
| | Call tumour-normal DMPs |
| | Call tumour-normal DMRs |
| | Count alleles |
| | Deconvolve methylation |
| | Make methylation |
| | Make SNPs |
| | Run ASCAT.m |
| | Run battenberg |
| | Download CAMDAC pipeline files |
| | Get CAMDAC reference files from config |
| | Parse ASCAT and Battenberg output directories to load CNA data |
| | Load allele count files |
| | Panel ASM from counts. Basic function to create an ASM methylation panel from allele count or ASM meth files. WARNING: In active development. |
| | Make CAMDAC methylation panel from a matrix of beta values |
| | Make CAMDAC methylation panel from allele counts. Methylation fractions are obtained by summing M and UM reads across samples |
| | CAMDAC analysis pipeline |
| | Preprocess a list of CamSample objects for ASM analysis |
intervalWidth_r.RdCalculate intervalWidth_r
+intervalWidth_r(lowTailPr, ICDFname, credMass, ...)is R's name for the inverse cumulative density function +of the distribution.
is the desired mass of the HDI region.
is passed to R's optimize function. The lower the tolerance, the longer the optimisation, but the higher the accuracy. tol=1e-4 gives values of the same accuracy as our max resolution.
Return value: Highest density interval (HDI) limits in a vector.
Example of use: to determine the HDI of a beta(30,12) distribution, type HDIofICDF( qbeta , shape1 = 30+1 , shape2 = 12+1 ). Notice that the parameters of the ICDFname must be explicitly named; e.g., HDIofICDF( qbeta , 30+1 , 12+1 ) does not work.
Adapted and corrected from Greg Snow's TeachingDemos package.
Source the function outside of the loop to speed up the code.
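The interval-width minimisation described above can be sketched with base R's `optimize`. This is a minimal sketch of the general technique (the narrowest interval containing the requested mass), not the package's own source; the name `hdi_of_icdf` and its argument names are illustrative.

```r
# Sketch: highest density interval from an inverse CDF (quantile function).
# icdf      : quantile function, e.g. qbeta
# cred_mass : desired mass of the HDI
# ...       : distribution parameters, which must be passed by name
hdi_of_icdf <- function(icdf, cred_mass = 0.95, tol = 1e-8, ...) {
  # Width of the interval that starts at a given lower tail probability
  interval_width <- function(low_tail_pr, ...) {
    icdf(cred_mass + low_tail_pr, ...) - icdf(low_tail_pr, ...)
  }
  # The HDI is the narrowest interval holding cred_mass
  opt <- optimize(interval_width, c(0, 1 - cred_mass), tol = tol, ...)
  low_tail_pr <- opt$minimum
  c(icdf(low_tail_pr, ...), icdf(cred_mass + low_tail_pr, ...))
}

hdi <- hdi_of_icdf(qbeta, shape1 = 30 + 1, shape2 = 12 + 1)
equal_tailed <- qbeta(c(0.025, 0.975), shape1 = 31, shape2 = 13)
```

For a skewed distribution the HDI is never wider than the equal-tailed interval of the same mass, which gives a quick sanity check.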
load_cna_data.RdSee the "annotate_copy_number" function. Loads copy number for a tumour sample from CAMDAC, from either ASCAT or Battenberg (bb). The result should contain: chrom, start, end, nA, nB, CN (total), seg_min and seg_max. This should also include the purity and ploidy. As a separate list? Note that seg_min and seg_max are duplicates of the start and end columns, required to keep track of the ASCAT segment positions after overlap. WARN: This should drop sex chromosomes, but that is not implemented. It should also drop CN=0 (homozygous deletion) regions.
+load_cna_data(tumour, config, data_type)load_panel_ac_files.RdLoad allele count files
+load_panel_ac_files(ac_files, cores = 5)Allele count files from CAMDAC
List of data tables for each allele counts file
+panel_asm_from_counts.RdPanel ASM from counts +Basic function to create an ASM methylation panel from allele count or ASM meth files +WARNING: In active development.
+panel_asm_from_counts(c1, c2)First ASM allele counts file to merge
Second ASM allele counts file to merge
panel_meth_from_beta.RdMake CAMDAC methylation panel from a matrix of beta values
+panel_meth_from_beta(
+ mat,
+ chrom,
+ start,
+ end,
+ cov,
+ props,
+ cores,
+ min_samples = 1,
+ max_sd = 1
+)Matrix of beta values. Rows are CpGs, columns are samples
Vector of chromosome names
Vector of CpG start positions
Vector of CpG end positions
Vector of coverage values to give each CpG site. If a matrix is provided, coverage is calculated as the sum of reads for each site.
Number of cores to use for calculating HDI
Minimum number of samples that must have a non-NA value for a CpG site to be included in panel
Maximum standard deviation of methylation for a site to be included in panel.
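The min_samples and max_sd filters described above can be illustrated on a small beta matrix. This is a sketch of the filtering idea only, and `filter_beta_matrix` is a hypothetical name. Treating the undefined standard deviation of single-sample sites as 0 is an assumption.

```r
# Hypothetical helper: keep CpG rows with enough non-NA samples and
# a methylation standard deviation at or below max_sd.
filter_beta_matrix <- function(mat, min_samples = 1, max_sd = 1) {
  n_obs <- rowSums(!is.na(mat))
  site_sd <- apply(mat, 1, sd, na.rm = TRUE)
  # sd is NA with fewer than two observations; treat as 0 (assumption)
  site_sd[is.na(site_sd)] <- 0
  keep <- n_obs >= min_samples & site_sd <= max_sd
  mat[keep, , drop = FALSE]
}

mat <- rbind(
  cpg1 = c(0.1, 0.2, NA),   # two samples, low variability
  cpg2 = c(0.0, 1.0, 0.5),  # three samples, high variability
  cpg3 = c(NA, NA, 0.9)     # a single sample
)
kept <- filter_beta_matrix(mat, min_samples = 2, max_sd = 0.3)
```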
panel_meth_from_counts.RdMake CAMDAC methylation panel from allele counts +Methylation fractions are obtained by summing M and UM reads across samples
+panel_meth_from_counts(
+ ac_files,
+ ac_props = NULL,
+ min_coverage = 3,
+ min_samples = 1,
+ max_sd = 1,
+ drop_snps = FALSE,
+ cores = 5
+)Allele count files from CAMDAC
Proportions of each sample to use in panel. If NULL, samples are weighted by their +total number of reads, which equals the sum of M and UM counts. If samples are NA, then +proportions are redistributed.
Minimum coverage for a sample's site to be included in panel
Minimum number of samples with coverage for a site to be included in panel
Maximum standard deviation of methylation for a site to be included in panel
Boolean. If TRUE, drop per-sample CG-SNPs (BAF < 0.1 or BAF > 0.9) from panel
Number of cores to use for calculating HDI
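Summing M and UM reads across samples, as described above, weights each sample by its read depth. A minimal base R sketch follows, assuming every sample is a data frame with aligned rows and columns `M` and `UM`. The function name and input layout are illustrative assumptions, not the package's file format.

```r
# Hypothetical helper: pool methylated (M) and unmethylated (UM) counts.
pool_methylation_counts <- function(samples, min_coverage = 3, min_samples = 1) {
  M  <- do.call(cbind, lapply(samples, function(s) s$M))
  UM <- do.call(cbind, lapply(samples, function(s) s$UM))
  # A sample contributes to a site only if it meets min_coverage there
  covered <- (M + UM) >= min_coverage
  M[!covered] <- 0
  UM[!covered] <- 0
  M_sum <- rowSums(M)
  UM_sum <- rowSums(UM)
  # Keep sites covered in at least min_samples samples
  keep <- rowSums(covered) >= min_samples
  panel <- data.frame(M = M_sum, UM = UM_sum, cov = M_sum + UM_sum,
                      m = M_sum / (M_sum + UM_sum))
  panel[keep, , drop = FALSE]
}

s1 <- data.frame(M = c(8, 1, 0), UM = c(2, 1, 10))
s2 <- data.frame(M = c(6, 0, 1), UM = c(4, 0, 9))
panel <- pool_methylation_counts(list(s1, s2))
```

In this toy input the second site has coverage below 3 in both samples, so it is dropped from the panel.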
pipeline.RdCAMDAC analysis pipeline
+pipeline(tumor, germline, infiltrates, origin, config)Tumor CamSample() object for deconvolution.
Patient-matched normal CamSample() object. May be NULL if tumor has CNA calls already.
Normal CamSample() as a proxy for infiltrating normal methylation.
Normal CamSample() representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
pipeline_rrbs.RdCall CAMDAC for a tumor and patient-matched normal sample
+pipeline_rrbs(tumor, germline, infiltrates, origin, config)Tumor CamSample object for deconvolution.
Patient-matched normal CamSample object. May be NULL if tumor has CNA calls already.
Normal CamSample as a proxy for infiltrating normal methylation.
Normal CamSample representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
pipeline_wgbs.RdRun CAMDAC WGBS analysis on a bulk tumor and patient-matched tissue-matched tumor-adjacent normal sample.
+pipeline_wgbs(
+ tumor,
+ germline = NULL,
+ infiltrates = NULL,
+ origin = NULL,
+ config
+)Tumor CamSample object for deconvolution.
Patient-matched normal CamSample object. May be NULL if tumor has CNA calls already.
Normal CamSample as a proxy for infiltrating normal methylation.
Normal CamSample representing cell of origin for tumor-normal differential methylation.
Configuration built with CamConfig().
plot_2d_density.Rdplot_2d_density
+plot_2d_density(dt, path)Data table with methylation information per CpG
Character path variable pointing to the desired working directory. +This is where the output will be stored and should be constant for all CAMDAC functions.
plot_BAF_and_LogR.RdPlot BAF and logR profiles with ggplot
+plot_BAF_and_LogR(dt, outfile, downsample = 100000)data.frame with methylation info
character string with output pdf filename. Saves a pdf with methylation rate distribution, biases at polymorphic and non-polymorphic CG/CCGG and coverage distribution
plot_SNP_info.Rdplot_SNP_info plots SNP QC
plot_SNP_info(dt, outfile, min)data.table with SNP info
character string with output pdf filename
plot_methylation_info.RdCreates table grob in format that is most common for my usage.
+plot_methylation_info(df_sample, outfile)data.frame with methylation info
character string with output pdf filename
Data.table that the grob will be made out of
Title for display
Fontsize for title. Default is 14 (goes well with my_theme)
pdf w/ methylation rate distribution, biases at polymorphic and non-polymorphic CG/CCGG and coverage distribution
+plot_methylation_info returns the df_sample with annotated q-value for each CpG
plot_methylation_info_with_anno.RdPlot methylation information
+plot_methylation_info_with_anno(dt, path, bulk)Data table with methylation information per CpG
Character path variable pointing to the desired working directory.
Logical determining whether the bulk or purified tumour is to be plotted
plot_normal_SNP_info.Rdplot_normal_SNP_info plots SNP QC
+plot_normal_SNP_info(dt, outfile, min)data.table with SNP info
character string with output pdf filename
preprocess_asm.RdPreprocess a list of CamSample objects for ASM analysis
+preprocess_asm(sample_list, config)List of CamSample objects.
CamConfig object.
preprocess_wgbs.RdPreprocess a list of CamSample objects for analysis
+preprocess_wgbs(sample_list, config)List of CamSample objects.
CamConfig object.
remove_low_cov_singletons.RdRemove low-coverage singleton outliers
+remove_low_cov_singletons(dt_sample_SNPs, min)round2.RdRound numerical values to 'n' digits
+round2(x, digits)Numerical vector containing the numbers to round
Numerical value representing the number of decimal digits to retain
rounded numerical vector
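The documentation does not state how ties are handled. One common reason to define a rounding helper in R is that base `round()` follows the IEC 60559 convention (ties go to the even digit). The sketch below shows a round-half-away-from-zero variant under that assumption. `round_half_away` is a hypothetical name and may not match what round2 actually does.

```r
# Hypothetical sketch: round half away from zero to `digits` decimals.
round_half_away <- function(x, digits = 0) {
  scaled <- abs(x) * 10^digits
  # The small epsilon guards against representation error just below .5
  sign(x) * trunc(scaled + 0.5 + sqrt(.Machine$double.eps)) / 10^digits
}

base_result <- round(2.5)            # base R: ties go to the even digit
away_result <- round_half_away(2.5)  # ties go away from zero
```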
+run_ASCAT.m.Rdrun_ASCAT.m
run_ASCAT.m(
+ patient_id,
+ sample_id,
+ sex,
+ patient_matched_normal_id = NULL,
+ path,
+ path_to_CAMDAC,
+ build,
+ min_normal = 10,
+ min_tumour = 1,
+ n_cores = 1,
+ reference_panel_coverage = NULL
+)Character variable containing the patient id number
Character variable with the (control or tumour) sample_id
Character variable with the patient expressed as "XX" for female +or "XY" for male. +This is important for copy number profiling. If sex is unknown, put "XY" for now, +then look at the allelic imbalance (BAF) on X in the germline outside pseudo- +autosomal regions. If there are little to no heterozygous SNPs, the sample is likely male.
Character variable with the sample ID of the matched normal control
Character path variable pointing to the desired working directory. This is where the output will be stored. IMPORTANT: The function output directory will be in the path variable working directory under "./Copy_number/sample_id/".
Character variable containing the path to the CAMDAC dir, including dir name (e.g. "/path/to/CAMDAC/").
Character variable corresponding to the reference genome used for alignment. +CAMDAC is compatible with "hg19", "hg38", "GRCH37","GRCH38".
Numerical value corresponding to the minimum counts for germline SNPs to be included (default: 10)
Numerical value corresponding to the minimum counts in the tumour sample for germline SNPs to be included (default: 1)
Numerical value corresponding to the number of cores for parallel processing
Path to the reference panel for the coverage.
Three text files with all the CpG loci and their SNP and/or CpG methylation info
run_methylation_data_processing.RdFilter bulk tumour and normal methylation data, get methylation rate highest density interval (HDI) and plot raw methylation info
run_methylation_data_processing(
+ patient_id,
+ sample_id,
+ normal_infiltrates_proxy_id,
+ normal_origin_proxy_id,
+ path,
+ min_normal = 10,
+ min_tumour = 3,
+ n_cores,
+ reference_panel_normal_infiltrates = NULL,
+ reference_panel_normal_origin = NULL
+)Character variable containing the patient ID
Character variable with the (control or tumour) sample ID
Character variable with the sample ID of +the tissue-matched normal acting as proxy for the tumour infiltrating +normal cells. Ideally, this is a patient and tissue-matched tumour adjacent normal sample.
Character variable with the sample ID +of the normal to be used as a proxy for the tumour cell of origin in +differential methylation analyses.
Character path variable pointing to the desired working directory. +This is where the output will be stored.
Numerical value corresponding to the minimum counts threshold for the normal CpGs to be included
Numerical value corresponding to the minimum counts threshold for the tumour sample CpGs to be included
Numerical value corresponding to the number of cores for parallel processing
Default is NULL. Character string with the complete +path to a reference methylation profile for the tumour normal infiltrates as a .fst file.
Default is NULL. Character string with the complete +path to your reference methylation profile for the tumour cell of origin as a .fst file.
+If a patient-matched proxy for the normal infiltrates and/or the normal cell of origin is not +available, a reference panel may be constructed from different individuals and used as a substitute.
+The reference samples should be at the very least sex-matched.
+The reference should be saved as a .fst file with the following columns:
+CHR start end M_n UM_n m_n cov_n
+
where each row is a CpG or CCpGG with coordinates CHR:start-end.
The start and end columns correspond to the 5'-C and 3'-G coordinate, respectively.
M_n is the number of reads supporting the methylated allele.
UM_n is the number of reads supporting the unmethylated allele.
m_n is the normal methylation rate (M_n / (M_n+UM_n)).
cov_n is the total count of methylation-informative reads at the CpG (M_n+UM_n).
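A reference panel with the column layout above could be assembled as follows. This is a sketch with made-up counts: the chromosome naming ("1" rather than "chr1") and the coordinates are assumptions, only plain CpGs are shown (so end = start + 1), and the write step runs only if the fst package is installed.

```r
# Toy counts for three CpGs (illustrative values, not real data)
panel <- data.frame(
  CHR   = c("1", "1", "2"),
  start = c(10468L, 10470L, 20000L),  # 5'-C coordinate
  M_n   = c(18L, 2L, 10L),            # reads supporting the methylated allele
  UM_n  = c(2L, 18L, 10L)             # reads supporting the unmethylated allele
)
panel$end   <- panel$start + 1L       # 3'-G coordinate of a CpG
panel$cov_n <- panel$M_n + panel$UM_n # methylation-informative reads
panel$m_n   <- panel$M_n / panel$cov_n

# Column order as listed in the documentation
panel <- panel[, c("CHR", "start", "end", "M_n", "UM_n", "m_n", "cov_n")]

# Save as .fst when the fst package is available
if (requireNamespace("fst", quietly = TRUE)) {
  fst::write_fst(panel, file.path(tempdir(), "reference_panel.fst"))
}
```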
GRanges object in .RData file
+sort_genomic_dt.Rdsort_genomic_dt
+Sort a data table with genomic coordinates
+sort_genomic_dt(dt, with_chr = F)An object that is a data.table
A boolean to indicate whether the chrom field has UCSC (TRUE) or NCBI (FALSE) format
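Sorting by genomic coordinate needs chromosomes ordered naturally (1, 2, ..., 22, X, Y) rather than alphabetically. A base R sketch of that idea follows. The column names `chrom`, `start` and `end`, and the function name `sort_genomic_df`, are assumptions for illustration and may differ from what sort_genomic_dt expects.

```r
# Hypothetical helper: order rows by chromosome, then start, then end.
sort_genomic_df <- function(df, with_chr = FALSE) {
  chrom_levels <- c(1:22, "X", "Y")
  # UCSC style prefixes names with "chr"; NCBI style does not
  if (with_chr) {
    chrom_levels <- paste0("chr", chrom_levels)
  }
  chrom_rank <- match(df$chrom, chrom_levels)
  df[order(chrom_rank, df$start, df$end), , drop = FALSE]
}

df <- data.frame(
  chrom = c("10", "2", "X", "2"),
  start = c(5, 300, 1, 100),
  end   = c(6, 301, 2, 101)
)
sorted <- sort_genomic_df(df, with_chr = FALSE)
```

Chromosome names outside the listed levels get an NA rank and are placed last by `order`.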
split_segments_gr.RdSplit genome into segments for allele counting
+split_segments_gr(segments_file, n_seg_split)An RDS file containing a GRanges object with each chromosome region to split
An integer to split each chromosome segment
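Splitting a region into a fixed number of contiguous pieces can be sketched with plain coordinates. This illustrates the idea only, using numeric start and end positions rather than a GRanges object. `split_segment` is a hypothetical helper, and the exact boundaries chosen by split_segments_gr may differ.

```r
# Hypothetical helper: split the closed interval [start, end] into
# n_seg_split contiguous, non-overlapping pieces of near-equal width.
split_segment <- function(start, end, n_seg_split) {
  breaks <- floor(seq(start, end + 1, length.out = n_seg_split + 1))
  data.frame(
    start = breaks[-length(breaks)],
    end   = breaks[-1] - 1
  )
}

even_split   <- split_segment(1, 100, 4)
uneven_split <- split_segment(1, 10, 3)
```

Segments shorter than `n_seg_split` bases would yield empty pieces, so a real implementation would need to guard against that case.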