|
| 1 | +# Pangenome |
| 2 | + |
| 3 | +Sentieon Pangenome is a pipeline for alignment and variant calling from short-read DNA sequence data using pangenome graphs. The Pangenome pipeline leverages graph-based reference representations to improve alignment and variant calling accuracy, particularly in complex genomic regions with high sequence diversity. |
| 4 | + |
| 5 | +The pipeline uses the vg toolkit for pangenome alignment and structural variant calling. Sentieon tools are used for small variant calling, CNV detection, segdup variant calling, preprocessing and metrics collection. Specialized tools are used for HLA/KIR genotyping and repeat expansion analysis. |
| 6 | + |
| 7 | +The pipeline accepts as input unaligned reads in FASTQ format and outputs variants in VCF format, aligned reads in BAM format, and comprehensive quality control metrics. |
| 8 | + |
| 9 | +Pangenome is implemented using the Sentieon software package, which requires a valid license for use. Please contact info@sentieon.com for access to the Sentieon software and an evaluation license. |
| 10 | + |
| 11 | +## Prerequisites |
| 12 | + |
| 13 | +- Sentieon software package version 202503 or higher. |
| 14 | +- [vg] toolkit for pangenome alignment operations. |
| 15 | +- [KMC] version 3 or higher for k-mer counting. |
| 16 | +- [samtools] version 1.16 or higher for alignment operations. |
| 17 | +- [bcftools] version 1.10 or higher for VCF operations. |
| 18 | +- [MultiQC] version 1.18 or higher for metrics report generation. |
| 19 | +- [T1K] for HLA and KIR genotyping (optional). |
| 20 | +- [ExpansionHunter] for repeat expansion calling (optional). |
| 21 | +- [segdup-caller] for segmental duplication variant calling (optional). |
| 22 | + |
| 23 | +The executables will be accessed through the user's `PATH` environment variable. |
| 24 | + |
| 25 | +## Input data requirements |
| 26 | + |
| 27 | +### The Reference genome |
| 28 | + |
| 29 | +Pangenome will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present. |
| 30 | + |
| 31 | +### Pangenome graph files |
| 32 | + |
| 33 | +The pipeline requires several pangenome graph files: |
| 34 | +- **GBZ file**: The pangenome graph in GBZ format |
| 35 | +- **Haplotype file**: Haplotype information for the pangenome |
| 36 | +- **Snarls file**: VG snarls file for the GBZ file (created using `vg snarls`) |
| 37 | +- **XG file**: VG XG index file for the GBZ file (created using `vg convert`) |
| 38 | + |
| 39 | +### Model bundle |
| 40 | + |
| 41 | +A Sentieon model bundle containing machine learning models for variant calling is required. Model bundle files can be found in the [sentieon-models] repository. |
| 42 | + |
| 43 | +## Usage |
| 44 | + |
| 45 | +### Pangenome alignment and variant calling from FASTQ |
| 46 | + |
| 47 | +A single command is run to align reads to a pangenome graph, call small variants, structural variants, and CNVs: |
| 48 | + |
| 49 | +```sh |
| 50 | +sentieon-cli pangenome [-h] \ |
| 51 | + -r REFERENCE \ |
| 52 | + --gbz GBZ \ |
| 53 | + --hapl HAPL \ |
| 54 | + --snarls SNARLS \ |
| 55 | + --xg XG \ |
| 56 | + -m MODEL_BUNDLE \ |
| 57 | + --r1_fastq R1_FASTQ ... \ |
| 58 | + --r2_fastq R2_FASTQ ... \ |
| 59 | + --readgroups READGROUPS ... \ |
| 60 | + [-d DBSNP] \ |
| 61 | + [--known_sites KNOWN_SITES ...] \ |
| 62 | + [-t CORES] \ |
| 63 | + [--kmer_memory KMER_MEMORY] \ |
| 64 | + [--expansion_catalog EXPANSION_CATALOG] \ |
| 65 | + [--t1k_hla_seq T1K_HLA_SEQ] \ |
| 66 | + [--t1k_hla_coord T1K_HLA_COORD] \ |
| 67 | + [--t1k_kir_seq T1K_KIR_SEQ] \ |
| 68 | + [--t1k_kir_coord T1K_KIR_COORD] \ |
| 69 | + [--segdup_caller_genes SEGDUP_CALLER_GENES] \ |
| 70 | + [--dry_run] \ |
| 71 | + sample.vcf.gz |
| 72 | +``` |
| 73 | + |
| 74 | +The Pangenome pipeline requires the following arguments: |
| 75 | +- `-r REFERENCE`: the location of the reference FASTA file. A reference fasta index, ".fai" file, is also required. |
| 76 | +- `--gbz GBZ`: the pangenome graph file in GBZ format. |
| 77 | +- `--hapl HAPL`: the haplotype file for the pangenome. |
| 78 | +- `--snarls SNARLS`: the vg snarls file for the GBZ file. |
| 79 | +- `--xg XG`: the XG file for the GBZ file. |
| 80 | +- `-m MODEL_BUNDLE`: the location of the model bundle containing DNAscope and CNVscope models. |
| 81 | +- `--r1_fastq R1_FASTQ`: the R1 input FASTQ. Can be used with multiple files. |
| 82 | +- `--r2_fastq R2_FASTQ`: the R2 input FASTQ. Can be used with multiple files. |
| 83 | +- `--readgroups READGROUPS`: readgroup information for each FASTQ. An example argument is, `--readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"`. |
| 84 | +- `sample.vcf.gz`: the location of the output VCF file for small variants. The pipeline requires the output file end with the suffix, ".vcf.gz". The file path without the suffix will be used as the basename for other output files. |
| 85 | + |
| 86 | +The Pangenome pipeline accepts the following optional arguments: |
| 87 | +- `-d DBSNP`: the location of the Single Nucleotide Polymorphism database (dbSNP) VCF used to label known variants. A VCF index file is required. |
| 88 | +- `--known_sites KNOWN_SITES`: known sites for indel realignment in VCF format. VCF index files are required. |
| 89 | +- `-t CORES`: number of computing threads/cores to use. Default is all available cores. |
| 90 | +- `--kmer_memory KMER_MEMORY`: memory limit for KMC k-mer counting in GB. Default is 128 GB. |
| 91 | +- `--expansion_catalog EXPANSION_CATALOG`: ExpansionHunter variant catalog for repeat expansion calling. |
| 92 | +- `--t1k_hla_seq T1K_HLA_SEQ`: DNA HLA sequence FASTA file for T1K HLA calling. Requires `--t1k_hla_coord`. |
| 93 | +- `--t1k_hla_coord T1K_HLA_COORD`: DNA HLA coordinate FASTA file for T1K HLA calling. Requires `--t1k_hla_seq`. |
| 94 | +- `--t1k_kir_seq T1K_KIR_SEQ`: DNA KIR sequence FASTA file for T1K KIR calling. Requires `--t1k_kir_coord`. |
| 95 | +- `--t1k_kir_coord T1K_KIR_COORD`: DNA KIR coordinate FASTA file for T1K KIR calling. Requires `--t1k_kir_seq`. |
| 96 | +- `--segdup_caller_genes SEGDUP_CALLER_GENES`: Genes for segmental duplication calling. Example: 'CFH,CFHR3,CYP11B1,CYP2D6,GBA,NCF1,PMS2,SMN1,STRC'. |
| 97 | +- `-h`: print the command-line help and exit. |
| 98 | +- `--dry_run`: print the pipeline commands, but do not actually execute them. |
| 99 | + |
| 100 | +## Pipeline workflow |
| 101 | + |
| 102 | +The Pangenome pipeline consists of multiple stages executed across two DAGs: |
| 103 | + |
| 104 | +### First DAG |
| 105 | +1. **K-mer counting**: `kmc` counts k-mers from input FASTQ files |
| 106 | +2. **Sample-specific pangenome**: `vg haplotypes` creates a sample-specific pangenome using k-mer frequencies |
| 107 | +3. **Pangenome alignment**: `vg giraffe` aligns reads to the sample-specific pangenome |
| 108 | +4. **Read support computation**: `vg pack` computes read support for variants |
| 109 | +5. **Structural variant calling**: `vg call` identifies structural variants from read support |
| 110 | +6. **Linear alignment**: `vg surject` projects pangenome alignments back to linear reference |
| 111 | +7. **Duplicate marking**: Sentieon `Dedup` marks duplicate reads and computes metrics |
| 112 | +8. **Indel realignment**: Sentieon `Realigner` performs indel realignment |
| 113 | +9. **Small variant calling**: Sentieon `DNAscope` calls small variants (SNVs/indels) |
| 114 | +10. **Ploidy estimation**: Estimation of sample sex and ploidy for downstream processing |
| 115 | +11. **HLA/KIR genotyping**: T1K genotypes HLA and KIR loci (optional) |
| 116 | +12. **Segmental duplication calling**: segdup-caller identifies variants in segmental duplications (optional) |
| 117 | + |
| 118 | +### Second DAG |
| 119 | +1. **CNV calling**: Sentieon CNVscope calls copy number variants with sex-specific processing |
| 120 | +2. **Repeat expansion calling**: ExpansionHunter identifies repeat expansions (optional) |
| 121 | + |
| 122 | +## Pipeline output |
| 123 | + |
| 124 | +### List of output files |
| 125 | + |
| 126 | +The following files are output when processing WGS FASTQ: |
| 127 | +- `sample.vcf.gz`: Small variant calls (SNVs and indels) from DNAscope. |
| 128 | +- `sample_pangenome-aligned.bam`: Aligned, coordinate-sorted and duplicate-marked read data projected to the linear reference. |
| 129 | +- `sample_svs.vcf.gz`: Structural variant calls from vg call. |
| 130 | +- `sample_cnv.vcf.gz`: Copy number variant calls from CNVscope. |
| 131 | +- `sample_ploidy.json`: Estimated sample sex and ploidy data. |
| 132 | +- `sample_metrics`: A directory containing QC metrics for the analyzed sample. |
| 133 | + - `sample_metrics/{sample}.txt.alignment_stat.txt`: Metrics from the AlignmentStat algo. |
| 134 | + - `sample_metrics/{sample}.txt.base_distribution_by_cycle.txt`: Metrics from the BaseDistributionByCycle algo. |
| 135 | + - `sample_metrics/{sample}.txt.dedup_metrics.txt`: Metrics from the Dedup algo. |
| 136 | + - `sample_metrics/{sample}.txt.gc_bias*`: Metrics from the GCBias algo. |
| 137 | + - `sample_metrics/{sample}.txt.insert_size.txt`: Metrics from the InsertSizeMetricAlgo algo. |
| 138 | + - `sample_metrics/{sample}.txt.mean_qual_by_cycle.txt`: Metrics from the MeanQualityByCycle algo. |
| 139 | + - `sample_metrics/{sample}.txt.qual_distribution.txt`: Metrics from the QualDistribution algo. |
| 140 | + - `sample_metrics/{sample}.txt.wgs.txt`: Metrics from the WgsMetricsAlgo algo. |
| 141 | + - `sample_metrics/multiqc_report.html`: Collected QC metrics aggregated by MultiQC. |
| 142 | + |
| 143 | +### Optional output files |
| 144 | + |
| 145 | +When optional analyses are enabled: |
| 146 | +- `sample_hla/`: HLA genotyping results from T1K (when HLA files are provided). |
| 147 | +- `sample_kir/`: KIR genotyping results from T1K (when KIR files are provided). |
| 148 | +- `sample_segdups/`: Segmental duplication variant calls (when genes are specified). |
| 149 | +- `sample_expansion*`: Repeat expansion calls from ExpansionHunter (when a catalog is provided). |
| 150 | + |
| 151 | +[vg]: https://github.com/vgteam/vg |
| 152 | +[KMC]: https://github.com/refresh-bio/KMC |
| 153 | +[samtools]: https://www.htslib.org/ |
| 154 | +[bcftools]: http://samtools.github.io/bcftools/bcftools.html |
| 155 | +[MultiQC]: https://multiqc.info/ |
| 156 | +[T1K]: https://github.com/mourisl/T1K |
| 157 | +[ExpansionHunter]: https://github.com/Illumina/ExpansionHunter |
| 158 | +[segdup-caller]: https://github.com/Sentieon/segdup-caller |
| 159 | +[sentieon-models]: https://github.com/Sentieon/sentieon-models |
| 160 | + |
| 161 | +## Example data processing |
| 162 | + |
| 163 | +### Obtaining the pangenome graph files |
| 164 | + |
| 165 | +Download vg the public human pangenome from the HPRC repository |
| 166 | + |
| 167 | +```sh |
| 168 | +curl -LO "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz" |
| 169 | +curl -LO "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl" |
| 170 | +``` |
| 171 | + |
| 172 | +Generate a snarls file from the .gbz with `vg snarls`. |
| 173 | +```sh |
| 174 | +vg snarls hprc-v1.1-mc-grch38.gbz > hprc-v1.1-mc-grch38.snarls |
| 175 | +``` |
| 176 | + |
| 177 | +Generate a xg file from the .gbz with `vg convert`. |
| 178 | +```sh |
| 179 | +vg convert -x --drop-haplotypes hprc-v1.1-mc-grch38.gbz > hprc-v1.1-mc-grch38.xg |
| 180 | +``` |
| 181 | + |
| 182 | +### Download the hg38 reference genome |
| 183 | + |
| 184 | +Download the reference fasta file and samtools faidx file |
| 185 | +```sh |
| 186 | +curl -L 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz' | \ |
| 187 | + gzip -dc > GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna |
| 188 | +curl -LO 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai' |
| 189 | +``` |
| 190 | + |
| 191 | +### Download an expansion catalog of clinically relevant expansions |
| 192 | + |
| 193 | +```sh |
| 194 | +curl -L 'https://github.com/Illumina/ExpansionHunter/raw/refs/tags/v5.0.0/variant_catalog/hg38/variant_catalog.json' > hg38_variant_catalog.json |
| 195 | +``` |
| 196 | + |
| 197 | +### Download t1k fasta files |
| 198 | + |
| 199 | +Download the two databases |
| 200 | +```sh |
| 201 | +perl t1k-build.pl -o hlaidx --download IPD-IMGT/HLA |
| 202 | +perl t1k-build.pl -o kiridx --download IPD-KIR |
| 203 | +``` |
| 204 | + |
| 205 | +Download the gencode gtf file for hg38 |
| 206 | +```sh |
| 207 | +curl -L 'https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz' | \ |
| 208 | + gzip -dc > gencode.v38.annotation.gtf |
| 209 | +``` |
| 210 | + |
| 211 | +Bulid the coordinate file |
| 212 | +```sh |
| 213 | +perl t1k-build.pl -o hlaidx -d hlaidx/hla.dat -g gencode.v38.annotation.gtf |
| 214 | +perl t1k-build.pl -o kiridx -d kiridx/kir.dat -g gencode.v38.annotation.gtf |
| 215 | +``` |
| 216 | + |
| 217 | +### Download a 30x WGS fastq |
| 218 | + |
| 219 | +```sh |
| 220 | +curl -LO 'https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R1.fastq.gz' |
| 221 | +curl -LO 'https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R2.fastq.gz' |
| 222 | +``` |
| 223 | + |
| 224 | +### Run the pangenome pipeline |
| 225 | + |
| 226 | +```sh |
| 227 | +sentieon-cli -v pangenome \ |
| 228 | + --reference GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \ |
| 229 | + --gbz hprc-v1.1-mc-grch38.gbz \ |
| 230 | + --hapl hprc-v1.1-mc-grch38.hapl \ |
| 231 | + --snarls hprc-v1.1-mc-grch38.snarls \ |
| 232 | + --xg hprc-v1.1-mc-grch38.xg \ |
| 233 | + --model_bundle pangenome.bundle \ |
| 234 | + --r1_fastq HG002.novaseq.pcr-free.30x.R1.fastq.gz \ |
| 235 | + --r2_fastq HG002.novaseq.pcr-free.30x.R2.fastq.gz \ |
| 236 | + --readgroups '@RG\tID:HG002-1\tSM:HG002\tPL:ILLUMINA' \ |
| 237 | + --expansion_catalog hg38_variant_catalog.json \ |
| 238 | + --segdup_caller_genes 'CFH,CFHR3,CYP11B1,CYP2D6,GBA,NCF1,PMS2,SMN1,STRC' \ |
| 239 | + --t1k_hla_seq hlaidx/data_dna_seq.fa \ |
| 240 | + --t1k_hla_coord hlaidx/data_dna_coord.fa \ |
| 241 | + --t1k_kir_seq kiridx/data_dna_seq.fa \ |
| 242 | + --t1k_kir_coord kiridx/data_dna_coord.fa \ |
| 243 | + HG002_pangenome_analysis.vcf.gz |
| 244 | +``` |
0 commit comments