Skip to content

Commit 5429ff9

Browse files
authored
Merge pull request #15 from Sentieon/dev
Merge Dev changes
2 parents 583e8b5 + 1e44e00 commit 5429ff9

File tree

12 files changed

+2161
-6
lines changed

12 files changed

+2161
-6
lines changed

.gitignore

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,9 @@ __pycache__/
33
dist/
44
.vscode/
55
secrets.env
6-
.*.swp
6+
.*.swp
7+
8+
# Other files
9+
notes.txt
10+
.secrets
11+
.DS_Store

README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,7 @@ The `sentieon-cli` supports the following global arguments:
5353
- [**DNAscope**](docs/dnascope.md) - DNAscope pipeline implementation for germline SNV and indel calling from short read data.
5454
- [**DNAscope LongRead**](docs/dnascope-longread.md) - DNAscope LongRead pipeline implementations for germline SNV and indel calling from long read data.
5555
- [**DNAscope Hybrid**](docs/dnascope-hybrid.md) - DNAscope short-long-hybrid pipeline.
56+
- [**Pangenome**](docs/pangenome.md) - DNAscope short-read alignment and variant calling using pangenome graphs.
5657

5758
## License
5859
Unless otherwise indicated, files in this repository are licensed under a BSD 2-Clause License.

docs/pangenome.md

Lines changed: 244 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,244 @@
1+
# Pangenome
2+
3+
Sentieon Pangenome is a pipeline for alignment and variant calling from short-read DNA sequence data using pangenome graphs. The Pangenome pipeline leverages graph-based reference representations to improve alignment and variant calling accuracy, particularly in complex genomic regions with high sequence diversity.
4+
5+
The pipeline uses the vg toolkit for pangenome alignment and structural variant calling. Sentieon tools are used for small variant calling, CNV detection, segdup variant calling, preprocessing and metrics collection. Specialized tools are used for HLA/KIR genotyping and repeat expansion analysis.
6+
7+
The pipeline accepts as input unaligned reads in FASTQ format and outputs variants in VCF format, aligned reads in BAM format, and comprehensive quality control metrics.
8+
9+
Pangenome is implemented using the Sentieon software package, which requires a valid license for use. Please contact info@sentieon.com for access to the Sentieon software and an evaluation license.
10+
11+
## Prerequisites
12+
13+
- Sentieon software package version 202503 or higher.
14+
- [vg] toolkit for pangenome alignment operations.
15+
- [KMC] version 3 or higher for k-mer counting.
16+
- [samtools] version 1.16 or higher for alignment operations.
17+
- [bcftools] version 1.10 or higher for VCF operations.
18+
- [MultiQC] version 1.18 or higher for metrics report generation.
19+
- [T1K] for HLA and KIR genotyping (optional).
20+
- [ExpansionHunter] for repeat expansion calling (optional).
21+
- [segdup-caller] for segmental duplication variant calling (optional).
22+
23+
The executables will be accessed through the user's `PATH` environment variable.
24+
25+
## Input data requirements
26+
27+
### The Reference genome
28+
29+
Pangenome will call variants present in the sample relative to a high quality reference genome sequence. Besides the reference genome file, a samtools fasta index file (.fai) needs to be present.
30+
31+
### Pangenome graph files
32+
33+
The pipeline requires several pangenome graph files:
34+
- **GBZ file**: The pangenome graph in GBZ format
35+
- **Haplotype file**: Haplotype information for the pangenome
36+
- **Snarls file**: VG snarls file for the GBZ file (created using `vg snarls`)
37+
- **XG file**: VG XG index file for the GBZ file (created using `vg convert`)
38+
39+
### Model bundle
40+
41+
A Sentieon model bundle containing machine learning models for variant calling is required. Model bundle files can be found in the [sentieon-models] repository.
42+
43+
## Usage
44+
45+
### Pangenome alignment and variant calling from FASTQ
46+
47+
A single command is run to align reads to a pangenome graph, call small variants, structural variants, and CNVs:
48+
49+
```sh
50+
sentieon-cli pangenome [-h] \
51+
-r REFERENCE \
52+
--gbz GBZ \
53+
--hapl HAPL \
54+
--snarls SNARLS \
55+
--xg XG \
56+
-m MODEL_BUNDLE \
57+
--r1_fastq R1_FASTQ ... \
58+
--r2_fastq R2_FASTQ ... \
59+
--readgroups READGROUPS ... \
60+
[-d DBSNP] \
61+
[--known_sites KNOWN_SITES ...] \
62+
[-t CORES] \
63+
[--kmer_memory KMER_MEMORY] \
64+
[--expansion_catalog EXPANSION_CATALOG] \
65+
[--t1k_hla_seq T1K_HLA_SEQ] \
66+
[--t1k_hla_coord T1K_HLA_COORD] \
67+
[--t1k_kir_seq T1K_KIR_SEQ] \
68+
[--t1k_kir_coord T1K_KIR_COORD] \
69+
[--segdup_caller_genes SEGDUP_CALLER_GENES] \
70+
[--dry_run] \
71+
sample.vcf.gz
72+
```
73+
74+
The Pangenome pipeline requires the following arguments:
75+
- `-r REFERENCE`: the location of the reference FASTA file. A reference fasta index, ".fai" file, is also required.
76+
- `--gbz GBZ`: the pangenome graph file in GBZ format.
77+
- `--hapl HAPL`: the haplotype file for the pangenome.
78+
- `--snarls SNARLS`: the vg snarls file for the GBZ file.
79+
- `--xg XG`: the XG file for the GBZ file.
80+
- `-m MODEL_BUNDLE`: the location of the model bundle containing DNAscope and CNVscope models.
81+
- `--r1_fastq R1_FASTQ`: the R1 input FASTQ. Can be used with multiple files.
82+
- `--r2_fastq R2_FASTQ`: the R2 input FASTQ. Can be used with multiple files.
83+
- `--readgroups READGROUPS`: readgroup information for each FASTQ. An example argument is, `--readgroups "@RG\tID:HG002-1\tSM:HG002\tLB:HG002-LB-1\tPL:ILLUMINA"`.
84+
- `sample.vcf.gz`: the location of the output VCF file for small variants. The pipeline requires the output file end with the suffix, ".vcf.gz". The file path without the suffix will be used as the basename for other output files.
85+
86+
The Pangenome pipeline accepts the following optional arguments:
87+
- `-d DBSNP`: the location of the Single Nucleotide Polymorphism database (dbSNP) VCF used to label known variants. A VCF index file is required.
88+
- `--known_sites KNOWN_SITES`: known sites for indel realignment in VCF format. VCF index files are required.
89+
- `-t CORES`: number of computing threads/cores to use. Default is all available cores.
90+
- `--kmer_memory KMER_MEMORY`: memory limit for KMC k-mer counting in GB. Default is 128 GB.
91+
- `--expansion_catalog EXPANSION_CATALOG`: ExpansionHunter variant catalog for repeat expansion calling.
92+
- `--t1k_hla_seq T1K_HLA_SEQ`: DNA HLA sequence FASTA file for T1K HLA calling. Requires `--t1k_hla_coord`.
93+
- `--t1k_hla_coord T1K_HLA_COORD`: DNA HLA coordinate FASTA file for T1K HLA calling. Requires `--t1k_hla_seq`.
94+
- `--t1k_kir_seq T1K_KIR_SEQ`: DNA KIR sequence FASTA file for T1K KIR calling. Requires `--t1k_kir_coord`.
95+
- `--t1k_kir_coord T1K_KIR_COORD`: DNA KIR coordinate FASTA file for T1K KIR calling. Requires `--t1k_kir_seq`.
96+
- `--segdup_caller_genes SEGDUP_CALLER_GENES`: Genes for segmental duplication calling. Example: 'CFH,CFHR3,CYP11B1,CYP2D6,GBA,NCF1,PMS2,SMN1,STRC'.
97+
- `-h`: print the command-line help and exit.
98+
- `--dry_run`: print the pipeline commands, but do not actually execute them.
99+
100+
## Pipeline workflow
101+
102+
The Pangenome pipeline consists of multiple stages executed across two DAGs:
103+
104+
### First DAG
105+
1. **K-mer counting**: `kmc` counts k-mers from input FASTQ files
106+
2. **Sample-specific pangenome**: `vg haplotypes` creates a sample-specific pangenome using k-mer frequencies
107+
3. **Pangenome alignment**: `vg giraffe` aligns reads to the sample-specific pangenome
108+
4. **Read support computation**: `vg pack` computes read support for variants
109+
5. **Structural variant calling**: `vg call` identifies structural variants from read support
110+
6. **Linear alignment**: `vg surject` projects pangenome alignments back to linear reference
111+
7. **Duplicate marking**: Sentieon `Dedup` marks duplicate reads and computes metrics
112+
8. **Indel realignment**: Sentieon `Realigner` performs indel realignment
113+
9. **Small variant calling**: Sentieon `DNAscope` calls small variants (SNVs/indels)
114+
10. **Ploidy estimation**: Estimation of sample sex and ploidy for downstream processing
115+
11. **HLA/KIR genotyping**: T1K genotypes HLA and KIR loci (optional)
116+
12. **Segmental duplication calling**: segdup-caller identifies variants in segmental duplications (optional)
117+
118+
### Second DAG
119+
1. **CNV calling**: Sentieon CNVscope calls copy number variants with sex-specific processing
120+
2. **Repeat expansion calling**: ExpansionHunter identifies repeat expansions (optional)
121+
122+
## Pipeline output
123+
124+
### List of output files
125+
126+
The following files are output when processing WGS FASTQ:
127+
- `sample.vcf.gz`: Small variant calls (SNVs and indels) from DNAscope.
128+
- `sample_pangenome-aligned.bam`: Aligned, coordinate-sorted and duplicate-marked read data projected to the linear reference.
129+
- `sample_svs.vcf.gz`: Structural variant calls from vg call.
130+
- `sample_cnv.vcf.gz`: Copy number variant calls from CNVscope.
131+
- `sample_ploidy.json`: Estimated sample sex and ploidy data.
132+
- `sample_metrics`: A directory containing QC metrics for the analyzed sample.
133+
- `sample_metrics/{sample}.txt.alignment_stat.txt`: Metrics from the AlignmentStat algo.
134+
- `sample_metrics/{sample}.txt.base_distribution_by_cycle.txt`: Metrics from the BaseDistributionByCycle algo.
135+
- `sample_metrics/{sample}.txt.dedup_metrics.txt`: Metrics from the Dedup algo.
136+
- `sample_metrics/{sample}.txt.gc_bias*`: Metrics from the GCBias algo.
137+
- `sample_metrics/{sample}.txt.insert_size.txt`: Metrics from the InsertSizeMetricAlgo algo.
138+
- `sample_metrics/{sample}.txt.mean_qual_by_cycle.txt`: Metrics from the MeanQualityByCycle algo.
139+
- `sample_metrics/{sample}.txt.qual_distribution.txt`: Metrics from the QualDistribution algo.
140+
- `sample_metrics/{sample}.txt.wgs.txt`: Metrics from the WgsMetricsAlgo algo.
141+
- `sample_metrics/multiqc_report.html`: Collected QC metrics aggregated by MultiQC.
142+
143+
### Optional output files
144+
145+
When optional analyses are enabled:
146+
- `sample_hla/`: HLA genotyping results from T1K (when HLA files are provided).
147+
- `sample_kir/`: KIR genotyping results from T1K (when KIR files are provided).
148+
- `sample_segdups/`: Segmental duplication variant calls (when genes are specified).
149+
- `sample_expansion*`: Repeat expansion calls from ExpansionHunter (when a catalog is provided).
150+
151+
[vg]: https://github.com/vgteam/vg
152+
[KMC]: https://github.com/refresh-bio/KMC
153+
[samtools]: https://www.htslib.org/
154+
[bcftools]: http://samtools.github.io/bcftools/bcftools.html
155+
[MultiQC]: https://multiqc.info/
156+
[T1K]: https://github.com/mourisl/T1K
157+
[ExpansionHunter]: https://github.com/Illumina/ExpansionHunter
158+
[segdup-caller]: https://github.com/Sentieon/segdup-caller
159+
[sentieon-models]: https://github.com/Sentieon/sentieon-models
160+
161+
## Example data processing
162+
163+
### Obtaining the pangenome graph files
164+
165+
Download vg the public human pangenome from the HPRC repository
166+
167+
```sh
168+
curl -LO "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz"
169+
curl -LO "https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.hapl"
170+
```
171+
172+
Generate a snarls file from the .gbz with `vg snarls`.
173+
```sh
174+
vg snarls hprc-v1.1-mc-grch38.gbz > hprc-v1.1-mc-grch38.snarls
175+
```
176+
177+
Generate a xg file from the .gbz with `vg convert`.
178+
```sh
179+
vg convert -x --drop-haplotypes hprc-v1.1-mc-grch38.gbz > hprc-v1.1-mc-grch38.xg
180+
```
181+
182+
### Download the hg38 reference genome
183+
184+
Download the reference fasta file and samtools faidx file
185+
```sh
186+
curl -L 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz' | \
187+
gzip -dc > GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
188+
curl -LO 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai'
189+
```
190+
191+
### Download an expansion catalog of clinically relevant expansions
192+
193+
```sh
194+
curl -L 'https://github.com/Illumina/ExpansionHunter/raw/refs/tags/v5.0.0/variant_catalog/hg38/variant_catalog.json' > hg38_variant_catalog.json
195+
```
196+
197+
### Download t1k fasta files
198+
199+
Download the two databases
200+
```sh
201+
perl t1k-build.pl -o hlaidx --download IPD-IMGT/HLA
202+
perl t1k-build.pl -o kiridx --download IPD-KIR
203+
```
204+
205+
Download the gencode gtf file for hg38
206+
```sh
207+
curl -L 'https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz' | \
208+
gzip -dc > gencode.v38.annotation.gtf
209+
```
210+
211+
Bulid the coordinate file
212+
```sh
213+
perl t1k-build.pl -o hlaidx -d hlaidx/hla.dat -g gencode.v38.annotation.gtf
214+
perl t1k-build.pl -o kiridx -d kiridx/kir.dat -g gencode.v38.annotation.gtf
215+
```
216+
217+
### Download a 30x WGS fastq
218+
219+
```sh
220+
curl -LO 'https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R1.fastq.gz'
221+
curl -LO 'https://storage.googleapis.com/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free/30x/HG002.novaseq.pcr-free.30x.R2.fastq.gz'
222+
```
223+
224+
### Run the pangenome pipeline
225+
226+
```sh
227+
sentieon-cli -v pangenome \
228+
--reference GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
229+
--gbz hprc-v1.1-mc-grch38.gbz \
230+
--hapl hprc-v1.1-mc-grch38.hapl \
231+
--snarls hprc-v1.1-mc-grch38.snarls \
232+
--xg hprc-v1.1-mc-grch38.xg \
233+
--model_bundle pangenome.bundle \
234+
--r1_fastq HG002.novaseq.pcr-free.30x.R1.fastq.gz \
235+
--r2_fastq HG002.novaseq.pcr-free.30x.R2.fastq.gz \
236+
--readgroups '@RG\tID:HG002-1\tSM:HG002\tPL:ILLUMINA' \
237+
--expansion_catalog hg38_variant_catalog.json \
238+
--segdup_caller_genes 'CFH,CFHR3,CYP11B1,CYP2D6,GBA,NCF1,PMS2,SMN1,STRC' \
239+
--t1k_hla_seq hlaidx/data_dna_seq.fa \
240+
--t1k_hla_coord hlaidx/data_dna_coord.fa \
241+
--t1k_kir_seq kiridx/data_dna_seq.fa \
242+
--t1k_kir_coord kiridx/data_dna_coord.fa \
243+
HG002_pangenome_analysis.vcf.gz
244+
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
#https://stackoverflow.com/questions/75408641/whats-difference-between-tool-poetry-and-project-in-pyproject-toml
33
[tool.poetry]
44
name = "sentieon_cli"
5-
version = "1.3.0"
5+
version = "1.4.0"
66
description = "entry point for sentieon command-line tools"
77
authors = ["Don Freed <don.freed@sentieon.com>", "Brent <bpederse@gmail.com>"]
88
readme = "README.md"

sentieon_cli/__init__.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from .dnascope import DNAscopePipeline
33
from .dnascope_hybrid import DNAscopeHybridPipeline
44
from .dnascope_longread import DNAscopeLRPipeline
5+
from .pangenome import PangenomePipeline
56

67

78
def main():
@@ -43,6 +44,12 @@ def main():
4344
pipeline.add_arguments(dnascope_hybrid_subparser)
4445
dnascope_hybrid_subparser.set_defaults(pipeline=pipeline.main)
4546

47+
# Pangenome parser
48+
pipeline = PangenomePipeline()
49+
pangenome_subparser = subparsers.add_parser("pangenome")
50+
pipeline.add_arguments(pangenome_subparser)
51+
pangenome_subparser.set_defaults(pipeline=pipeline.main)
52+
4653
args = parser.parse_args()
4754
args.pipeline(args)
4855

0 commit comments

Comments
 (0)