Quick start

This guide walks you through indexing a small set of genomes and searching for a query sequence.

Step 1: Prepare genome files

Organise your reference genomes as individual FASTA files in a directory:

genomes/
  genome_001.fasta
  genome_002.fasta
  genome_003.fasta
  ...

Supported extensions: .fa, .fasta, .fna, .fsa
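
Before indexing, it can help to confirm which files in the directory carry one of the supported extensions. The sketch below is a pre-flight check written for this guide, not part of Dragon: the helper name `list_genome_files` is hypothetical, and it matches the four listed extensions exactly, since this guide does not say whether upper-case or compressed variants are accepted.

```python
from pathlib import Path

# Extensions listed as supported in this guide.
SUPPORTED_EXTENSIONS = {".fa", ".fasta", ".fna", ".fsa"}


def list_genome_files(directory):
    """Return sorted paths of files in `directory` with a supported extension.

    Hypothetical helper for illustration; not part of the dragon CLI.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in SUPPORTED_EXTENSIONS
    )
```

Calling `list_genome_files("genomes/")` returns the paths that would be candidates for indexing; anything else in the directory is left out of the list.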

Step 2: Build the index

dragon index \
  --input genomes/ \
  --output my_index/ \
  --kmer-size 31 \
  --threads 8

This creates the following files in my_index/:

File              Description
fm_index.bin      FM-index over concatenated unitig sequences
colors.drgn       Roaring-bitmap colour index (unitig → genome mapping)
paths.bin         Genome path index (mmap-friendly v2 format)
specificity.drgn  Per-genome private-unitig sets
unitigs.fa        Unitig sequences from the de Bruijn graph (optional after build)
metadata.json     Index statistics (genome count, k-mer size, total bases)
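
To confirm that a build completed, one option is to check that the files listed above are present. This is a minimal sketch, not a Dragon feature: the helper name `missing_index_files` is hypothetical, and treating `unitigs.fa` as safe to omit is an assumption based on its "optional after build" note.

```python
from pathlib import Path

# File names taken from the table in this guide.
EXPECTED_INDEX_FILES = [
    "fm_index.bin",
    "colors.drgn",
    "paths.bin",
    "specificity.drgn",
    "unitigs.fa",
    "metadata.json",
]

# Assumption: "optional after build" means the file may be absent.
OPTIONAL_FILES = {"unitigs.fa"}


def missing_index_files(index_dir):
    """Return the required index files that are absent from `index_dir`."""
    index_dir = Path(index_dir)
    return [
        name for name in EXPECTED_INDEX_FILES
        if name not in OPTIONAL_FILES and not (index_dir / name).is_file()
    ]
```

An empty list from `missing_index_files("my_index/")` means every required file was found.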

Step 3: Search

dragon search \
  --index my_index/ \
  --query query_genes.fasta \
  --output results.paf \
  --threads 8

Step 4: Inspect results

# View PAF output
head results.paf

# Count hits per query
cut -f1 results.paf | sort | uniq -c | sort -rn | head

# View index statistics
dragon info --index my_index/

Example output

PAF format (tab-separated):

gene_001  1500  10  1490  +  genome_042  4800000  123456  124946  1450  1490  60  AS:i:2900
gene_001  1500  10  1490  +  genome_108  5100000  234567  236057  1430  1490  55  AS:i:2860

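Columns, in order: query name, query length, query start, query end, strand, target name, target length, target start, target end, number of matching bases, alignment block length, mapping quality, and optional tags (here AS, conventionally the alignment score). In PAF, start coordinates are 0-based and end coordinates are exclusive.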
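
For downstream scripting, the twelve mandatory columns can be read into a small record type. The sketch below assumes standard tab-separated PAF with SAM-style `TAG:TYPE:VALUE` tags; the names `PafRecord` and `parse_paf_line` are illustrative, and the `identity` property (matches divided by alignment length) is a common derived measure rather than something Dragon reports.

```python
from dataclasses import dataclass


@dataclass
class PafRecord:
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int
    alignment_length: int
    mapping_quality: int
    tags: dict

    @property
    def identity(self):
        """Fraction of alignment columns that are matches."""
        return self.matches / self.alignment_length


def parse_paf_line(line):
    """Parse one tab-separated PAF line into a PafRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 PAF columns, got {len(fields)}")
    tags = {}
    for tag in fields[12:]:
        # Optional tags use the TAG:TYPE:VALUE layout, e.g. AS:i:2900.
        name, type_code, value = tag.split(":", 2)
        tags[name] = int(value) if type_code == "i" else value
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]), fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]), tags,
    )
```

Applied to the first example row, this gives `matches=1450` and `alignment_length=1490`, so `identity` is 1450/1490.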

Step 5 (optional): Multi-shard search

If your collection is too large for a single index, build several shards and search them as one:

dragon search \
  --index shard_a/ \
  --shard shard_b/ \
  --shard shard_c/ \
  --query query_genes.fasta \
  --output results.paf

Shards are loaded one at a time, which keeps memory use bounded, and the results are merged with per-genome deduplication.
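
The merge step can be pictured with a short sketch. This is not Dragon's implementation: it assumes that deduplication means keeping a single best-scoring hit per (query, genome) pair, and the function name `merge_shard_hits` and the `(query, genome, score)` tuple layout are hypothetical.

```python
def merge_shard_hits(shard_results):
    """Merge per-shard hit lists, keeping one hit per (query, genome) pair.

    `shard_results` is an iterable of lists of (query, genome, score) tuples.
    Keeping the highest score is an assumption made for this sketch.
    """
    best = {}
    for hits in shard_results:  # process one shard's hits at a time
        for query, genome, score in hits:
            key = (query, genome)
            if key not in best or score > best[key]:
                best[key] = score
    # Order by query, then descending score, then genome name.
    return sorted(
        ((q, g, s) for (q, g), s in best.items()),
        key=lambda hit: (hit[0], -hit[2], hit[1]),
    )
```

Because only the running `best` table is kept between shards, each shard's hits can be discarded as soon as they have been folded in.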

Step 6 (optional): Cloud-native deployment

Export an index as a Zarr v3 store for direct reading from S3 / GCS:

dragon export-zarr -i my_index/ -o my_index.zarr/
aws s3 sync my_index.zarr/ s3://your-bucket/my_index/

# From any machine with internet access (public buckets need no AWS credentials):
pip install 'zarr>=3.0' s3fs numcodecs
python scripts/zarr_demo.py s3://your-bucket/my_index

A pre-built 16,000-genome demo lives at s3://dragon-zarr/saureus/b1/ (eu-west-2, public-read). See Architecture overview for details.

Step 7 (optional): Surveillance summary

For AMR-gene panels and similar epidemiological queries, request a per-species summary instead of raw PAF:

dragon search -i my_index/ -q amr_genes.fa --format summary > prevalence.tsv

Or post-process an existing PAF:

dragon summarize --input results.paf --format tsv > prevalence.tsv
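
To show the kind of aggregation a prevalence table involves, here is a rough sketch that works directly on PAF lines. It does not reproduce Dragon's summary format: it reports prevalence per query gene across genomes rather than per species (a genome-to-species mapping is not described in this guide), it assumes the target name column identifies the genome, and the function name `genome_prevalence` is hypothetical.

```python
from collections import defaultdict


def genome_prevalence(paf_lines, total_genomes):
    """Return {query: fraction of genomes with at least one hit}.

    Assumes PAF column 1 is the query name and column 6 the genome name.
    """
    genomes_per_query = defaultdict(set)
    for line in paf_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        genomes_per_query[fields[0]].add(fields[5])
    return {
        query: len(genomes) / total_genomes
        for query, genomes in genomes_per_query.items()
    }
```

For example, a gene with hits in 2 distinct genomes out of an index of 4 gets a prevalence of 0.5; repeated hits to the same genome count once.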