Starfish is a tool for comparing and intersecting multiple VCF files with haplotype awareness by using the powerful RTGTools vcfeval engine. The name "Starfish" comes from the shape of the Venn diagram the program can draw (with 5 VCFs!).
git clone --recursive https://github.com/dancooke/starfish

There are just three required options:
- --sdf (short -t): The RTG Tools SDF reference directory (use rtg format).
- --variants (short -V): A list of VCF files to intersect.
- --output (short -O): A directory path to write intersections.
For example:
./starfish \
-t reference.sdf \
-V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
-O isec

This results in the directory isec containing the following files:
- A.vcf.gz: Records unique to vcf1.vcf.gz.
- B.vcf.gz: Records unique to vcf2.vcf.gz.
- C.vcf.gz: Records unique to vcf3.vcf.gz.
- AB.vcf.gz: Records in vcf1.vcf.gz and vcf2.vcf.gz but not vcf3.vcf.gz.
- AC.vcf.gz: Records in vcf1.vcf.gz and vcf3.vcf.gz but not vcf2.vcf.gz.
- BC.vcf.gz: Records in vcf2.vcf.gz and vcf3.vcf.gz but not vcf1.vcf.gz.
- ABC.vcf.gz: Records common to vcf1.vcf.gz, vcf2.vcf.gz, and vcf3.vcf.gz.
In other words, the VCF files are labelled (in input order) with upper-case letters, and each file in the output directory contains the records shared by exactly the VCFs whose labels appear in its filename.
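The labelling scheme can be sketched in a few lines of Python. This is only an illustration of the naming convention described above, not Starfish's own code; the function name intersection_labels is hypothetical, and the sketch assumes no more than 26 inputs.

```python
from itertools import combinations
from string import ascii_uppercase


def intersection_labels(vcfs):
    """Map each expected output filename to the input VCFs named by its label.

    Hypothetical helper: it only mirrors the naming convention in this README.
    """
    letters = ascii_uppercase[:len(vcfs)]  # A, B, C, ... in input order
    labelled = dict(zip(letters, vcfs))
    result = {}
    # Enumerate every non-empty combination of labels, smallest first.
    for size in range(1, len(letters) + 1):
        for combo in combinations(letters, size):
            result["".join(combo) + ".vcf.gz"] = [labelled[c] for c in combo]
    return result


if __name__ == "__main__":
    inputs = ["vcf1.vcf.gz", "vcf2.vcf.gz", "vcf3.vcf.gz"]
    for filename, members in intersection_labels(inputs).items():
        print(filename, "<-", ", ".join(members))
```

For the three inputs in the example, this lists the same seven filenames as above, each paired with the VCFs whose records it holds.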
By default, all regions in the reference genome (which must be the same for all input VCFs) are used. To restrict comparison to a subset of regions, supply a BED file to the --regions option.
By default, filtered records are not included in the comparison. To include them, add the --all-records option to your command.
By default, records will not be matched if the genotypes do not match. To ignore genotype mismatches (and only compare called alleles), use the --squash-ploidy option:
./starfish \
-t reference.sdf \
-V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
-O isec \
--squash-ploidy

To compare callsets without genotypes, using only ALT alleles:
./starfish \
-t reference.sdf \
-V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
-O isec \
--samples ALT \
--squash-ploidy

Starfish can draw Venn diagrams showing the number of intersected records for up to 6 VCFs (if the pyvenn package is installed). To do this, supply a name for each of the VCFs with the --names option and add the --venn option:
./starfish \
-t reference.sdf \
-V vcf1.vcf.gz vcf2.vcf.gz vcf3.vcf.gz \
-O isec \
--names Octopus GATK4 FreeBayes \
--venn

Starfish has a number of limitations:
- Only haploid and diploid genotype comparisons are supported (due to RTGTools vcfeval).
- Only one sample can be compared. You can use the --sample option if your VCFs have multiple samples, but the given sample must be present in all input VCFs.
- The number of unique intersections grows exponentially with the number of input VCFs.
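
To give a feel for the last limitation, here is a rough Python sketch that counts the intersections. It assumes one output file per non-empty combination of inputs, as in the three-VCF example (7 files); the function name n_intersections is hypothetical and not part of Starfish.

```python
def n_intersections(n_vcfs):
    """Number of non-empty label combinations for n_vcfs inputs.

    Assumption: one output file per non-empty subset of the inputs.
    """
    return 2 ** n_vcfs - 1


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} VCFs -> {n_intersections(n)} intersections")
```

Under that assumption, each additional input VCF roughly doubles the number of output files.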
