Extracting a FASTA from a Graph
Graph references often contain linear references within them. You might want a copy of one of these, for example to call variants with a linear-reference-based caller like Google's DeepVariant.
If you don't already have a FASTA file for an assembly that is included in a graph, you can use vg to extract the assembly FASTA directly from the graph, like this:
vg paths --extract-fasta -x test/graphs/rgfa_with_reference.rgfa --paths-by GRCh38
Here, the argument to -x should be the graph file, in rGFA, GFA, .vg, .gbz, or any other graph file format that vg can read (see File Formats). The argument to --paths-by should be the prefix of the set of paths you would like to extract; generally you can use a sample or assembly name here. You can use vg paths --list -x <the graph> to get a list of all paths available.
This will produce a FASTA file on standard output:
>GRCh38#0#chr1
GGGGTACA
In most cases, the sequence names in the FASTA will be in PanSN format (see Path Metadata Model); these will match the names used by vg surject, and so a FASTA extracted like this is easy to use with a BAM file produced by vg surject.
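If you need to check which sample, haplotype, and contig each extracted sequence belongs to, the PanSN names can be split on the # delimiter. The following is a minimal sketch rather than part of vg: pansn_fields is a hypothetical helper name, and it assumes the three-field sample#haplotype#contig layout, printing - placeholders for any name that does not have exactly three fields.

```shell
#!/bin/sh
# Sketch: split PanSN-style FASTA headers into sample, haplotype, and contig.
# pansn_fields is a hypothetical helper, not a vg subcommand.
pansn_fields() {
    # Read FASTA on standard input and look only at header lines.
    awk '/^>/ {
        name = substr($1, 2)            # first word of the header, without ">"
        n = split(name, f, "#")         # PanSN fields are separated by "#"
        if (n == 3)
            printf "%s\t%s\t%s\n", f[1], f[2], f[3]
        else
            printf "%s\t-\t-\n", name   # not a three-field PanSN name
    }'
}
```

For example, piping the FASTA shown above through pansn_fields prints one tab-separated line with the fields GRCh38, 0, and chr1.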
To save the FASTA to a file, you can redirect the output with >.
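As a rough illustration, the extraction command and the redirect can be wrapped in a small shell function that also checks that something was actually written. This is only a sketch: extract_reference_fasta and the VG override variable are hypothetical names introduced here, and the check assumes that a missing header line means no path matched the prefix.

```shell
#!/bin/sh
# Sketch: extract one assembly's paths to a FASTA file and sanity-check it.
# VG is a hypothetical override so that a specific vg binary can be chosen.
VG="${VG:-vg}"

extract_reference_fasta() {
    graph="$1"   # graph file in any format vg can read
    prefix="$2"  # path-name prefix, e.g. a sample or assembly name
    out="$3"     # destination FASTA file

    # Same command as above, with standard output redirected to a file.
    "$VG" paths --extract-fasta -x "$graph" --paths-by "$prefix" > "$out" || return 1

    # Assumption: no header line means that no path matched the prefix.
    if ! grep -q '^>' "$out"; then
        echo "no paths matching '$prefix' found in $graph" >&2
        return 1
    fi

    # Report how many sequences were written.
    echo "wrote $(grep -c '^>' "$out") sequence(s) to $out"
}
```

It could be called as extract_reference_fasta test/graphs/rgfa_with_reference.rgfa GRCh38 GRCh38.fa, where GRCh38.fa is just an example output name. Linear-reference tools usually expect an index alongside the FASTA, which samtools faidx can create.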
If you are interested in extracting haplotype paths from a .gbwt file, you can pass the .gbwt file with the -g option to vg paths, and the corresponding .gg file or any matching graph with -x.
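A sketch of what that invocation might look like is shown below. It assumes that --extract-fasta combines with -g in the same way it does with -x alone; extract_haplotype_fasta and the VG override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: extract haplotype paths stored in a GBWT as FASTA.
# VG is a hypothetical override so that a specific vg binary can be chosen.
VG="${VG:-vg}"

extract_haplotype_fasta() {
    gbwt="$1"    # the .gbwt file holding the haplotype paths
    graph="$2"   # the corresponding .gg file, or any matching graph
    out="$3"     # destination FASTA file

    # Pass the GBWT with -g and the graph with -x; FASTA goes to standard output.
    "$VG" paths --extract-fasta -g "$gbwt" -x "$graph" > "$out"
}
```

For example, extract_haplotype_fasta haplotypes.gbwt graph.gg haplotypes.fa, where all three file names are placeholders.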