Help, what is a graph?
This page explains variation graphs to non-computer scientists and people new to the field. Also see our explainer videos.
A pangenome is a collection of genome sequences and the homology among them. There are many potential ways of representing a pangenome. Naively, a pangenome could be stored in a file containing the full haplotype sequences of all of the assemblies.
But at 3 billion base pairs per human genome, and hundreds of genomes per pangenome, a file like this quickly becomes too big to work with efficiently. Additionally, because human genomes are so similar to one another, a lot of redundant sequence would be stored multiple times. A more compact representation of the pangenome stores the sequences common to all genomes only once. Then, for each haplotype in the pangenome, it stores only the sequence at each site of variation.
This collapsing of homologous sequences is the basis for creating a variation graph. In a variation graph, a node represents a nucleotide sequence and an edge occurs between sequences that can be connected. Homologous sequences in the pangenome are collapsed into a single node, and variants unique to each genome become separate nodes. Edges occur between nodes that are adjacent in the original sequences.
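The collapsing described above can be sketched in a few lines of Python. This is only an illustration: the three haplotype sequences, the node ids, and the way the sequences are split into nodes are made-up toy data, not taken from the figures on this page or from vg itself.

```python
# Hypothetical toy data: three short haplotypes that differ at a single SNP.
haplotypes = {
    "hap1": "ACGTTAGGCA",
    "hap2": "ACGTTCGGCA",
    "hap3": "ACGTTAGGCA",
}

# Nodes hold sequence. Sequence shared by all haplotypes is stored once.
nodes = {
    1: "ACGTT",  # common to all haplotypes
    2: "A",      # allele carried by hap1 and hap3
    3: "C",      # allele carried by hap2
    4: "GGCA",   # common to all haplotypes
}

# Each haplotype written as the list of nodes it passes through.
walks = {
    "hap1": [1, 2, 4],
    "hap2": [1, 3, 4],
    "hap3": [1, 2, 4],
}

# Edges join nodes that are adjacent in at least one haplotype.
edges = set()
for walk in walks.values():
    edges.update(zip(walk, walk[1:]))

# Compare how many bases each representation has to store.
naive_bases = sum(len(seq) for seq in haplotypes.values())
graph_bases = sum(len(seq) for seq in nodes.values())

print(sorted(edges))
print(naive_bases, graph_bases)
```

Even in this tiny example the graph stores fewer bases than the three full sequences, and the gap grows with every additional similar haplotype.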
The graph can be collapsed further if there is homology within variants at one site. For example, haplotypes 1 and 3 have insertions that differ by a single SNP. The homologous sequences in nodes 5 and 6 can be collapsed to form a nested site of variation, representing the SNP nested within an insertion.
The original haplotype sequences can be recovered by concatenating the sequences in nodes. For example, haplotype 1 can be found by taking node 1, node 2, node 4, node 5, node 7, node 8, and node 10. A sequence of nodes like this is known as a walk or a path through the graph. For a path through the graph to be valid, there must be an edge between each pair of consecutive nodes. For example, there is no valid path that walks from node 1 directly to node 4 without taking node 2 or node 3.
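As a rough sketch of how a walk spells out a sequence, and of what makes a walk valid, consider the following Python snippet. The four-node graph, its sequences, and the function names `is_valid_walk` and `spell` are hypothetical and chosen only for illustration. They are not the graph from the figure and not part of the vg API.

```python
# Hypothetical toy graph: node id -> sequence, plus directed adjacencies.
nodes = {1: "ACGTT", 2: "A", 3: "C", 4: "GGCA"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}


def is_valid_walk(walk):
    """Return True if every node exists and every consecutive pair has an edge."""
    if any(node_id not in nodes for node_id in walk):
        return False
    return all((a, b) in edges for a, b in zip(walk, walk[1:]))


def spell(walk):
    """Concatenate the node sequences along a valid walk."""
    if not is_valid_walk(walk):
        raise ValueError(f"not a valid walk: {walk}")
    return "".join(nodes[node_id] for node_id in walk)


print(spell([1, 2, 4]))       # one haplotype
print(spell([1, 3, 4]))       # the other allele at the variable site
print(is_valid_walk([1, 4]))  # skips the variable site, so no edge exists
```

In this toy graph a walk from node 1 straight to node 4 is rejected, because no haplotype ever placed those two nodes next to each other.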
This structure is a variation graph.
A variation graph is a sequence graph (the nodes and edges) and a collection of haplotype paths through the graph. The sequence graph model used by the vg toolkit is a bi-directed graph with some extra restrictions. Nodes in sequence graphs have two sides, which are arbitrarily labelled as the left and right node side. Edges connect pairs of node sides. A valid path through the graph must enter and exit each node through opposite node sides. This is intuitive if we consider a node traversal to be a reading of the sequence. We cannot visit a node without reading its sequence, and the sequence must be read either from left to right or from right to left. A right-to-left traversal of a node corresponds to the reverse complement of its sequence. A valid path must specify the orientation of each node in the path. For example, the blue path representing haplotype 1 above would be node 1 traversing forwards, node 2 traversing forwards, etc. We refer to a node and orientation as a visit or a traversal of that node.
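One way to make the bi-directed model concrete is to store each edge as an unordered pair of node sides and each visit as a node id plus an orientation flag. The sketch below does this in Python under simplifying assumptions: the three-node graph, the `"L"`/`"R"` side labels, and all function names are hypothetical, and this is not how vg stores graphs internally.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence read right-to-left on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical toy graph: node id -> sequence (as read left-to-right).
nodes = {1: "ACG", 2: "TTA", 3: "GC"}


def side_edge(a, a_side, b, b_side):
    """An edge is an unordered pair of node sides ("L" or "R")."""
    return frozenset([(a, a_side), (b, b_side)])


edges = {
    side_edge(1, "R", 2, "L"),
    side_edge(2, "R", 3, "L"),
}


def exit_side(is_reverse):
    # A forward visit leaves through the right side, a reverse visit through the left.
    return "L" if is_reverse else "R"


def entry_side(is_reverse):
    # A forward visit enters through the left side, a reverse visit through the right.
    return "R" if is_reverse else "L"


def is_valid_path(path):
    """path is a list of visits: (node_id, is_reverse)."""
    for (a, a_rev), (b, b_rev) in zip(path, path[1:]):
        if side_edge(a, exit_side(a_rev), b, entry_side(b_rev)) not in edges:
            return False
    return True


def spell(path):
    """Read each node forwards or as its reverse complement, then concatenate."""
    return "".join(
        reverse_complement(nodes[n]) if rev else nodes[n] for n, rev in path
    )


forward = [(1, False), (2, False), (3, False)]
# The same path walked in the opposite direction: reverse order, flip orientations.
backward = [(n, not rev) for n, rev in reversed(forward)]

print(is_valid_path(forward), spell(forward))
print(is_valid_path(backward), spell(backward))
print(is_valid_path([(1, False), (2, True)]))  # would need an edge 1:R to 2:R
```

Because edges are unordered pairs of sides, the same two edges support both the forward path and its reverse, and the reverse path spells the reverse complement of the forward one.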
Variation graphs can represent complex variants such as duplications and inversions.
In this example, the insertion represented by nodes 5-8 can be duplicated by taking the edge between the left side of node 5 and the right side of node 8. Node 13 represents an inversion; a path from node 12 to node 14 can traverse node 13 forwards (left-to-right) or backwards (right-to-left).
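The duplication and inversion edges can also be sketched in Python. The node ids below are borrowed from the example, but everything else is an assumption made for illustration: the sequences are invented, the insertion is simplified to just nodes 5 and 8, and the chain connecting node 8 to node 12 is hypothetical rather than read off the figure.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Invented sequences for a simplified chain 5 -> 8 -> 12 -> 13 -> 14.
nodes = {5: "AT", 8: "GG", 12: "C", 13: "AAG", 14: "T"}


def side_edge(a, a_side, b, b_side):
    """An edge is an unordered pair of node sides ("L" or "R")."""
    return frozenset([(a, a_side), (b, b_side)])


edges = {
    side_edge(5, "R", 8, "L"),
    side_edge(8, "R", 5, "L"),    # duplication: loop back to the start of the insertion
    side_edge(8, "R", 12, "L"),
    side_edge(12, "R", 13, "L"),  # enter node 13 forwards
    side_edge(13, "R", 14, "L"),  # leave node 13 forwards
    side_edge(12, "R", 13, "R"),  # enter node 13 backwards (inversion)
    side_edge(13, "L", 14, "L"),  # leave node 13 backwards (inversion)
}


def spell(path):
    """Spell a path of (node_id, is_reverse) visits, checking every step has an edge."""
    for (a, a_rev), (b, b_rev) in zip(path, path[1:]):
        leaving = "L" if a_rev else "R"
        entering = "R" if b_rev else "L"
        if side_edge(a, leaving, b, entering) not in edges:
            raise ValueError(f"no edge between {a}:{leaving} and {b}:{entering}")
    pieces = []
    for node_id, is_reverse in path:
        seq = nodes[node_id]
        pieces.append(seq.translate(COMPLEMENT)[::-1] if is_reverse else seq)
    return "".join(pieces)


F, R = False, True
plain = [(5, F), (8, F), (12, F), (13, F), (14, F)]
duplicated = [(5, F), (8, F), (5, F), (8, F), (12, F), (13, F), (14, F)]
inverted = [(5, F), (8, F), (12, F), (13, R), (14, F)]

print(spell(plain))
print(spell(duplicated))
print(spell(inverted))
```

The duplicated path visits nodes 5 and 8 twice by taking the loop edge, and the inverted path reads node 13 as its reverse complement. No extra nodes are needed for either variant, only extra edges.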