Skip to content

Nested Collections

Alex Redington edited this page Apr 23, 2020 · 1 revision

Nested Collections in Schism

Problem Statement

Sometimes it is desirable to have two nodes in a distributed cluster share deeply nested collections, e.g. {:a [1 2], :b {:c 5 :d "frog"}}. If two nodes were to affect updates, e.g. (update-in <map> [:b :c] inc) and (assoc-in <map> [:b :d] "cat"), the desired convergence would be the map {:a [1 2], :b {:c 6 :d "cat"}}

Building up a deeply nested structure could allow for solving this, but it requires that the client develop a rigorous tree walk based solution for converging nodes from the leaves up.

Delivering better results by handing two deeply nested maps or vectors to schism.core/converge will improve the value presented to clients.

Strategies for implementation

There are two broad directions to go in in solving this:

  1. Making converge's behavior for nested collections do more. (Deliberately not calling this 'smarter')
  2. Make nested collections manage each leaf atom as a discrete entry in a flat map, where the keys are paths in the map. The nested structure reflects projecting the flat map up into a deeper nested structure. (e.g. the above example map in its initial state would have keys and entries {[:a 0] 1 [:a 1] 2 [:b :c] 5 [:b :d] "frog"}) Convergence then becomes a matter of converging the two flat maps using the pre-existing ORMWOT semantics.

Making schism.core/converge do more

If we try to enhance the behavior of converge, it requires altering the semantics of schism collections: we have to examine two entries in the two collections, and if they are atoms, keep the most recently updated, and if they are collections, run converge on them. We can approach this by using a heuristic such as: retain the newest dot of any collection on the tree traversal from root to this node for each collection, and treat that as the timestamp for every value. This would allow copying a submap to a new key in the parent, and the submap's entries each having a newer timestamp than the submap originally has; which is likely the correct behavior.

However it also implies that any wholesale copy at any node of the tree will then privilege the entire subtree with newer timestamps, even if a value was not changed. Consider a map (def ex1 {:a :b, :c {:d :e :f :g} :h {:d :e}}) constructed from schism types. If one were to (def ex1a (assoc ex1 :h (:c ex1))), ex1a would now have structure of {:a :b, :c {:d :e :f :g} :h {:d :e :f :g}. Using this strategy, the entry at [:h :d] would have an updated timestamp, but was not modified.

Making nested collections

This will allow us to track the changes to each atom of the tree with precise timestamps, but requires more book keeping: each update at the root node to do book keeping of the edge modified, all unchanged leaves keep their timestamps, and only new modified leaves get new timestamps. This is contrary to how other schism collections behave, overwriting a key with the same value still twiddles the timestamp, for better or worse.

This requires storing a mirror structure of the client data for attribution data (node id and timestamp). Mirroring the nested structure will avoid having to repeat keys several times for deeply nested structures.

Implementation

Nested collections appear to align most closely with client use. Schism.core will receive three new methods:

schism.core/nested-map will return a new instance of a nesting map. This map will retain data to track each leaf node's update timestamp and author, and will converge similarly to a conventional ORMWOT

schism.core/nested-vector will return a new instance of a nesting vector. It will have functionality similar to a nested-map, with integer keys.

schism.core/nest will accept an arbitrarily nested collection of schism structures, and: return a new structure of the corresponding type from the above two, with each leaf node receiving an update timestamp corresponding to the newest timestamp of any of its ancestor nodes.

The following additional changes will be made:

schism.core/converge will support nested-map and nested-vector, allowing two nodes to gracefully update partially overlapping paths in the tree at the same time and retaining both of their changes

schism.core/converge will not at this time support cross-contaminating semantics: a schism.impl.types.map/Map will not converge with a nested map, nor vice versa. A schism.impl.types.vector/ConvergentVector will not converge with a nested vector, nor vice versa.

Following Steps

Given each invocation of assoc and conj will be computationally expensive, requiring walking the entire subtree from the root down the edge modified, we can definitely get some room to optimize this perf. As Clojure already has a useful interface for modifying a persistent collection in a high performance manner and then rendering the persistent commitment of the aggregate changes, we can opt into the interfaces backing clojure.core/transient and clojure.core/persistent! to allow for multiple changes to get made before incurring the tree-walking costs of updating timestamps. This will probably require breaking the constant time guarantees of these interfaces ¯_(ツ)_/¯