
Redesign data distribution to support elasticity #851

@senderista

Description


The ability to add and remove workers dynamically on a running cluster (required, e.g., by @jortiz16's PerfEnforce work) will require some fundamental changes to Myria architecture. Briefly, we need to add a level of indirection between workers and the data they host. The new unit of data storage granularity will still be called a "partition", but these partitions will be much smaller than what can fit on a single worker, and will be immutable though mobile (i.e., they can move between workers but will never be split or merged as workers are added and removed). The new hash space for our hash partitioning will be the range of these partition IDs (which is fixed for the life of the cluster), rather than the range of worker IDs (which can grow and shrink over the life of the cluster). Instead of rehashing data when the range of worker IDs changes, existing data partitions will be moved among workers, in a manner similar to but distinct from consistent hashing.

How Relations are represented in the Myria catalog

A Myria Relation will continue to represent a set of partitions, but those partitions will now be 1) fixed in number for the life of the cluster, 2) much more fine-grained (and hence more numerous), and 3) no longer mapped 1:1 to workers.

Fixed partitions

The number of partitions can be configured only at the time of cluster creation. If it is ever changed, it will require all data in all relations to be repartitioned (by hashing or round-robin distribution). Conceptually, all relations have NUM_PARTITIONS partitions, but some of them may be empty if a minimum size threshold (in bytes or tuples) is enforced for each partition. The set of nonempty partitions is always a prefix of the range of all partitions.
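As a sketch of the "nonempty prefix" idea, the number of nonempty partitions could be derived from the relation's size and a per-partition minimum. The constant values and the function name here are hypothetical, not part of the design:

```python
NUM_PARTITIONS = 256                     # fixed at cluster creation (illustrative value)
MIN_PARTITION_BYTES = 64 * 1024 * 1024   # assumed 64 MiB minimum partition size

def nonempty_partition_count(relation_bytes: int) -> int:
    """Return how many partitions actually hold data for a relation.

    Small relations occupy only a prefix [0, k) of the partition range;
    the remaining partitions stay empty.
    """
    if relation_bytes == 0:
        return 0
    # Use as many partitions as possible without dropping below the
    # minimum partition size, capped at the fixed partition count.
    k = relation_bytes // MIN_PARTITION_BYTES
    return max(1, min(NUM_PARTITIONS, k))
```

A relation smaller than the threshold would land entirely in partition 0, while a large relation fills all NUM_PARTITIONS partitions.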

Fine-grained partitions

To minimize the overhead of repartitioning when workers are added to or removed from the cluster, we move immutable partitions among workers rather than repartitioning the data. Instead of directly hashing tuples to workers, tuples are hashed to partitions, where the partition space is fixed for the life of the cluster. These partitions are fine-grained enough to be evenly redistributed across a large cluster when just one worker is added or removed. A reasonable value for the number of partitions might be 2^6 to 2^8.
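The two-level lookup described above can be sketched as follows; `partition_of`, `worker_of`, and the example mapping are illustrative names, not Myria APIs:

```python
NUM_PARTITIONS = 256  # fixed for the life of the cluster (illustrative value)

def partition_of(key) -> int:
    # hash() stands in for whatever hash function the PartitionFunction uses;
    # the tuple is hashed into the fixed partition space, never directly to a worker.
    return hash(key) % NUM_PARTITIONS

# Example coordinator-maintained mapping for a 4-worker cluster:
# partition p is hosted on worker p % 4.
partition_to_worker = {p: p % 4 for p in range(NUM_PARTITIONS)}

def worker_of(key) -> int:
    # Second level of indirection: look the partition up in the explicit mapping.
    return partition_to_worker[partition_of(key)]
```

When the cluster grows or shrinks, only `partition_to_worker` changes; `partition_of` is stable, so no tuple is ever rehashed.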

Explicit mapping from partitions to workers

To achieve similar performance to consistent hashing (only a small amount of data needs to be moved when nodes are added or removed), without the overhead of rescanning all data to repartition it (as we would need to do when cluster membership changes), we exploit the fact that we have a centralized system with a stable leader and an existing metadata store. Rather than using a randomized approach to map partitions to workers, we maintain an explicit mapping at the coordinator from partitions to workers. (The mapping function can be essentially arbitrary, as long as it distributes partitions sufficiently uniformly across workers.) The result is that like consistent hashing, we only need to move a small amount of data when cluster membership changes, but unlike consistent hashing, we only read the data we are going to move. Moreover, like consistent hashing, when a new node is added, only that node receives writes, so nodes currently servicing queries incur no additional write workload.
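One way the coordinator could update the explicit mapping when a worker joins, moving only the minimum number of partitions, is sketched below. The function and variable names are hypothetical:

```python
from collections import defaultdict

def rebalance_onto_new_worker(mapping: dict[int, int], new_worker: int) -> list[tuple[int, int, int]]:
    """Mutate `mapping` (partition -> worker) so the new worker holds its fair
    share, and return the moves made as (partition, old_worker, new_worker).

    Only the returned partitions are physically moved; all other data stays put,
    which is the consistent-hashing-like property the design aims for.
    """
    by_worker = defaultdict(list)
    for part, w in mapping.items():
        by_worker[w].append(part)
    workers = set(by_worker) | {new_worker}
    target = len(mapping) // len(workers)  # partitions the new worker should host
    moves = []
    while len(by_worker[new_worker]) < target:
        # Take one partition from the currently most-loaded worker.
        donor = max(by_worker, key=lambda w: len(by_worker[w]) if w != new_worker else -1)
        part = by_worker[donor].pop()
        by_worker[new_worker].append(part)
        mapping[part] = new_worker
        moves.append((part, donor, new_worker))
    return moves
```

Note that all writes in the returned move list target `new_worker`, matching the observation that only the joining node receives writes.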

How Relations are stored and queried in Postgres

Each Relation in the Myria master catalog is distributed over NUM_PARTITIONS partitions physically implemented as Postgres tables on Myria workers. The Postgres implementation of a relation is a view which simply unions the tables representing all partitions of the relation. To simplify implementation, the view definition will be identical on all workers, and each partition will be represented by a Postgres table on all workers, including workers which do not host that partition. (Tables representing partitions not hosted on a worker will simply be empty, and as noted above, some partitions may be empty on all workers.) Hence, moving a partition from a worker means issuing a TRUNCATE TABLE statement for the partition's local table, and moving a partition to a worker could be implemented with a Postgres COPY statement targeting the partition's local table. The overhead of scanning empty tables should be minimal relative to the ease of implementation this uniform approach affords (especially if Postgres maintains accurate statistics, which it should, given that the tables are immutable between partition rebalancings).
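The per-worker DDL implied by this scheme can be sketched by generating the SQL as strings; the table/view naming convention (`rel_pNNN`) is invented for illustration:

```python
NUM_PARTITIONS = 256  # illustrative value

def partition_table(relation: str, p: int) -> str:
    # One Postgres table per partition, present on every worker.
    return f"{relation}_p{p:03d}"

def create_view_sql(relation: str) -> str:
    # Identical on every worker: the union of all partition tables,
    # including the (empty) tables for partitions this worker does not host.
    selects = " UNION ALL ".join(
        f"SELECT * FROM {partition_table(relation, p)}" for p in range(NUM_PARTITIONS)
    )
    return f"CREATE VIEW {relation} AS {selects};"

def evict_partition_sql(relation: str, p: int) -> str:
    # Moving a partition off a worker reduces to truncating its local table.
    return f"TRUNCATE TABLE {partition_table(relation, p)};"

def load_partition_sql(relation: str, p: int) -> str:
    # Moving a partition onto a worker could use COPY into its local table.
    return f"COPY {partition_table(relation, p)} FROM STDIN;"
```

Because the view definition is identical everywhere, partition movement never requires DDL changes, only TRUNCATE and COPY against the fixed set of tables.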

Changes to Myria data operations

Ingesting a new Relation

When a new Relation is ingested (e.g., via MyriaL IMPORT and STORE), we will create NUM_PARTITIONS new Postgres tables in the Postgres database of each worker in the cluster, each table named for its corresponding partition ID. We will additionally create a new Postgres view in each worker's database consisting of the UNION of all partition tables for the relation. We will then create an initial mapping of partitions to workers in the catalog, based on the number of bytes/tuples in the relation. Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given PartitionFunction.

Deleting an existing Relation

When an existing Relation is deleted (e.g., through the DELETE REST API), we will first tombstone its metadata in the catalog in a single SQLite transaction, then delete the corresponding Postgres views and tables on each worker (in a local transaction, with idempotent retry globally), then delete its metadata from the catalog in a single SQLite transaction.
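The tombstone-then-delete sequence can be sketched as below, showing why a crashed deletion is safe to retry; the catalog and worker interfaces are hypothetical stand-ins, not Myria APIs:

```python
NUM_PARTITIONS = 256  # illustrative value

def delete_relation(catalog, workers, relation: str) -> None:
    # 1. Tombstone in a single catalog (SQLite) transaction: the relation is
    #    now logically gone even if we crash before physical cleanup finishes.
    catalog.mark_tombstoned(relation)
    # 2. Physical cleanup on each worker, in a local transaction per worker.
    #    DROP ... IF EXISTS makes the global retry idempotent.
    for w in workers:
        w.execute(f"DROP VIEW IF EXISTS {relation};")
        for p in range(NUM_PARTITIONS):
            w.execute(f"DROP TABLE IF EXISTS {relation}_p{p:03d};")
    # 3. Finally remove the metadata in a single catalog transaction.
    catalog.remove_metadata(relation)
```

If the coordinator crashes between steps 1 and 3, the tombstone tells the retry logic to redo step 2 (harmless, since the DDL is idempotent) and then complete step 3.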

New Myria cluster membership operations

AddWorkers

When a new node is added to the cluster, we will create a fixed number of new workers (e.g., one for each CPU core on the node). Then we will query the catalog and create new tables and views on each new worker for each existing relation, as described above. Initially we could force the user to manually rebalance relations, e.g., by issuing SCAN and STORE commands from MyriaL. Later we could automate the rebalancing, carefully considering (and testing!) its impact on running queries (but as noted above, all writes will be confined to the new node, which is not yet online, and all reads will be evenly distributed across the cluster).

RemoveWorkers

When an existing node is removed from the cluster, we must first move all its existing partitions to workers on other nodes (this of course inverts the I/O workload for adding a new node: reads are confined to the node being removed, but all nodes, in general, must accept writes). In principle, this cannot be made fault-tolerant without replication, but hosting Postgres tablespaces on persistent EBS volumes would probably give us acceptable durability, and backing up data to S3 would yield essentially 100% durability.
