
Redesign data distribution to support elasticity #851

@senderista

Description


The ability to add and remove workers dynamically on a running cluster (required, e.g., by @jortiz16's PerfEnforce work) will require some fundamental changes to Myria architecture. Briefly, we need to add a level of indirection between workers and the data they host. The new unit of data storage granularity will still be called a "partition", but these partitions will be much smaller than what can fit on a single worker, and will be immutable though mobile (i.e., they can move between workers but will never be split or merged as workers are added and removed). The new hash space for our hash partitioning will be the range of these partition IDs (which is fixed for the life of the cluster), rather than the range of worker IDs (which can grow and shrink over the life of the cluster). Instead of rehashing data when the range of worker IDs changes, existing data partitions will be moved among workers, in a manner similar to but distinct from consistent hashing.

How Relations are represented in the Myria catalog

A Myria Relation will continue to represent a set of partitions, but those partitions will now be 1) fixed in number for the life of the cluster, 2) much more fine-grained (and hence more numerous), and 3) no longer mapped 1:1 to workers.

Fixed partitions

The number of partitions can be configured only at the time of cluster creation. If it is ever changed, it will require all data in all relations to be repartitioned (by hashing or round-robin distribution). Conceptually, all relations have NUM_PARTITIONS partitions, but some of them may be empty if a minimum size threshold (in bytes or tuples) is enforced for each partition. The set of nonempty partitions is always a prefix of the range of all partitions.
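As a sketch of the "nonempty prefix" idea, the number of nonempty partitions could be derived from the relation's size and a per-partition minimum. The constant values and the function name here are hypothetical, not part of the design:

```python
NUM_PARTITIONS = 256                     # fixed at cluster creation (illustrative value)
MIN_PARTITION_BYTES = 64 * 1024 * 1024   # assumed 64 MiB minimum partition size

def nonempty_partition_count(relation_bytes: int) -> int:
    """Return how many partitions actually hold data for a relation.

    Small relations occupy only a prefix [0, k) of the partition range;
    the remaining partitions stay empty.
    """
    if relation_bytes == 0:
        return 0
    # Use as many partitions as possible without dropping below the
    # minimum partition size, capped at the fixed partition count.
    k = relation_bytes // MIN_PARTITION_BYTES
    return max(1, min(NUM_PARTITIONS, k))
```

A relation smaller than the threshold would land entirely in partition 0, while a large relation fills all NUM_PARTITIONS partitions.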

Fine-grained partitions

To minimize the overhead of repartitioning when workers are added to or removed from the cluster, we move immutable partitions among workers rather than repartitioning the data. Instead of directly hashing tuples to workers, tuples are hashed to partitions, where the partition space is fixed for the life of the cluster. These partitions are fine-grained enough to be evenly redistributed across a large cluster when just one worker is added or removed. A reasonable value for the number of partitions might be 2^6 to 2^8.
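The two-level lookup described above can be sketched as follows; `partition_of`, `worker_of`, and the example mapping are illustrative names, not Myria APIs:

```python
NUM_PARTITIONS = 256  # fixed for the life of the cluster (illustrative value)

def partition_of(key) -> int:
    # hash() stands in for whatever hash function the PartitionFunction uses;
    # the tuple is hashed into the fixed partition space, never directly to a worker.
    return hash(key) % NUM_PARTITIONS

# Example coordinator-maintained mapping for a 4-worker cluster:
# partition p is hosted on worker p % 4.
partition_to_worker = {p: p % 4 for p in range(NUM_PARTITIONS)}

def worker_of(key) -> int:
    # Second level of indirection: look the partition up in the explicit mapping.
    return partition_to_worker[partition_of(key)]
```

When the cluster grows or shrinks, only `partition_to_worker` changes; `partition_of` is stable, so no tuple is ever rehashed.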

Explicit mapping from partitions to workers

To achieve similar performance to consistent hashing (only a small amount of data needs to be moved when nodes are added or removed), without the overhead of rescanning all data to repartition it (as we would need to do when cluster membership changes), we exploit the fact that we have a centralized system with a stable leader and an existing metadata store. Rather than using a randomized approach to map partitions to workers, we maintain an explicit mapping at the coordinator from partitions to workers. (The mapping function can be essentially arbitrary, as long as it distributes partitions sufficiently uniformly across workers.) The result is that like consistent hashing, we only need to move a small amount of data when cluster membership changes, but unlike consistent hashing, we only read the data we are going to move. Moreover, like consistent hashing, when a new node is added, only that node receives writes, so nodes currently servicing queries incur no additional write workload.
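One way the coordinator could update the explicit mapping when a worker joins, moving only the minimum number of partitions, is sketched below. The function and variable names are hypothetical:

```python
from collections import defaultdict

def rebalance_onto_new_worker(mapping: dict[int, int], new_worker: int) -> list[tuple[int, int, int]]:
    """Mutate `mapping` (partition -> worker) so the new worker holds its fair
    share, and return the moves made as (partition, old_worker, new_worker).

    Only the returned partitions are physically moved; all other data stays put,
    which is the consistent-hashing-like property the design aims for.
    """
    by_worker = defaultdict(list)
    for part, w in mapping.items():
        by_worker[w].append(part)
    workers = set(by_worker) | {new_worker}
    target = len(mapping) // len(workers)  # partitions the new worker should host
    moves = []
    while len(by_worker[new_worker]) < target:
        # Take one partition from the currently most-loaded worker.
        donor = max(by_worker, key=lambda w: len(by_worker[w]) if w != new_worker else -1)
        part = by_worker[donor].pop()
        by_worker[new_worker].append(part)
        mapping[part] = new_worker
        moves.append((part, donor, new_worker))
    return moves
```

Note that all writes in the returned move list target `new_worker`, matching the observation that only the joining node receives writes.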

How Relations are stored and queried in Postgres

Each Relation in the Myria master catalog is distributed over NUM_PARTITIONS partitions physically implemented as Postgres tables on Myria workers. The Postgres implementation of a relation is a view which simply unions the tables representing all partitions of the relation. To simplify implementation, the view definition will be identical on all workers, and each partition will be represented by a Postgres table on all workers, including workers which do not host that partition. (Tables representing partitions not hosted on a worker will simply be empty, and as noted above, some partitions may be empty on all workers.) Hence, moving a partition from a worker means issuing a TRUNCATE TABLE statement for the partition's local table, and moving a partition to a worker could be implemented with a Postgres COPY statement targeting the partition's local table. The overhead of scanning empty tables should be minimal relative to the ease of implementation this uniform approach affords (especially if Postgres maintains accurate statistics, which it should, given that the tables are immutable between partition rebalancings).
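The per-worker DDL implied by this scheme can be sketched by generating the SQL as strings; the table/view naming convention (`rel_pNNN`) is invented for illustration:

```python
NUM_PARTITIONS = 256  # illustrative value

def partition_table(relation: str, p: int) -> str:
    # One Postgres table per partition, present on every worker.
    return f"{relation}_p{p:03d}"

def create_view_sql(relation: str) -> str:
    # Identical on every worker: the union of all partition tables,
    # including the (empty) tables for partitions this worker does not host.
    selects = " UNION ALL ".join(
        f"SELECT * FROM {partition_table(relation, p)}" for p in range(NUM_PARTITIONS)
    )
    return f"CREATE VIEW {relation} AS {selects};"

def evict_partition_sql(relation: str, p: int) -> str:
    # Moving a partition off a worker reduces to truncating its local table.
    return f"TRUNCATE TABLE {partition_table(relation, p)};"

def load_partition_sql(relation: str, p: int) -> str:
    # Moving a partition onto a worker could use COPY into its local table.
    return f"COPY {partition_table(relation, p)} FROM STDIN;"
```

Because the view definition is identical everywhere, partition movement never requires DDL changes, only TRUNCATE and COPY against the fixed set of tables.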

Changes to Myria data operations

Ingesting a new Relation

When a new Relation is ingested (e.g., via MyriaL IMPORT and STORE), we will create NUM_PARTITIONS new Postgres tables in the Postgres database of each worker in the cluster, each table named for its corresponding partition ID. We will additionally create a new Postgres view in each worker's database consisting of the UNION of all partition tables for the relation. We will then create an initial mapping of partitions to workers in the catalog, based on the number of bytes/tuples in the relation. Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given PartitionFunction.

Deleting an existing Relation

When an existing Relation is deleted (e.g., through the DELETE REST API), we will first tombstone its metadata in the catalog in a single SQLite transaction, then delete the corresponding Postgres views and tables on each worker (in a local transaction, with idempotent retry globally), then delete its metadata from the catalog in a single SQLite transaction.
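The tombstone-then-delete sequence can be sketched as below, showing why a crashed deletion is safe to retry; the catalog and worker interfaces are hypothetical stand-ins, not Myria APIs:

```python
NUM_PARTITIONS = 256  # illustrative value

def delete_relation(catalog, workers, relation: str) -> None:
    # 1. Tombstone in a single catalog (SQLite) transaction: the relation is
    #    now logically gone even if we crash before physical cleanup finishes.
    catalog.mark_tombstoned(relation)
    # 2. Physical cleanup on each worker, in a local transaction per worker.
    #    DROP ... IF EXISTS makes the global retry idempotent.
    for w in workers:
        w.execute(f"DROP VIEW IF EXISTS {relation};")
        for p in range(NUM_PARTITIONS):
            w.execute(f"DROP TABLE IF EXISTS {relation}_p{p:03d};")
    # 3. Finally remove the metadata in a single catalog transaction.
    catalog.remove_metadata(relation)
```

If the coordinator crashes between steps 1 and 3, the tombstone tells the retry logic to redo step 2 (harmless, since the DDL is idempotent) and then complete step 3.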

New Myria cluster membership operations

AddWorkers

When a new node is added to the cluster, we will create a fixed number of new workers (e.g., one for each CPU core on the node). Then we will query the catalog and create new tables and views on each new worker for each existing relation, as described above. Initially we could force the user to manually rebalance relations, e.g., by issuing SCAN and STORE commands from MyriaL. Later we could automate the rebalancing, carefully considering (and testing!) its impact on running queries (but as noted above, all writes will be confined to the new node, which is not yet online, and all reads will be evenly distributed across the cluster).

RemoveWorkers

When an existing node is removed from the cluster, we must first move all its existing partitions to workers on other nodes (this of course inverts the I/O workload for adding a new node: reads are confined to the node being removed, but all nodes, in general, must accept writes). In principle, this cannot be made fault-tolerant without replication, but hosting Postgres tablespaces on persistent EBS volumes would probably give us acceptable durability, and backing up data to S3 would yield essentially 100% durability.
