Horizontal scaling and distributed sharding implementation for ThemisDB v1.4+: pluggable consensus algorithms (Raft, Gossip, Multi-Paxos), cross-shard SAGA transactions, automatic shard rebalancing, and the ShardRepairEngine for a self-healing shard topology.
In scope: Hash-based and range-based shard routing, pluggable consensus (Raft, Gossip, Paxos), cross-shard SAGA transactions, shard rebalancing and repair, virtual node management.
Out of scope: Data replication at the storage layer (handled by replication module), network transport (handled by rpc module), query planning (handled by aql module).
- shard_manager.cpp – shard topology and routing management
- consensus_factory.cpp – runtime consensus algorithm selection (Raft/Gossip/Paxos)
- cross_shard_transaction_coordinator.cpp – cross-shard SAGA/2PC/3PC transactions
- shard_repair_engine.cpp – self-healing shard repair and rebalancing
Maturity: 🟡 Beta — Pluggable consensus (Raft/Gossip/Paxos), cross-shard transactions, ShardRepairEngine operational; full RPC integration and Paxos state persistence in progress.
- Shard manager and topology
- Data distribution strategies
- Consistent hashing
- Shard rebalancing
- TrueTime integration for global consistency
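Consistent hashing underlies both shard assignment and rebalancing: every shard owns several virtual nodes on a hash ring, and a key routes to the first virtual node at or after its own hash. The following is a minimal sketch of the idea (hypothetical class and method names, not ThemisDB's actual API):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hash ring: each physical shard owns several
// virtual nodes; a key routes to the first vnode at or after its hash,
// wrapping around at the end of the ring.
class HashRing {
public:
    void addShard(const std::string& shard, int vnodes = 16) {
        for (int i = 0; i < vnodes; ++i)
            ring_[hash_(shard + "#" + std::to_string(i))] = shard;
    }
    void removeShard(const std::string& shard) {
        for (auto it = ring_.begin(); it != ring_.end();) {
            if (it->second == shard) it = ring_.erase(it);
            else ++it;
        }
    }
    std::string route(const std::string& key) const {
        auto it = ring_.lower_bound(hash_(key));
        if (it == ring_.end()) it = ring_.begin();  // wrap around
        return it->second;
    }
private:
    std::map<std::size_t, std::string> ring_;  // vnode hash -> shard id
    std::hash<std::string> hash_;
};
```

The property that makes this attractive for rebalancing: adding or removing a shard only remaps the keys whose hashes fall in that shard's vnode ranges, so only a small fraction of data moves.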
- Consensus Module Interface - Abstract interface for pluggable consensus
- Raft Consensus Adapter - Adapter for existing Raft implementation
- Gossip Consensus Adapter - Adapter for Gossip protocol
- Paxos Consensus - New Multi-Paxos implementation
- Consensus Factory - Runtime consensus selection
- Cross-Shard Transaction Coordinator - Pluggable transaction protocols
  - Two-Phase Commit (2PC)
  - Three-Phase Commit (3PC)
  - SAGA (compensating transactions)
  - Percolator (optimistic concurrency)
- Distributed Deadlock Detection
- Snapshot Isolation across shards
- Metadata Shard - Horizontally partitioned metadata
- Metadata Shard Router - Consistent hashing for metadata routing
- Partitioned by type: SCHEMA, INDEX, SHARD_MAP, TRANSACTION_LOG, etc.
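Among the transaction protocols above, SAGA avoids global lock-holding commits by running a sequence of local transactions, each paired with a compensating action that semantically undoes it if a later step fails. A simplified sketch of the control flow (hypothetical types, not the coordinator's real interface):

```cpp
#include <functional>
#include <vector>

// A SAGA step: a local transaction on one shard plus its compensation.
struct SagaStep {
    std::function<bool()> action;      // local commit; false = failed
    std::function<void()> compensate;  // semantic undo of action
};

// Run steps in order; on failure, compensate completed steps in reverse.
bool runSaga(const std::vector<SagaStep>& steps) {
    std::vector<const SagaStep*> done;
    for (const auto& s : steps) {
        if (!s.action()) {
            for (auto it = done.rbegin(); it != done.rend(); ++it)
                (*it)->compensate();
            return false;  // saga aborted, all effects compensated
        }
        done.push_back(&s);
    }
    return true;  // saga committed
}
```

Note the trade-off this illustrates: a SAGA gives atomicity only eventually and provides no isolation between steps, which is one reason the coordinator also offers 2PC/3PC and Percolator.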
- ShardRepairEngine (include/sharding/shard_repair_engine.h) – automated self-healing for Parity (RAID-5/6) and Mirror shard setups.
- Improved Reed-Solomon Decoder – Vandermonde matrix-based erasure recovery supporting up to parity_shards simultaneous chunk failures (previously limited to 1).
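Erasure recovery rests on the same idea at any parity count: parity chunks are linear combinations of data chunks, so a lost chunk can be re-derived from the survivors. The full decoder inverts a Vandermonde matrix over GF(2^8) to tolerate multiple failures; the single-parity (RAID-5) case degenerates to plain XOR, sketched below purely as an illustration (not the actual decoder code):

```cpp
#include <cstdint>
#include <vector>

using Chunk = std::vector<std::uint8_t>;

// XOR one equal-length chunk into an accumulator (addition in GF(2)).
static void xorInto(Chunk& acc, const Chunk& c) {
    for (std::size_t i = 0; i < acc.size(); ++i) acc[i] ^= c[i];
}

// Single parity chunk = XOR of all data chunks.
Chunk makeParity(const std::vector<Chunk>& data) {
    Chunk p(data.front().size(), 0);
    for (const auto& d : data) xorInto(p, d);
    return p;
}

// Recover one missing data chunk: XOR parity with all surviving chunks.
Chunk recover(const std::vector<Chunk>& survivors, const Chunk& parity) {
    Chunk lost = parity;
    for (const auto& s : survivors) xorInto(lost, s);
    return lost;
}
```

With k parity chunks generated from k independent Vandermonde rows, the same recovery step generalizes to solving a small linear system, which is what allows up to parity_shards simultaneous chunk failures.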
| Capability | Details |
|---|---|
| Background scan | Configurable periodic anti-entropy scan across all shards |
| Auto-repair | Degraded documents detected during scan are queued for recovery |
| On-demand triggers | triggerRepair(shard_id), triggerFullScan(), triggerDocumentRepair(doc_id) |
| Per-shard health | ShardHealthReport with status enum: HEALTHY / DEGRADED / FAILED / REBUILDING |
| Job tracking | Every trigger returns a job_id; getJobStatus(job_id) polls progress |
| Prometheus metrics | exportPrometheusMetrics() or via ShardingMetricsHandler::getRepairMetrics() |
| Metric | Type | Description |
|---|---|---|
| themis_shard_repair_scans_total | counter | Anti-entropy scans performed |
| themis_shard_repair_attempts_total | counter | Repair attempts |
| themis_shard_repair_successes_total | counter | Successful repairs |
| themis_shard_repair_failures_total | counter | Failed repair attempts |
| themis_shard_repair_documents_scanned_total | counter | Documents checked |
| themis_shard_repair_avg_duration_ms | gauge | Rolling average repair time (ms) |
| themis_shard_health{shard="..."} | gauge | Per-shard health (0–3) |
| themis_shard_degraded_documents{shard="..."} | gauge | Degraded document count |
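A scrape of these metrics would follow the standard Prometheus text exposition format. An illustrative sample (values and shard label are made up):

```
# TYPE themis_shard_repair_scans_total counter
themis_shard_repair_scans_total 42
# TYPE themis_shard_health gauge
themis_shard_health{shard="shard_3"} 1
# TYPE themis_shard_degraded_documents gauge
themis_shard_degraded_documents{shard="shard_3"} 7
```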
| Method | Path | Description |
|---|---|---|
| POST | /admin/repair | Trigger repair (body: {"shard_id":"..."} or {} for all) |
| POST | /admin/repair/scan | Trigger full anti-entropy scan |
| GET | /admin/repair/{job_id} | Poll repair job status |
```cpp
#include "sharding/shard_repair_engine.h"

themis::sharding::RepairConfig cfg;
cfg.scan_interval = std::chrono::seconds(300);  // scan every 5 min
cfg.enable_auto_repair = true;

auto engine = std::make_shared<themis::sharding::ShardRepairEngine>(
    cfg, strategy, ring, topology, read_handler, write_handler);

// Provide document list so the scanner knows what to check
engine->setDocumentListProvider([](const std::string& shard_id) {
    return myStorage.listDocuments(shard_id);
});
engine->start();

// On-demand repair
std::string job_id = engine->triggerRepair("shard_3");
auto status = engine->getJobStatus(job_id);

// Wire up to existing Prometheus scrape endpoint
metricsHandler->setRepairEngine(engine);
```

- Horizontal data partitioning
- Consistent hashing for shard assignment
- Dynamic shard rebalancing
- Metadata sharding prevents bottlenecks
- Pluggable consensus algorithms (Raft, Gossip, Paxos)
- Multiple transaction protocols (2PC, 3PC, SAGA, Percolator)
- ACID guarantees across multiple shards
- TrueTime-based external consistency
- Snapshot isolation for distributed reads
- Automatic failover with hot spares
- Partition detection and split-brain prevention
- Deadlock detection and resolution
- Multi-datacenter support
- Self-healing via ShardRepairEngine (v1.5+)
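Distributed deadlock detection, listed above, typically reduces to finding a cycle in the wait-for graph: transaction A waits on a lock held by B, B on C, and so on. A minimal DFS cycle check over an in-memory graph (illustrative only; the real detector aggregates wait-for edges collected across shards):

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Wait-for graph: an edge A -> B means transaction A waits on B.
using WaitForGraph =
    std::unordered_map<std::string, std::vector<std::string>>;

// DFS with a recursion stack: a back edge means a deadlock cycle.
static bool dfs(const WaitForGraph& g, const std::string& txn,
                std::unordered_set<std::string>& visited,
                std::unordered_set<std::string>& stack) {
    if (stack.count(txn)) return true;          // back edge: cycle found
    if (!visited.insert(txn).second) return false;  // already explored
    stack.insert(txn);
    auto it = g.find(txn);
    if (it != g.end())
        for (const auto& next : it->second)
            if (dfs(g, next, visited, stack)) return true;
    stack.erase(txn);
    return false;
}

bool hasDeadlock(const WaitForGraph& g) {
    std::unordered_set<std::string> visited, stack;
    for (const auto& entry : g)
        if (dfs(g, entry.first, visited, stack)) return true;
    return false;
}
```

Resolution then usually aborts one transaction on the detected cycle (for example, the youngest) so the others can proceed.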
- Pluggable consensus module architecture
- Raft, Gossip, and Paxos consensus implementations
- Cross-shard transaction coordinator
- Transaction protocol abstraction (2PC, 3PC, SAGA, Percolator)
- Deadlock detection framework
- Metadata sharding design
- Comprehensive documentation
- ShardRepairEngine – anti-entropy background scan + repair queue
- Vandermonde-based Reed-Solomon decoder – full multi-chunk recovery
- Prometheus metrics integration for repair health
- Admin API repair endpoints (POST /admin/repair, /admin/repair/scan, GET /admin/repair/{id})
- Full RPC integration for cross-shard operations
- Persistent state management for Paxos
- Complete metadata shard implementation
- Advanced query optimization
For comprehensive sharding documentation, see:
- Distributed Sharding Architecture - NEW v1.4!
- Consensus Module Architecture - NEW v1.4!
- Data Migration Guide - NEW v1.4!
- Distributed Transactions with 2PC
- Sharding Implementation Summary
- Sharding Phase 1 Report
- Sharding Phases 1-3 Summary
- Horizontal Scaling Strategy
See usage examples in the architecture documentation above.
- Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., & Lewin, D. (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC), 654–663. https://doi.org/10.1145/258533.258660
- DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: Amazon's Highly Available Key-Value Store. Proceedings of SOSP 2007, 205–220. https://doi.org/10.1145/1294261.1294281
- Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., … Woodford, D. (2013). Spanner: Google's Globally Distributed Database. ACM Transactions on Computer Systems, 31(3), 8:1–8:22. https://doi.org/10.1145/2491245
- Curino, C., Jones, E., Zhang, Y., & Madden, S. (2010). Schism: A Workload-Driven Approach to Database Replication and Partitioning. Proceedings of the VLDB Endowment, 3(1–2), 48–57. https://doi.org/10.14778/1920841.1920853