mkeeter commented Mar 6, 2024

Right now, a significant amount of work in the Crucible Downstairs is simply hashing data.

This hashing takes up about 17% of our flamegraph, and it all happens in the do_work task (dw_task):

[flamegraph screenshot]

This PR moves that hash checking off-thread onto the rayon thread pool, which is the usual home for this kind of CPU-bound blocking work:

    ┌──────────┐              ┌───────────┐
    │FramedRead│              │FramedWrite│
    └────┬─────┘              └─────▲─────┘
         │                          │
         │         ┌────────────────┴────────────────────┐
         │         │         framed_write_task           │
         │         └─▲─────▲──────────────────▲──────────┘
         │           │     │                  │
         │       ping│     │invalid           │
         │  ┌────────┘     │frame             │responses
         │  │              │errors            │
         │  │              │                  │
    ┌────▼──┴─┐ message   ┌┴──────┐  job     ┌┴────────┐
    │resp_loop├──────────►│pf_task├─────────►│ dw_task │
    └──┬───▲──┘ channel   └──┬────┘ channel  └▲────────┘
       │   │                 │                │
  defer│   │oneshot          │                │
      ┌▼───┴┐                │                │
      │rayon│             add│work         new│work
      └─────┘                │                │
                             │                │
    per-connection           │                │
   ========================= │ ============== │ ===============
    shared state          ┌──▼────────────────┴────────────┐
                          │           Downstairs           │
                          └────────────────────────────────┘

The strategy is very similar to #1066 and #1089, and uses the same DeferredQueue data structure (moved to crucible_common):

  • When messages arrive, they are pushed onto the deferred queue if (1) they are writes or (2) there are other messages already in the deferred queue. This ensures that messages always come back in order, while not deferring unnecessarily (see the sketch after this list).
  • Deferred messages arrive (in order) back at resp_loop with all of their write metadata (e.g. hashes) precomputed. We then use those hashes in region_write_pre, instead of computing them in Region::region_write and ExtentInner::write.
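As a rough sketch of that deferral rule (this is not the real DeferredQueue in crucible_common; Message, Processed, and the placeholder hash are made up for illustration), assuming a simple FIFO of oneshot receivers:

    use std::collections::VecDeque;
    use tokio::sync::oneshot;

    enum Message {
        Write(Vec<Vec<u8>>), // blocks whose hashes we want to precompute
        Other(String),       // everything else
    }

    enum Processed {
        Write { hashes: Vec<u64> },
        Other(String),
    }

    #[derive(Default)]
    struct DeferredQueue {
        pending: VecDeque<oneshot::Receiver<Processed>>,
    }

    impl DeferredQueue {
        /// Defer a message if it is a write, or if anything is already
        /// deferred (so it cannot overtake earlier deferred work);
        /// otherwise process it inline and return the result immediately.
        fn push(&mut self, msg: Message) -> Option<Processed> {
            let must_defer = matches!(msg, Message::Write(_)) || !self.pending.is_empty();
            if !must_defer {
                return Some(match msg {
                    Message::Other(s) => Processed::Other(s),
                    Message::Write(_) => unreachable!(),
                });
            }
            let (tx, rx) = oneshot::channel();
            self.pending.push_back(rx);
            rayon::spawn(move || {
                let out = match msg {
                    Message::Write(blocks) => Processed::Write {
                        // Placeholder for the real integrity hash
                        hashes: blocks.iter().map(|b| b.len() as u64).collect(),
                    },
                    Message::Other(s) => Processed::Other(s),
                };
                let _ = tx.send(out);
            });
            None
        }

        /// Pop the oldest deferred result, awaiting it so that results
        /// come back in the same order the messages arrived.
        async fn next(&mut self) -> Option<Processed> {
            let rx = self.pending.pop_front()?;
            rx.await.ok()
        }
    }

In this shape, resp_loop calls push() on every incoming message, handles any immediately-returned result inline, and otherwise drains results in order via next().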

This is a roughly 10% speedup for large writes:

1M WRITE: bw=809MiB/s (848MB/s), 809MiB/s-809MiB/s (848MB/s-848MB/s), io=47.5GiB (51.0GB), run=60129-60129msec
4K WRITE: bw=22.8MiB/s (23.9MB/s), 22.8MiB/s-22.8MiB/s (23.9MB/s-23.9MB/s), io=1367MiB (1433MB), run=60010-60010msec
1M WRITE: bw=808MiB/s (847MB/s), 808MiB/s-808MiB/s (847MB/s-847MB/s), io=47.4GiB (50.9GB), run=60046-60046msec
4M WRITE: bw=816MiB/s (855MB/s), 816MiB/s-816MiB/s (855MB/s-855MB/s), io=47.9GiB (51.4GB), run=60098-60098msec

Previously, I was seeing numbers in the 700-ish range, e.g.

1M WRITE: bw=725MiB/s (760MB/s), 725MiB/s-725MiB/s (760MB/s-760MB/s), io=42.6GiB (45.7GB), run=60134-60134msec
4K WRITE: bw=20.1MiB/s (21.1MB/s), 20.1MiB/s-20.1MiB/s (21.1MB/s-21.1MB/s), io=1208MiB (1266MB), run=60027-60027msec
1M WRITE: bw=720MiB/s (755MB/s), 720MiB/s-720MiB/s (755MB/s-755MB/s), io=42.2GiB (45.3GB), run=60052-60052msec
4M WRITE: bw=716MiB/s (751MB/s), 716MiB/s-716MiB/s (751MB/s-751MB/s), io=42.0GiB (45.1GB), run=60099-60099msec
