Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: improve flush upload queue parallelism #10096

Open
Tracked by #9624
erikgrinaker opened this issue Dec 11, 2024 · 3 comments · May be fixed by #10218 or #10144
Open
Tracked by #9624

pageserver: improve flush upload queue parallelism #10096

erikgrinaker opened this issue Dec 11, 2024 · 3 comments · May be fixed by #10218 or #10144
Assignees
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver

Comments

@erikgrinaker
Copy link
Contributor

Currently, when flushing delta layers, we upload the layer and then update the index. But index updates act as an upload queue parallelism barrier. This means that we're effectively uploading layers one at a time.

Instead, we should flush a batch of layer files (something like 100-1000 MB) for each index update, allowing us to upload these in parallel.

Also note that the flush loop backpressure from #8550 will also prevent parallelism. We'll need to remove this first, see #10095.

@erikgrinaker erikgrinaker added a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver labels Dec 11, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 11, 2024
@skyzh
Copy link
Member

skyzh commented Dec 11, 2024

In theory if we put multiple upload tasks into the queue, they will be fired in parallel.

let can_run_now = match next_op {
UploadOp::UploadLayer(..) => {
// Can always be scheduled.
true
}
UploadOp::UploadMetadata { .. } => {
// These can only be performed after all the preceding operations
// have finished.
upload_queue.inprogress_tasks.is_empty()
}
UploadOp::Delete(..) => {
// Wait for preceding uploads to finish. Concurrent deletions are OK, though.
upload_queue.num_inprogress_deletions == upload_queue.inprogress_tasks.len()
}
UploadOp::Barrier(_) | UploadOp::Shutdown => {
upload_queue.inprogress_tasks.is_empty()
}
};

I think compaction already utilized this code path?

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 11, 2024

Yes. The problem is that when we flush a delta layer, we schedule both an UploadLayer and an UploadMetadata for every layer sequentially. As you can see in the code you quoted, UploadMetadata acts as a barrier for parallel uploads, since it waits for all preceding layers to finish uploading.

We want to flush a batch of layers and schedule UploadLayer for them all, and then schedule an UploadMetadata barrier for the entire batch.

for layer in layers_to_upload {
self.remote_client.schedule_layer_file_upload(layer)?;
}
self.remote_client
.schedule_index_upload_for_metadata_update(&update)?;

// Schedule remote uploads that will reflect our new disk_consistent_lsn
self.schedule_uploads(disk_consistent_lsn, layers_to_upload)
.map_err(|e| FlushLayerError::from_anyhow(self, e))?;

// Normal case, write out a L0 delta layer file.
// `create_delta_layer` will not modify the layer map.
// We will remove frozen layer and add delta layer in one atomic operation later.
let Some(layer) = self
.create_delta_layer(&frozen_layer, None, ctx)
.await
.map_err(|e| FlushLayerError::from_anyhow(self, e))?
else {
panic!("delta layer cannot be empty if no filter is applied");
};
(
// FIXME: even though we have a single image and single delta layer assumption
// we push them to vec
vec![layer.clone()],
Some(layer),
)

@erikgrinaker erikgrinaker linked a pull request Dec 13, 2024 that will close this issue
@erikgrinaker
Copy link
Contributor Author

There's a prototype PR for deferred index uploads during flush in #10144. However, this approach has a fair number of issues -- we have to wait for "some time" in case further layers are flushed, and this both breaks caller expectations that index uploads are scheduled immediately, and it can still cause index barriers if the previous index isn't uploaded before the next layer comes in.

Instead, let's try to reorder layer uploads ahead of index uploads in the upload queue:

// If we cannot launch this task, don't look any further.
//
// In some cases, we could let some non-frontmost tasks to "jump the queue" and launch
// them now, but we don't try to do that currently. For example, if the frontmost task
// is an index-file upload that cannot proceed until preceding uploads have finished, we
// could still start layer uploads that were scheduled later.
if !can_run_now {
break;
}

A few things to keep in mind:

  • We want to upload indexes as soon as the necessary layers are uploaded.
  • We want to coalesce indexes where we can.
  • We don't want to reorder operations on the same layer filename.
  • We don't want to reorder wrt. deletes.

github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
erikgrinaker added a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
## Problem

It's not legal to modify layers that are referenced by the current layer
index. Assert this in the upload queue, as preparation for upload queue
reordering.

Touches #10096.

## Summary of changes

Add a debug assertion that the upload queue does not modify layers
referenced by the current index.

I could be convinced that this should be a plain assertion, but will be
conservative for now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver
Projects
None yet
2 participants