Support writes during ingestion #400
base: 8.10.tikv
Conversation
Force-pushed from 7340dac to 82ad5dc
/cc @glorv @LykxSassinator @Connor1996 @v01dstar PTAL, thx~
@@ -2095,6 +2095,9 @@ struct IngestExternalFileOptions {
   //
   // ingest_behind takes precedence over fail_if_not_bottommost_level.
   bool fail_if_not_bottommost_level = false;
+  // Set to TRUE if user wants to allow writes to the DB during ingestion.
+  // User must ensure no writes overlap with the ingested data.
+  bool allow_write = false;
allow_write is a little confusing, how about allow_foreground_write?
This naming style aligns well with RocksDB’s conventions. In RocksDB, similar names, such as unordered_write and WaitForPendingWrites, typically use “write” to refer to foreground writes.
db/db_impl/db_impl.cc (outdated)

if (two_write_queues_) {
  nonmem_write_thread_.ExitUnbatched(&nonmem_w);
}
write_thread_.ExitUnbatched(&w);
Why not just skip L5856-5869 when writes are allowed?
+1
Even with allow_write = true, writes to the DB must be temporarily stopped to wait for pending writes. This is because allow_write = true only requires users to ensure no concurrent writes overlap with the ingestion data and does not require ensuring no overlapping unordered_write before ingestion.
@v01dstar @Connor1996 PTAL again, thx~
Or we can skip lines L5856-5869 and treat unfinished unordered_write as concurrent writes. Users need to ensure unordered_writes are finished before ingestion. This approach might better align with the original intent of allow_write, as it offers better performance when users can guarantee no concurrent writes. TiKV seems to be able to guarantee this. @v01dstar @Connor1996 What do you think of this?
I think, in theory, we can ignore those preceding-queued-unfinished foreground writes.
However, I don't have absolute confidence, since this change makes many assumptions. Just a side note: we probably need to test it against the Jepsen test suite.
I am now confident that in all current TiKV scenarios where allow_write is enabled, ongoing writes are always waited on to finish before ingestion. Therefore, I updated the code to no longer wait for pending writes.
This conclusion is supported by the many tests added directly to test_ingest_sst.rs in this TiKV PR (tikv/tikv#18096).
Force-pushed from 6a859b7 to b695598
bool allow_write = args[0].options.allow_write;
for (const auto& arg : args) {
  if (arg.options.allow_write != allow_write) {
    return Status::InvalidArgument(
Better to fall back to the default behavior if any of these options is false. Returning an error here will cause TiKV to panic in some code paths.
I think it is not necessary. Firstly, from RocksDB's perspective, the checks should be as strict as possible. Secondly, TiKV does not currently use the IngestExternalFiles interface directly; it only uses IngestExternalFile, which calls IngestExternalFiles with a single arg, i.e., only one allow_write. Therefore, this would not affect TiKV.
lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: v01dstar
It occurs to me that Titan GC and compaction filter GC may write back some kvs.
Consider the case where a peer exists on this store, is then moved to other stores, and is later moved back to this store again.
Titan GC would be okay, as it won't write back any overlapping key after the peer is destroyed. However, compaction filter GC may perform some deletions at the same time. Please consider the compaction filter GC case as well.
@Connor1996 How does Titan GC avoid writing back any overlapping keys after the peer is destroyed?
All keys in that range are deleted when the peer is destroyed, and Titan GC finds those keys are deleted and just skips them when generating the new blob file.
Titan GC wouldn't write back overlapping keys. For a key to be written back, it needs to (1) exist in the current view, and (2) have a value in the LSM tree that is a reference to the blob store, with the reference pointing to the exact location of the to-be-examined blob key. So, when the ingest happens, if Titan GC reads the moved-back kv, it will see that the value is not a reference, which fails requirement 2. If Titan GC does not read the moved-back KV, it will get a
Force-pushed from c11f98d to ad98a17
Signed-off-by: hhwyt <[email protected]>
Force-pushed from ad98a17 to ff3e293
Signed-off-by: hhwyt <[email protected]>
Force-pushed from e2c41b1 to 40e0131
This PR adds an option allow_write to IngestExternalFileOptions, enabling writes to the database during the ingestion process. More details at tikv/tikv#18096.