Parquet File Metadata caching implementation #541
base: project-antalya
Automated CI status comment for commit 5a7a8ad, updated for the latest CI run.
{
    if (!use_metadata_cache || !metadata_cache_key.length())
    {
        ProfileEvents::increment(ProfileEvents::ParquetMetaDataCacheMisses);
Do we really want the metadata cache miss metric to be incremented when use_metadata_cache=false?
I coded the increment in the first pass, but removed it so that users do not see a new, alarming "miss" metric if they are unaware of the in-memory cache feature or have disabled it (and are satisfied with the existing local disk caching feature). Let me know what you think.
I am sorry, I did not quite follow your explanation. As far as I can tell (by reading the code), whenever initializeIfNeeded() is called (and it is called once per file, IIRC), it will try to fetch the metadata either from the cache or from the input source.
If the cache is disabled, it will still increment the cache miss counter. That doesn't look right to me at first glance, but if there has been a discussion and this was the chosen design, that's OK.
Am I missing something?
Correct, I mixed up my revisions and reasoning. I have removed the increment if cache is disabled. Thanks!
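For reference, the behaviour agreed on in this thread can be sketched roughly as below. This is a minimal illustration, not the code from the PR: Counters, Metadata, getMetadata and load_from_source are hypothetical stand-ins, and the plain map is not thread-safe. The point is only that the disabled path returns before any counter is touched.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the ProfileEvents counters; not ClickHouse's real API.
struct Counters
{
    std::atomic<uint64_t> hits{0};
    std::atomic<uint64_t> misses{0};
};

using Metadata = std::string; // placeholder for the parsed Parquet footer
using MetadataPtr = std::shared_ptr<Metadata>;

// Returns metadata for `key`. Hits and misses are counted only when the cache is in use.
// Note: the map is unsynchronized; a real cache needs its own locking.
MetadataPtr getMetadata(
    bool use_metadata_cache,
    const std::string & key,
    std::unordered_map<std::string, MetadataPtr> & cache,
    Counters & counters,
    const std::function<MetadataPtr()> & load_from_source)
{
    assert(load_from_source); // a loader must be supplied

    // Cache disabled or no usable key: read from the source and touch no counters.
    if (!use_metadata_cache || key.empty())
        return load_from_source();

    auto it = cache.find(key);
    if (it != cache.end())
    {
        ++counters.hits;
        return it->second;
    }

    // Genuine miss: the cache was consulted and did not have the entry.
    ++counters.misses;
    MetadataPtr metadata = load_from_source();
    cache.emplace(key, metadata);
    return metadata;
}
```

With the cache disabled, repeated calls leave both counters at zero; with it enabled, the first call for a key counts one miss and later calls count hits.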
@@ -0,0 +1,28 @@
-- Tags: no-parallel, no-fasttest
Can you also add a few tests for parquet_metadata_cache_max_entries?
That would have to be an integration test, maybe with tens or hundreds of Parquet files. I can add it in another PR.
If local files also benefited from the metadata cache, an integration test wouldn't be needed, I suppose. But it doesn't look like we want to do that.
For local Parquet files, the OS file cache will be in effect.
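To make the requested max-entries test concrete, here is a rough sketch of the property such a test would check. It uses a toy entry-count-bounded cache and assumes least-recently-used eviction; BoundedCache is a hypothetical stand-in, not ClickHouse's CacheBase, and the real eviction policy may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy LRU cache bounded by entry count; a stand-in, not ClickHouse's CacheBase.
class BoundedCache
{
public:
    explicit BoundedCache(std::size_t max_entries_) : max_entries(max_entries_)
    {
        assert(max_entries > 0);
    }

    // Inserts or updates an entry, evicting the least recently used one if over the limit.
    void set(const std::string & key, std::string value)
    {
        auto it = index.find(key);
        if (it != index.end())
        {
            it->second->second = std::move(value);
            items.splice(items.begin(), items, it->second); // refresh recency
            return;
        }
        items.emplace_front(key, std::move(value));
        index[key] = items.begin();
        if (items.size() > max_entries)
        {
            index.erase(items.back().first); // drop the least recently used key
            items.pop_back();
        }
    }

    // Returns nullptr on a miss; a hit moves the entry to the front.
    const std::string * get(const std::string & key)
    {
        auto it = index.find(key);
        if (it == index.end())
            return nullptr;
        items.splice(items.begin(), items, it->second);
        return &it->second->second;
    }

    std::size_t size() const { return items.size(); }

private:
    using Item = std::pair<std::string, std::string>;
    std::size_t max_entries;
    std::list<Item> items; // front = most recently used
    std::unordered_map<std::string, std::list<Item>::iterator> index;
};
```

A test along these lines would fill the cache past its limit and then assert that the entry count never exceeds the configured maximum and that the oldest untouched key is no longer present.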
private:
    ParquetFileMetaDataCache(UInt64 max_cache_entries);
    static std::mutex mutex;
It appears to me that CacheBase already has a mutex: https://github.com/ClickHouse/ClickHouse/blob/94709d7a5a1ede65a9eed92ff9cbf73e28e62561/src/Common/CacheBase.h#L245.
Why do we need another one? And why is it static?
I think I am missing something. ParquetFileMetaDataCache is a static instance, shared among all instances. Why do you need to make this mutex static as well?
Yes, it is the basic double-checked locking pattern for singleton initialization.
I will replace this with the modern equivalent that uses std::call_once.
Check my comment in https://github.com/Altinity/ClickHouse/pull/541/files#r1875950244. I might be missing something, though.
Done.
{
    std::lock_guard lock(mutex);
    static std::once_flag once;
    std::call_once(once, [&] {
OK, let's take a step back. IIRC, initialization of a function-local static variable in C++11 or later is guaranteed to happen only once, and it is thread-safe.
From the C++ standard:
such a variable is initialized the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. [...] If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.
So, wouldn't something like the following suffice?
static ParquetFileMetaDataCache instance(max_cache_entries);
return instance;
Done
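For readers following along, the function-local static approach suggested above looks roughly like this when written out in full. This is a minimal sketch under assumptions: MetadataCacheSketch and maxEntries() are hypothetical names, and the class holds no actual cache. It shows that neither an extra mutex nor std::call_once is needed for the one-time construction.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical singleton sketch; it only records the limit it was constructed with.
class MetadataCacheSketch
{
public:
    // The function-local static is constructed exactly once, even under concurrent
    // first calls (guaranteed since C++11). Only the first call's argument is used.
    static MetadataCacheSketch & instance(std::size_t max_cache_entries)
    {
        static MetadataCacheSketch cache(max_cache_entries);
        return cache;
    }

    std::size_t maxEntries() const { return max_entries; }

    MetadataCacheSketch(const MetadataCacheSketch &) = delete;
    MetadataCacheSketch & operator=(const MetadataCacheSketch &) = delete;

private:
    explicit MetadataCacheSketch(std::size_t max_cache_entries) : max_entries(max_cache_entries)
    {
        assert(max_cache_entries > 0);
    }

    std::size_t max_entries;
};
```

One consequence worth keeping in mind: a later call such as instance(5) after instance(100) returns the same object and silently ignores the new value, which fits a limit that cannot be changed at runtime.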
@@ -166,7 +166,8 @@ namespace DB
     M(String, mutation_workload, "default", "Name of workload to be used to access resources for all mutations (may be overridden by a merge tree setting)", 0) \
     M(Bool, prepare_system_log_tables_on_startup, false, "If true, ClickHouse creates all configured `system.*_log` tables before the startup. It can be helpful if some startup scripts depend on these tables.", 0) \
     M(UInt64, config_reload_interval_ms, 2000, "How often clickhouse will reload config and check for new changes", 0) \
-    M(Bool, disable_insertion_and_mutation, false, "Disable all insert/alter/delete queries. This setting will be enabled if someone needs read-only nodes to prevent insertion and mutation affect reading performance.", 0)
+    M(Bool, disable_insertion_and_mutation, false, "Disable all insert/alter/delete queries. This setting will be enabled if someone needs read-only nodes to prevent insertion and mutation affect reading performance.", 0) \
+    M(UInt64, input_format_parquet_metadata_cache_max_entries, 100000, "Maximum number of parquet file metadata to cache.", 0)
Turned it into a server setting; that makes more sense, as it can't be changed at runtime.
Implements Parquet Metadata caching.
More details and documentation in link.