-
Notifications
You must be signed in to change notification settings - Fork 58
Utils to differentiate internal-external and writable-readonly tables #73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f6b7a6d to
126579e
Compare
|
|
||
| extern PGDLLEXPORT int64 GetRemoteFileSize(char *path); | ||
| extern PGDLLEXPORT int64 GetRemoteParquetFileRowCount(char *path); | ||
| extern PGDLLEXPORT List *GetRemoteParquetColumnStats(char *path, List *leafFields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant declaration
| * GetForeignTablePath - get the path option for the foreign table. | ||
| */ | ||
| char * | ||
| GetForeignTablePath(Oid foreignTableId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved from other place
|
|
||
|
|
||
| IcebergCatalogType | ||
| GetIcebergCatalogType(Oid relationId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved from other place
126579e to
d447e61
Compare
sfc-gh-okalaci
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a bit undecided whether we should add these functions or not. My rationale was would that cause any confusion in the code.
But then, seeing this in action, I think I like this better, we already have these concepts in the code, and this makes them even more explicit & better documented
| { | ||
| currentMetadataPath = GetMetadataLocationFromExternalObjectStoreCatalogForTable(relationId); | ||
| } | ||
| else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should be more strict about this function, we shouldn't use it unless the table is an external iceberg table:
char *
GetIcebergExternalMetadataLocation(Oid relationId)
{
IcebergCatalogType icebergCatalogType = GetIcebergCatalogType(relationId);
if (icebergCatalogType == REST_CATALOG_READ_ONLY)
{
return GetMetadataLocationForRestCatalogForIcebergTable(relationId);
}
else if (icebergCatalogType == OBJECT_STORE_READ_ONLY)
{
return GetMetadataLocationFromExternalObjectStoreCatalogForTable(relationId);
}
else if (icebergCatalogType == NOT_ICEBERG_TABLE)
{
FindDataFormatAndCompression(tableType, path, options,
&sourceFormat, &sourceCompression);
if (sourceFormat == DATA_FORMAT_ICEBERG)
return GetForeignTablePath(relationId);;
}
ereport(ERROR, "expected ... not found..");
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, we might want to have another function on top that can get metadata location for any table type. I needed that recently. Maybe better to return NULL and let callers decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should check for FindDataFormatAndCompression and sourceFormat == DATA_FORMAT_ICEBERG. If not, we could return NULL instead of err, that's fine. In the worst case, that's useful for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have Assert(IsAnyExternalIcebergTable(relationId)); But can expand with format checks as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think format check is still useful for two things: (a) make it a lot more clearer for the readers of the code on what class of tables this is (b) in the future, we might continue adding new table types -- even iceberg types -- so prevents any assumptions on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, we might want to have another function on top that can get metadata location for any table type. I needed that recently. Maybe better to return NULL and let callers decide.
I'm still in favor of throwing error instead of returning NULL. There are like ~15 callers of this function, and otherwise we'd have to add if not null checks to all the callers.
If we need a caller which prefers NULL, then we could add errorIfMissing flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now handle all external iceberg table cases properly, and throw error for any other table (considered misuse).
e42b446 to
7f03451
Compare
| { | ||
| currentMetadataPath = GetIcebergCatalogMetadataLocation(relationId, true); | ||
| } | ||
| char *currentMetadataPath = GetIcebergExternalMetadataLocation(relationId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that this is wrong. GetIcebergCatalogMetadataLocation is equivalent to Internal iceberg table. So, we should
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that was typo which caused the test failure :)
Given many changes happened, let me review again before merging
a33ede2 to
407d6b6
Compare
pg_lake_table/src/ddl/vacuum.c
Outdated
| return; | ||
| } | ||
|
|
||
| if (!IsAnyInternalIcebergTable(relationId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was missing, we should only register missing fields for internal iceberg tables.
pg_lake_engine/src/utils/rel_utils.c
Outdated
|
|
||
|
|
||
| /* | ||
| * IsAnyInternalIcebergTable - check if the table is an internal iceberg table (written by us). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, do we really need Any in these function names? I'd be fine with IsInternalIcebergTable etc
| List * | ||
| CreatePostgresColumnMappingsForIcebergTableFromExternalMetadata(Oid relationId) | ||
| { | ||
| char *currentMetadataPath = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in an ideal world, we should add an assert IsAnyExternalIcebergTable(relationId) for being consistent with the function name & usage. But, that'd break VacuumRegisterMissingFields. VacuumRegisterMissingFields is really a special case, where we are dealing with an internal iceberg table, but we access its metadata.json for schema purposes.
So, I don't have a good suggestions here but if you can come up with something better, please do so,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think no assert is fine. We get metadata location for any table, and operate on the metadata.json.
407d6b6 to
90f7ecc
Compare
90f7ecc to
8b40701
Compare
| DefElem *writableOption = GetOption(options, "writable"); | ||
| PgLakeTableType tableType = GetPgLakeTableType(foreignTableId); | ||
|
|
||
| return tableType == PG_LAKE_ICEBERG_TABLE_TYPE || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, it could be tableType == PG_LAKE_ICEBERG_TABLE_TYPE but then also REST_CATALOG_READ_ONLY? Should we add tableType == PG_LAKE_ICEBERG_TABLE_TYPE && IsInternalIcebergTable || ...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, not related to this PR but needs the change because we also have writable object_store and rest catalog tables.
22c7548 to
8b9897a
Compare
sfc-gh-okalaci
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few questions/notes
| if (PgLakeCopyValidityCheckHook) | ||
| PgLakeCopyValidityCheckHook(relationId); | ||
|
|
||
| ErrorIfReadOnlyIcebergTable(relationId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can we remove this line any everything continues to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use correct check at postgresIsForeignRelUpdatable in pg_lake_table.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we call that for pushdownable COPY cases (Iceberg from URL)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh sorry for confusion, we do not push down such cases at IsCopyFromPushdownable by correct checks. Afterwards, we hit into postgresIsForeignRelUpdatable as fallback.
| default: | ||
| return NULL; | ||
| Assert(IsExternalIcebergTable(relationId)); | ||
| return GetDataFileSchemaForExternalIcebergTable(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if path is NULL? We are overlooking some of the existing cases here?
Same applies to GetLeafFieldsForTable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It cannot be as GetIcebergMetadataLocation throws error in such cases. (same as before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetIcebergMetadataLocation can be NULL after #68 for writable rest tables.
| * throws an error if it is. | ||
| */ | ||
| void | ||
| ErrorIfReadOnlyIcebergTable(Oid relationId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cannot quite follow why we remove these functions? It was nicely documented functions, all in a single place.
Now, we are scattering this logic around, having multiple copies, and the benefit is minor improvement in the error messages?
I think if you want to change anything about this ErrorIf or Warn.. functions that they should definitely be a separate PR.
This PR has become a looooot larger than I thought, and this part seems arbitrarily added to the PR.
Can you please get these functions back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need those in any places that will hit into table's foreign data wrapper. Instead, we hit into error via postgresIsForeignRelUpdatable check. This makes the things easier to maintain. Alter ddl hooks and truncate are the only 2 places that we need to add such errors because we did not hit into postgresIsForeignRelUpdatable check for those.
1b05eb3 to
e3858c7
Compare
| return; | ||
| } | ||
|
|
||
| if (!IsInternalIcebergTable(relationId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vacuum should only register field for internal tables (that are in the internal tables catalog).
sfc-gh-okalaci
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, cleans up the utils that have been incrementally added, gets them into a nicer shape and less bug/error-prone.
| if (!IsIcebergTable(relationId)) | ||
| return NULL; | ||
|
|
||
| char *path = GetIcebergMetadataLocation(relationId, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: GetIcebergMetadataLocation can go into else part
| default: | ||
| return NULL; | ||
| Assert(IsExternalIcebergTable(relationId)); | ||
| return GetDataFileSchemaForExternalIcebergTable(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetIcebergMetadataLocation can be NULL after #68 for writable rest tables.
| if (EnableHeavyAsserts && tableType == PG_LAKE_ICEBERG_TABLE_TYPE && | ||
| !(GetIcebergCatalogType(relationId) == REST_CATALOG_READ_ONLY || | ||
| GetIcebergCatalogType(relationId) == OBJECT_STORE_READ_ONLY)) | ||
| if (EnableHeavyAsserts && IsInternalIcebergTable(relationId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
soon with #68 we'd end up undoing this change, we cannot do heavy asserts for writable-rest at this point.
Should we keep here as is maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets handle and comment about it in #68 to be consistent in this PR.
| * We cannot afford to have a different schema between the | ||
| * Postgres catalogs and the iceberg catalog. | ||
| */ | ||
| if (catalogType == REST_CATALOG_READ_ONLY || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marco once asked why do we treat DATA_FORMAT_ICEBERG differently for ErrorIfSchemasDoNotMatch. We could enforce the same for DATA_FORMAT_ICEBERG
Though, let's not do it in this PR, mostly sharing some context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I also though like that but schema check had failure for some of our tests. We need to dig into it in another PR.
Signed-off-by: Aykut Bozkurt <[email protected]>
e3858c7 to
0d01cfc
Compare
Preparation for #70.