Skip to content

Conversation

@sfc-gh-abozkurt
Copy link
Collaborator

@sfc-gh-abozkurt sfc-gh-abozkurt commented Nov 20, 2025

Preparation for #70.

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch from f6b7a6d to 126579e Compare November 20, 2025 10:51

extern PGDLLEXPORT int64 GetRemoteFileSize(char *path);
extern PGDLLEXPORT int64 GetRemoteParquetFileRowCount(char *path);
extern PGDLLEXPORT List *GetRemoteParquetColumnStats(char *path, List *leafFields);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant declaration

* GetForeignTablePath - get the path option for the foreign table.
*/
char *
GetForeignTablePath(Oid foreignTableId)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved from other place



IcebergCatalogType
GetIcebergCatalogType(Oid relationId)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved from other place

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch from 126579e to d447e61 Compare November 20, 2025 10:55
sfc-gh-okalaci
sfc-gh-okalaci previously approved these changes Nov 20, 2025
Copy link
Collaborator

@sfc-gh-okalaci sfc-gh-okalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit undecided whether we should add these functions or not. My rationale was would that cause any confusion in the code.

But then, seeing this in action, I think I like this better, we already have these concepts in the code, and this makes them even more explicit & better documented

{
currentMetadataPath = GetMetadataLocationFromExternalObjectStoreCatalogForTable(relationId);
}
else
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be more strict about this function, we shouldn't use it unless the table is an external iceberg table:


char *
GetIcebergExternalMetadataLocation(Oid relationId)
{
	IcebergCatalogType icebergCatalogType = GetIcebergCatalogType(relationId);

	if (icebergCatalogType == REST_CATALOG_READ_ONLY)
	{
		return GetMetadataLocationForRestCatalogForIcebergTable(relationId);
	}
	else if (icebergCatalogType == OBJECT_STORE_READ_ONLY)
	{
		return GetMetadataLocationFromExternalObjectStoreCatalogForTable(relationId);
	}
else if (icebergCatalogType == NOT_ICEBERG_TABLE)
{
	FindDataFormatAndCompression(tableType, path, options,
								 &sourceFormat, &sourceCompression);
if (sourceFormat == DATA_FORMAT_ICEBERG)
      return GetForeignTablePath(relationId);;
}

	ereport(ERROR, "expected ... not found..");
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, we might want to have another function on top that can get metadata location for any table type. I needed that recently. Maybe better to return NULL and let callers decide.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should check for FindDataFormatAndCompression and sourceFormat == DATA_FORMAT_ICEBERG. If not, we could return NULL instead of err, that's fine. In the worst case, that's useful for readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have Assert(IsAnyExternalIcebergTable(relationId)); But can expand with format checks as well

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think format check is still useful for two things: (a) make it a lot more clearer for the readers of the code on what class of tables this is (b) in the future, we might continue adding new table types -- even iceberg types -- so prevents any assumptions on that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, we might want to have another function on top that can get metadata location for any table type. I needed that recently. Maybe better to return NULL and let callers decide.

I'm still in favor of throwing error instead of returning NULL. There are like ~15 callers of this function, and otherwise we'd have to add if not null checks to all the callers.

If we need a caller which prefers NULL, then we could add errorIfMissing flag?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now handle all external iceberg table cases properly, and throw error for any other table (considered misuse).

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch 2 times, most recently from e42b446 to 7f03451 Compare November 20, 2025 13:26
{
currentMetadataPath = GetIcebergCatalogMetadataLocation(relationId, true);
}
char *currentMetadataPath = GetIcebergExternalMetadataLocation(relationId);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this is wrong. GetIcebergCatalogMetadataLocation is equivalent to Internal iceberg table. So, we should

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was typo which caused the test failure :)

@sfc-gh-okalaci sfc-gh-okalaci dismissed their stale review November 20, 2025 13:34

Given many changes happened, let me review again before merging

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch 3 times, most recently from a33ede2 to 407d6b6 Compare November 20, 2025 14:51
return;
}

if (!IsAnyInternalIcebergTable(relationId))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was missing, we should only register missing fields for internal iceberg tables.



/*
* IsAnyInternalIcebergTable - check if the table is an internal iceberg table (written by us).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, do we really need Any in these function names? I'd be fine with IsInternalIcebergTable etc

List *
CreatePostgresColumnMappingsForIcebergTableFromExternalMetadata(Oid relationId)
{
char *currentMetadataPath = NULL;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in an ideal world, we should add an assert IsAnyExternalIcebergTable(relationId) for being consistent with the function name & usage. But, that'd break VacuumRegisterMissingFields. VacuumRegisterMissingFields is really a special case, where we are dealing with an internal iceberg table, but we access its metadata.json for schema purposes.

So, I don't have a good suggestions here but if you can come up with something better, please do so,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think no assert is fine. We get metadata location for any table, and operate on the metadata.json.

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch from 407d6b6 to 90f7ecc Compare November 21, 2025 03:07
@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch from 90f7ecc to 8b40701 Compare November 21, 2025 09:48
DefElem *writableOption = GetOption(options, "writable");
PgLakeTableType tableType = GetPgLakeTableType(foreignTableId);

return tableType == PG_LAKE_ICEBERG_TABLE_TYPE ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, it could be tableType == PG_LAKE_ICEBERG_TABLE_TYPE but then also REST_CATALOG_READ_ONLY? Should we add tableType == PG_LAKE_ICEBERG_TABLE_TYPE && IsInternalIcebergTable || ...?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, not related to this PR but needs the change because we also have writable object_store and rest catalog tables.

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch 2 times, most recently from 22c7548 to 8b9897a Compare November 22, 2025 16:59
@sfc-gh-abozkurt sfc-gh-abozkurt changed the title Utils to differentiate internal and external tables Utils to differentiate internal-external and writable-readonly tables Nov 22, 2025
Copy link
Collaborator

@sfc-gh-okalaci sfc-gh-okalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few questions/notes

if (PgLakeCopyValidityCheckHook)
PgLakeCopyValidityCheckHook(relationId);

ErrorIfReadOnlyIcebergTable(relationId);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we remove this line any everything continues to work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use correct check at postgresIsForeignRelUpdatable in pg_lake_table.c.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we call that for pushdownable COPY cases (Iceberg from URL)?

Copy link
Collaborator Author

@sfc-gh-abozkurt sfc-gh-abozkurt Nov 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uh sorry for confusion, we do not push down such cases at IsCopyFromPushdownable by correct checks. Afterwards, we hit into postgresIsForeignRelUpdatable as fallback.

default:
return NULL;
Assert(IsExternalIcebergTable(relationId));
return GetDataFileSchemaForExternalIcebergTable(path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if path is NULL? We are overlooking some of the existing cases here?

Same applies to GetLeafFieldsForTable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be as GetIcebergMetadataLocation throws error in such cases. (same as before)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetIcebergMetadataLocation can be NULL after #68 for writable rest tables.

* throws an error if it is.
*/
void
ErrorIfReadOnlyIcebergTable(Oid relationId)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot quite follow why we remove these functions? It was nicely documented functions, all in a single place.

Now, we are scattering this logic around, having multiple copies, and the benefit is minor improvement in the error messages?

I think if you want to change anything about this ErrorIf or Warn.. functions that they should definitely be a separate PR.

This PR has become a looooot larger than I thought, and this part seems arbitrarily added to the PR.

Can you please get these functions back?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need those in any places that will hit into table's foreign data wrapper. Instead, we hit into error via postgresIsForeignRelUpdatable check. This makes the things easier to maintain. Alter ddl hooks and truncate are the only 2 places that we need to add such errors because we did not hit into postgresIsForeignRelUpdatable check for those.

@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch 2 times, most recently from 1b05eb3 to e3858c7 Compare November 24, 2025 20:37
return;
}

if (!IsInternalIcebergTable(relationId))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vacuum should only register field for internal tables (that are in the internal tables catalog).

Copy link
Collaborator

@sfc-gh-okalaci sfc-gh-okalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, cleans up the utils that have been incrementally added, gets them into a nicer shape and less bug/error-prone.

if (!IsIcebergTable(relationId))
return NULL;

char *path = GetIcebergMetadataLocation(relationId, false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: GetIcebergMetadataLocation can go into else part

default:
return NULL;
Assert(IsExternalIcebergTable(relationId));
return GetDataFileSchemaForExternalIcebergTable(path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetIcebergMetadataLocation can be NULL after #68 for writable rest tables.

if (EnableHeavyAsserts && tableType == PG_LAKE_ICEBERG_TABLE_TYPE &&
!(GetIcebergCatalogType(relationId) == REST_CATALOG_READ_ONLY ||
GetIcebergCatalogType(relationId) == OBJECT_STORE_READ_ONLY))
if (EnableHeavyAsserts && IsInternalIcebergTable(relationId))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

soon with #68 we'd end up undoing this change, we cannot do heavy asserts for writable-rest at this point.

Should we keep here as is maybe?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets handle and comment about it in #68 to be consistent in this PR.

* We cannot afford to have a different schema between the
* Postgres catalogs and the iceberg catalog.
*/
if (catalogType == REST_CATALOG_READ_ONLY ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marco once asked why do we treat DATA_FORMAT_ICEBERG differently for ErrorIfSchemasDoNotMatch. We could enforce the same for DATA_FORMAT_ICEBERG

Though, let's not do it in this PR, mostly sharing some context.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I also though like that but schema check had failure for some of our tests. We need to dig into it in another PR.

Signed-off-by: Aykut Bozkurt <[email protected]>
@sfc-gh-abozkurt sfc-gh-abozkurt force-pushed the aykut/refactor-internal-external branch from e3858c7 to 0d01cfc Compare November 25, 2025 20:37
@sfc-gh-abozkurt sfc-gh-abozkurt merged commit 9590c14 into main Nov 25, 2025
63 checks passed
@sfc-gh-abozkurt sfc-gh-abozkurt deleted the aykut/refactor-internal-external branch November 25, 2025 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants