
Table commit retries based on table properties #330

Open
wants to merge 1 commit into base: main
Conversation

Contributor
@Buktoria Buktoria commented Jan 30, 2024

Created a decorator which, when applied to a function that performs a commit on a table, retries the function if execution fails. It looks at the table properties to configure the retries.

  • Created a Decorator / Descriptor Class that can wrap a function and retry it using the Tenacity retry library
  • The class configures defaults based on the documented defaults found in the Iceberg docs https://iceberg.apache.org/docs/latest/configuration/#table-behavior-properties
    • commit.retry.num-retries
    • commit.retry.min-wait-ms
    • commit.retry.max-wait-ms
    • commit.retry.total-timeout-ms
  • Config is parsed from a configured "properties" attribute/property on the instance class that is accessed within the decorator at runtime
  • A separate function table_commit_retry is used to capture the name of the attribute on the caller that should be used when looking up table configs.
  • Access to the caller instance is obtained by implementing the __get__ method of the class (the descriptor protocol)
  • Un-parsable config will be ignored and defaults will be used
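The bullet points above can be sketched as a minimal, stdlib-only descriptor (the PR uses Tenacity for the actual retrying, and the real class lives in pyiceberg/table/__init__.py; the names `TableCommitRetry`, `to_int`, and the wait handling being omitted are simplifications for illustration):

```python
import functools


class CommitFailedException(Exception):  # stand-in for PyIceberg's exception
    pass


DEFAULTS = {  # documented Iceberg table-behavior defaults
    "commit.retry.num-retries": 4,
    "commit.retry.min-wait-ms": 100,
    "commit.retry.max-wait-ms": 60_000,
    "commit.retry.total-timeout-ms": 1_800_000,
}


def to_int(value, default):
    try:
        return int(value)
    except (TypeError, ValueError):
        return default  # un-parsable config is ignored; the default wins


class TableCommitRetry:
    """Descriptor wrapping a method; retries based on instance properties."""

    def __init__(self, func, properties_attr):
        self.func = func
        self.properties_attr = properties_attr

    def __get__(self, instance, owner=None):
        # __get__ hands us the caller instance, and thus its properties.
        props = getattr(instance, self.properties_attr, {})
        retries = to_int(props.get("commit.retry.num-retries"),
                         DEFAULTS["commit.retry.num-retries"])

        @functools.wraps(self.func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return self.func(instance, *args, **kwargs)
                except CommitFailedException:
                    if attempt == retries:
                        raise
                    # real code sleeps between min/max wait here

        return wrapper


def table_commit_retry(properties_attr):
    """Capture the attribute name holding table properties on the caller."""
    def decorator(func):
        return TableCommitRetry(func, properties_attr)
    return decorator


class Table:  # hypothetical usage
    properties = {"commit.retry.num-retries": "2"}

    def __init__(self):
        self.calls = 0

    @table_commit_retry("properties")
    def commit(self):
        self.calls += 1
        if self.calls < 3:
            raise CommitFailedException("conflict")
        return "committed"
```

Because the retry count is read inside `__get__`, a change to the table's properties takes effect on the very next call without re-decorating anything.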

Closes: #269

@Buktoria Buktoria force-pushed the vicky/catalog-commit-retries branch 3 times, most recently from edd1aad to af3f54b Compare January 31, 2024 16:14
@Buktoria Buktoria marked this pull request as ready for review January 31, 2024 16:25
pyiceberg/table/__init__.py — 3 resolved review threads (outdated)

def get_config(self, config: str, default: int) -> int:
"""Get config out of the properties."""
return self.to_int(self.table_properties.get(config, ""), default)
Contributor

If the key doesn't exist, we try to convert the empty string. How about throwing in some walrus (:=):

Suggested change
return self.to_int(self.table_properties.get(config, ""), default)
return self.to_int(value) if (value := self.table_properties.get(config)) else default
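For context, a self-contained sketch of the two lookups (the `PropsExample` class and the one-argument `to_int` default are assumptions for illustration, not PyIceberg's actual API):

```python
class PropsExample:
    def __init__(self, table_properties):
        self.table_properties = table_properties

    def to_int(self, value, default=0):
        try:
            return int(value)
        except (TypeError, ValueError):
            return default

    def get_config(self, config, default):
        # The walrus skips int() entirely when the key is absent or empty.
        return self.to_int(value) if (value := self.table_properties.get(config)) else default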

@@ -994,6 +1065,7 @@ def refs(self) -> Dict[str, SnapshotRef]:
"""Return the snapshot references in the table."""
return self.metadata.refs

@table_commit_retry("properties")
Contributor

It depends on what we are trying to do here. There are two types of retries that we want to support:

  • Intermittent network issues, catalog temporarily not available, etc.
  • Retrying of commits because the table changed.

The first one should probably be done at the catalog level, because we also need to differentiate between the different errors and see if they are retriable.

For the second case, the one that you are solving here, we need some more logic around loading the latest version of the table. The retry is being done on the CommitFailedException, which is thrown on an HTTP 409 from the REST catalog. A 409 means a conflict: the table has changed. At this point, we only support append and overwrite operations, which don't need any conflict detection. Retrying the exact same commit would just hit the same conflict again; I believe that's Einstein's definition of insanity :)

Do we want to refresh the table metadata, and reapply the changes? I would expect like in

@pytest.mark.parametrize(
    'catalog',
    [
        lazy_fixture('catalog_memory'),
        lazy_fixture('catalog_sqlite'),
        lazy_fixture('catalog_sqlite_without_rowcount'),
    ],
)
def test_concurrent_commit_table(catalog: SqlCatalog, table_schema_simple: Schema, random_identifier: Identifier) -> None:
    database_name, _table_name = random_identifier
    catalog.create_namespace(database_name)
    table_a = catalog.create_table(random_identifier, table_schema_simple)
    table_b = catalog.load_table(random_identifier)

    with table_a.update_schema() as update:
        update.add_column(path="b", field_type=IntegerType())

    with pytest.raises(CommitFailedException, match="Requirement failed: current schema id has changed: expected 0, found 1"):
        # This one should fail since it already has been updated
        with table_b.update_schema() as update:
            update.add_column(path="c", field_type=IntegerType())

to fail after this change.

Contributor Author

My take on this is that if we implement retries in multiple places, it will be harder to prevent compounding retries, resulting in a multiplying effect on the total number of retries, which is not a behaviour we want.

I would argue we should have just one place where we apply retries and it should keep the state of the number of attempts.

We could catch multiple errors here instead of just CommitFailedException to account for network errors.

In terms of accounting for table changes, that is a good point I had not originally considered. If the table has changed we would continue seeing the same error, in which case we need to perform a refresh. Refresh is an operation on the Table class, not the Catalog. So we need to have access to Table. With the current positioning of the retry decorator, we have access to the table instance so we can refresh the table.

What we can do is refresh the table after every attempt. This is not optimal when the failure was due to a network issue, since the table may not have changed, but I would argue the penalty in that case is minimal and a reasonable price to pay for a simpler approach.
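The refresh-after-every-attempt idea can be sketched as a single retry loop (the function name `commit_with_retries` and the stand-in exception are hypothetical; the real implementation wraps `_do_commit` and uses Tenacity):

```python
import time


class CommitFailedException(Exception):  # stand-in for PyIceberg's exception
    pass


def commit_with_retries(table, do_commit, num_retries=4, min_wait_ms=100):
    """Sketch: retry the commit, refreshing the table after every failure."""
    for attempt in range(num_retries + 1):
        try:
            return do_commit()
        except CommitFailedException:
            if attempt == num_retries:
                raise
            # Refresh unconditionally: wasteful when the failure was a
            # network blip, but it keeps all retry state in one place.
            table.refresh()
            time.sleep(min_wait_ms / 1000.0)
```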

@Fokko Fokko added this to the PyIceberg 0.7.0 release milestone Feb 7, 2024
@Buktoria Buktoria force-pushed the vicky/catalog-commit-retries branch from af3f54b to 0696c76 Compare March 18, 2024 14:08
@Buktoria Buktoria force-pushed the vicky/catalog-commit-retries branch from 0696c76 to b3c468e Compare March 18, 2024 18:24
@Buktoria
Contributor Author

Buktoria commented Mar 18, 2024

So I made a fundamental change to the original design: catalogs now need to implement a function where they declare which exceptions are retryable. This becomes the bridge between the Table and the Catalog. Since Table contains an instance of Catalog, our retry wrapper can grab this list of exceptions through the Table instance.

Retrying happens within the Table object and wraps the _do_commit function.

  • Since Table calls this function, we can grab a reference to the Table object which we can then use to load the table's properties and commit_retry_exceptions.
  • With this information we can build the retry controller
  • To support executing refresh before a new attempt but after sleeping, we grab the exception the attempt received, hold on to it, and then on the next attempt but before running _do_commit we check to see if the exception requires a refresh of the table.
    • I had to do this because Tenacity does not have an after_sleep parameter, even though it supports a before_sleep parameter.
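The hold-the-exception workaround described above can be sketched roughly like this (the class name `RetryController` and method names are illustrative; only the pattern — record on failure, check before the next attempt — reflects the comment):

```python
class RetryController:
    """Hold the previous attempt's exception and, right before the next
    _do_commit runs (i.e. after sleeping, since Tenacity exposes
    before_sleep but no after_sleep hook), refresh the table if that
    exception is one the catalog marked as requiring a refresh."""

    def __init__(self, table, refresh_exceptions):
        self.table = table
        # e.g. the tuple the catalog declares as its retryable exceptions
        self.refresh_exceptions = refresh_exceptions
        self.last_exception = None

    def record(self, exc):
        # Called when an attempt fails.
        self.last_exception = exc

    def before_attempt(self):
        # Called immediately before the next _do_commit.
        if isinstance(self.last_exception, self.refresh_exceptions):
            self.table.refresh()
            self.last_exception = None
```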

Collaborator

@sungwy sungwy left a comment


Hi @Buktoria - thank you for working on this PR. The TableCommitRetry looks well organized and I think this will be a great feature enhancement to PyIceberg!

I think it will be helpful to add integration tests that simulate different retry scenarios. I think retries in Iceberg can be complex, and this will help us understand if the current implementation does successfully handle retries for a typical Iceberg commit.

As a concrete example, when a concurrent snapshot update is made to the Iceberg table, the expectation is that the next retry will be based on the new metadata as well as the new snapshot-id. In the current implementation, it looks like the table updates and requirements remain unchanged, which may lead to the commit being retried multiple times and simply failing in the end.

We use the AssertRefSnapshotId requirement when we are producing a new snapshot, to ensure that the table's snapshot ID hasn't changed. If there's a concurrent snapshot update, this condition will fail multiple times, unless we update the stored Table Requirement as well.
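A rough sketch of what updating the stored requirement could look like (the dataclass is a simplified stand-in for PyIceberg's real AssertRefSnapshotId, and `rebase_requirement` is a hypothetical helper, not existing API):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AssertRefSnapshotId:  # simplified stand-in for the real requirement
    ref: str
    snapshot_id: int


def rebase_requirement(req, refreshed_refs):
    """After a refresh, point the requirement at the *current* snapshot so
    the next retry doesn't re-assert the stale one and fail again."""
    if isinstance(req, AssertRefSnapshotId):
        return replace(req, snapshot_id=refreshed_refs[req.ref])
    return req
```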

class CustomException(Exception):
    pass


class TestTableCommitRetiesCustomError:
Collaborator

typo:

Suggested change
class TestTableCommitRetiesCustomError:
class TestTableCommitRetriesCustomError:

@sungwy sungwy removed this from the PyIceberg 0.8.0 release milestone Sep 24, 2024
@kevinjqliu kevinjqliu added this to the PyIceberg 0.9.0 release milestone Oct 30, 2024
@kevinjqliu kevinjqliu removed this from the PyIceberg 0.9.0 release milestone Feb 1, 2025
Development

Successfully merging this pull request may close these issues.

Support commit retries
4 participants