Make s3.request_timeout configurable #1568

Open · wants to merge 1 commit into main
Conversation

metadaddy
Similar to #218, we see occasional timeout errors when writing data to S3-compatible object storage:

When uploading part for key 'drivestats/data/date_month=2014-08/00000-0-9c7baab5-af18-4558-ae10-1678aa90b6a5.parquet' in bucket 'drivestats-iceberg': AWS Error NETWORK_CONNECTION during UploadPart operation: curlCode: 28, Timeout was reached

[I don't believe the issue is specific to Backblaze B2, which I'm using rather than Amazon S3; I saw references to similar error messages with the latter while researching this issue.]

The issue occurs when the underlying PUT operation takes longer than the request timeout, which defaults to 3 seconds in the AWS C++ SDK that PyArrow uses via Arrow.
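
For reference, PyArrow exposes both timeouts directly as constructor parameters on its S3 filesystem. A minimal sketch of setting them by hand (the endpoint and credentials here are placeholders matching the example below):

from pyarrow.fs import S3FileSystem

# Both timeouts are in seconds; request_timeout overrides the
# AWS C++ SDK default of 3 seconds.
fs = S3FileSystem(
    endpoint_override="http://127.0.0.1:9000",
    access_key="admin",
    secret_key="password",
    request_timeout=5.0,
    connect_timeout=20.0,
)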

The changes in this PR allow configuration of s3.request_timeout when working directly or indirectly with pyiceberg.io.pyarrow.PyArrowFileIO, just as #218 allowed configuration of s3.connect_timeout.

For example, when creating a catalog:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "docs",
    **{
        "uri": "http://127.0.0.1:8181",
        "s3.endpoint": "http://127.0.0.1:9000",
        "py-io-impl": "pyiceberg.io.pyarrow.PyArrowFileIO",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)
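
The same properties can also be passed when constructing the FileIO directly. A sketch, using the same placeholder endpoint and credentials as above:

from pyiceberg.io.pyarrow import PyArrowFileIO

# PyArrowFileIO accepts the same properties dict; the new
# s3.request-timeout entry is forwarded to the underlying
# S3 filesystem, just like s3.connect-timeout.
io = PyArrowFileIO(
    properties={
        "s3.endpoint": "http://127.0.0.1:9000",
        "s3.access-key-id": "admin",
        "s3.secret-access-key": "password",
        "s3.request-timeout": 5.0,
        "s3.connect-timeout": 20.0,
    }
)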
