feat(chunkRetryDeadline): [DO NOT MERGE] Propagate hidden chunkRetryDeadline flag and use new default for chunkRetryDeadline#4473
Conversation
- …during write stalls and introduce a helper to parse the flag.
- …chunk retry/timeout parameters in GCS requests.
- …gRPC object writers and add corresponding documentation notes.
- …ts and correcting a file path string.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           master    #4473      +/-   ##
==========================================
- Coverage   83.36%   83.34%   -0.03%
==========================================
  Files         163      163
  Lines       19972    19989      +17
==========================================
+ Hits        16649    16659      +10
- Misses       2707     2712       +5
- Partials      616      618       +2
```
Flags with carried forward coverage won't be shown.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the robustness of GCSFuse's chunk upload retry mechanism by extending the default retry deadline and ensuring the flag is propagated throughout the codebase.
Highlights
Changelog
Activity
Code Review
This pull request successfully propagates the chunkRetryDeadline configuration throughout the codebase and increases its default value to enhance robustness against write stalls. The changes are extensive but logical, and the inclusion of new integration tests is commendable. My review identifies a couple of areas for improvement related to documentation and code comments to enhance clarity and maintainability.
```diff
 // For resumable uploads, the Writer will terminate the request and attempt a retry
-// if the request to upload a particular chunk stalls for longer than this duration. Retries
-// may continue until the ChunkRetryDeadline(32s) is reached.
+// if the request to upload a particular chunk stalls for longer than this duration.
```
The updated comment for ChunkTransferTimeoutSecs has lost some important context. The previous version mentioned that retries would continue until ChunkRetryDeadline is reached. This relationship is key to understanding how the two timeouts work together. Please consider re-adding this information for clarity.
```diff
-// if the request to upload a particular chunk stalls for longer than this duration.
+// if the request to upload a particular chunk stalls for longer than this duration. Retries
+// may continue until ChunkRetryDeadlineSecs is reached.
```
```go
			expectedSuccess: false,
		},
	}
	// 4 stalls of 60s each causing 4 retry chunk stalls and 5th retry succeeds after 40s.
```
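The timeline in that comment can be checked with quick arithmetic: four attempts each stall for 60s before being cut off, and the fifth attempt succeeds after 40s, so success arrives after roughly 4 × 60 + 40 = 280 seconds of wall time. The deadline the test actually configures is not shown here; this sketch only computes the elapsed time.

```go
package main

import "fmt"

func main() {
	// Four stalled attempts of 60s each, then a fifth attempt
	// that succeeds after 40s.
	stalls, stallSecs, finalSecs := 4, 60, 40
	total := stalls*stallSecs + finalSecs
	fmt.Println(total) // elapsed seconds before the upload succeeds
}
```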
Hi @vadlakondaswetha, @abhishek10004, @Tulsishah, your feedback is needed to move this pull request forward. This automated reminder was triggered because there has been no activity for over 24 hours. Please provide your input when you have a moment. Thank you! |
```diff
 func (t *fileTest) Test_ReadWithMrdKernelReader_NotAuthoritative() {
 	// 1. Setup
-	zonalBucket := gcsx.NewSyncerBucket(1, 10, ".gcsfuse_tmp/", fake.NewFakeBucket(&t.clock, "zonal_bucket", gcs.BucketType{Zonal: true}))
+	zonalBucket := gcsx.NewSyncerBucket(1 /* appendThreshold */, 120 /* chunkRetryDeadlineSecs */, 10 /* chunkTransferTimeoutSecs */, ".gcsfuse_tmp/", fake.NewFakeBucket(&t.clock, "zonal_bucket", gcs.BucketType{Zonal: true}))
```
The comment style here looks odd; please check the recommended way to annotate arguments in Go.
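One common alternative to positional arguments annotated with `/* name */` comments is an options struct, which keeps call sites self-describing without inline comments. This is only a sketch of that pattern; `SyncerBucketOptions` is a hypothetical type, and `gcsx.NewSyncerBucket`'s real signature is positional.

```go
package main

import "fmt"

// SyncerBucketOptions is a hypothetical options struct; field names
// mirror the parameters annotated in the diff above.
type SyncerBucketOptions struct {
	AppendThreshold          int64
	ChunkRetryDeadlineSecs   int64
	ChunkTransferTimeoutSecs int64
	TmpDir                   string
}

// NewSyncerBucket sketches a constructor taking the options struct;
// here it just describes its inputs instead of building a real bucket.
func NewSyncerBucket(opts SyncerBucketOptions) string {
	return fmt.Sprintf("syncer(appendThreshold=%d, chunkRetryDeadlineSecs=%d, chunkTransferTimeoutSecs=%d, tmpDir=%s)",
		opts.AppendThreshold, opts.ChunkRetryDeadlineSecs, opts.ChunkTransferTimeoutSecs, opts.TmpDir)
}

func main() {
	fmt.Println(NewSyncerBucket(SyncerBucketOptions{
		AppendThreshold:          1,
		ChunkRetryDeadlineSecs:   120,
		ChunkTransferTimeoutSecs: 10,
		TmpDir:                   ".gcsfuse_tmp/",
	}))
}
```

The trade-off is churn at existing call sites, so sticking with `/* name */` comments on a stable positional signature is also a reasonable choice.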
Description
The PR aims to provide better control over how GCSFuse handles retries for chunk uploads, particularly during write stalls. It increases the default chunkRetryDeadline from 32 seconds to 120 seconds, allowing more time for retries of temporary failures (such as 503 errors or stalls caused by lock contention during high-scale checkpoint workloads) to succeed.
Changes:
Link to the issue in case of a bug fix.
b/491635218
Testing details
Any backward incompatible change? If so, please explain.
None