feat(chunkRetryDeadline): [DO NOT MERGE] Propagate hidden chunkRetryDeadline flag and use new default for chunkRetryDeadline#4473

Open
meet2mky wants to merge 9 commits into master from mohit-checkpoint-fix

Conversation

@meet2mky
Collaborator

@meet2mky meet2mky commented Mar 12, 2026

Description

This PR aims to give better control over how GCSFuse retries chunk uploads, particularly during write stalls. It increases the default chunkRetryDeadline from 32 seconds to 120 seconds, allowing more time for retries of temporary failures (such as 503 errors, or stalls caused by lock contention during high-scale checkpoint workloads) to succeed.
Changes:

  • It propagates the hidden chunk-retry-deadline-secs flag to ensure the configurable deadline is respected throughout the object writing process.
  • The default value for the chunk retry deadline is raised from 32 seconds to 120 seconds to improve robustness.
  • The PR adds documentation notes for gRPC object writers clarifying that chunkRetryDeadline and chunkTransferTimeout do not apply there.
  • New integration tests have been added to verify the behavior of the chunk-retry-deadline-secs flag during write stalls, along with a helper function to parse the flag.
  • Several constants related to chunk retries and timeouts in inode tests were unexported to improve the codebase's internal structure.

Link to the issue in case of a bug fix.

b/491635218

Testing details

  1. Manual - Yes
  2. Unit tests - Yes
  3. Integration tests - Part of pre-submits

Any backward incompatible change? If so, please explain.

None

@meet2mky meet2mky added execute-integration-tests Run only integration tests execute-integration-tests-on-zb To run E2E tests on zonal bucket. labels Mar 12, 2026
@meet2mky meet2mky changed the title feat(chunkRetryDeadline): Propagate hidden chunkRetryDeadline flag and use new default for chunkRetryDeadline feat(chunkRetryDeadline): [DO NOT MERGE] Propagate hidden chunkRetryDeadline flag and use new default for chunkRetryDeadline Mar 12, 2026
@codecov
codecov bot commented Mar 12, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 83.34%. Comparing base (2aa2028) to head (f86d18e).
⚠️ Report is 11 commits behind head on master.

Files with missing lines | Patch % | Lines
cmd/mount.go | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4473      +/-   ##
==========================================
- Coverage   83.36%   83.34%   -0.03%     
==========================================
  Files         163      163              
  Lines       19972    19989      +17     
==========================================
+ Hits        16649    16659      +10     
- Misses       2707     2712       +5     
- Partials      616      618       +2     
Flag | Coverage Δ
unittests | 83.34% <94.44%> (-0.03%) ⬇️

@meet2mky meet2mky marked this pull request as ready for review March 12, 2026 15:49
@meet2mky meet2mky requested review from a team and Tulsishah as code owners March 12, 2026 15:49
@github-actions github-actions bot added the remind-reviewers Auto remind reviewers in attention set for review post 24hrs of inactivity on PR. label Mar 12, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the robustness of GCSFuse's chunk upload retry mechanism. By extending the default retry deadline and ensuring the chunkRetryDeadline flag is consistently applied, the system is better equipped to handle transient network issues and write stalls, particularly in demanding checkpoint workloads. The changes also include comprehensive testing and clearer documentation for developers.

Highlights

  • Increased Default Chunk Retry Deadline: The default chunkRetryDeadline for chunk uploads has been increased from 32 seconds to 120 seconds to provide more time for successful retries during temporary failures or write stalls.
  • Propagated chunkRetryDeadline Flag: The hidden chunk-retry-deadline-secs flag is now propagated throughout the object writing process, ensuring that the configurable deadline is respected.
  • New Integration Tests: New integration tests have been added to verify the behavior of the chunk-retry-deadline-secs flag, specifically focusing on scenarios involving write stalls.
  • Documentation Updates: Documentation notes have been added for gRPC object writers, clarifying that chunkRetryDeadline and chunkTransferTimeout do not apply to them.
  • Codebase Refinements: Several constants related to chunk retries and timeouts in inode tests were unexported to improve the internal structure of the codebase.
Changelog
  • cmd/mount.go
    • Propagated ChunkRetryDeadlineSecs to the MountConfig structure.
  • internal/bufferedwrites/buffered_write_handler.go
    • Added ChunkRetryDeadlineSecs field to CreateBWHandlerRequest and passed it during handler creation.
  • internal/bufferedwrites/buffered_write_handler_test.go
    • Defined chunkRetryDeadlineSecs constant and incorporated it into test setups for CreateBWHandlerRequest.
  • internal/bufferedwrites/upload_handler.go
    • Introduced chunkRetryDeadline field to UploadHandler and ChunkRetryDeadlineSecs to CreateUploadHandlerRequest, passing it to gcs.NewCreateObjectRequest.
  • internal/bufferedwrites/upload_handler_test.go
    • Passed chunkRetryDeadlineSecs to CreateUploadHandlerRequest in upload handler tests.
  • internal/fs/fs_test.go
    • Added chunkRetryDeadlineSecs to fakeBucketManager and updated gcsx.NewSyncerBucket calls.
  • internal/fs/gcs_metrics_test.go
    • Updated gcsx.NewSyncerBucket instantiation to include the new chunkRetryDeadlineSecs parameter.
  • internal/fs/handle/dir_handle_test.go
    • Modified gcsx.NewSyncerBucket calls to pass chunkRetryDeadlineSecs.
  • internal/fs/handle/file_test.go
    • Updated multiple gcsx.NewSyncerBucket calls to include the chunkRetryDeadlineSecs parameter.
  • internal/fs/inode/base_dir_test.go
    • Defined chunkRetryDeadlineSecs constant and updated gcsx.NewSyncerBucket calls in tests.
  • internal/fs/inode/core_test.go
    • Updated gcsx.NewSyncerBucket call to include chunkRetryDeadlineSecs.
  • internal/fs/inode/dir_prefetcher_test.go
    • Modified gcsx.NewSyncerBucket call to pass chunkRetryDeadlineSecs.
  • internal/fs/inode/dir_test.go
    • Updated gcsx.NewSyncerBucket calls to incorporate chunkRetryDeadlineSecs.
  • internal/fs/inode/file.go
    • Passed ChunkRetryDeadlineSecs from the configuration to bufferedwrites.CreateBWHandlerRequest.
  • internal/fs/inode/file_mock_bucket_test.go
    • Updated gcsx.NewSyncerBucket calls to include chunkRetryDeadlineSecs.
  • internal/fs/inode/file_streaming_writes_test.go
    • Modified gcsx.NewSyncerBucket call to pass chunkRetryDeadlineSecs.
  • internal/fs/inode/file_test.go
    • Updated gcsx.NewSyncerBucket call to include chunkRetryDeadlineSecs.
  • internal/fs/inode/hns_dir_test.go
    • Modified gcsx.NewSyncerBucket call to pass chunkRetryDeadlineSecs.
  • internal/fs/inode/recursive_cancellation_test.go
    • Updated gcsx.NewSyncerBucket call to include chunkRetryDeadlineSecs.
  • internal/fs/inode/symlink_test.go
    • Modified gcsx.NewSyncerBucket call to pass chunkRetryDeadlineSecs.
  • internal/gcsx/bucket_manager.go
    • Added ChunkRetryDeadlineSecs to BucketConfig and passed it when setting up buckets.
  • internal/gcsx/compose_object_creator.go
    • Added chunkRetryDeadlineSecs parameter to the Create method and passed it to gcs.NewCreateObjectRequest.
  • internal/gcsx/compose_object_creator_test.go
    • Passed chunkRetryDeadlineSecs to composeObjectCreator.Create in tests.
  • internal/gcsx/integration_test.go
    • Defined chunkRetryDeadlineSecs constant and passed it to gcsx.NewSyncer.
  • internal/gcsx/syncer.go
    • Added chunkRetryDeadlineSecs parameter to NewSyncer and newSyncer functions, the syncer struct, and the objectCreator interface, ensuring it's passed in Create calls.
  • internal/gcsx/syncer_bucket.go
    • Added chunkRetryDeadlineSecs parameter to NewSyncerBucket and passed it to NewSyncer.
  • internal/gcsx/syncer_test.go
    • Added chunkRetryDeadlineSecs parameter to fakeObjectCreator.Create, defined it as a constant, and passed it to newSyncer and fullCreator.Create in tests.
  • internal/storage/bucket_handle.go
    • Set wc.ChunkRetryDeadline using req.ChunkRetryDeadlineSecs in CreateObject and CreateObjectChunkWriter.
  • internal/storage/gcs/request.go
    • Added ChunkRetryDeadlineSecs field to CreateObjectRequest with detailed documentation and a default value of 120 seconds.
  • internal/storage/gcs/request_helper.go
    • Added chunkRetryDeadlineSecs parameter to NewCreateObjectRequest and used it to populate the request.
  • internal/storage/gcs/request_helper_test.go
    • Updated NewCreateObjectRequest calls in tests to include the chunkRetryDeadlineSecs parameter.
  • tools/integration_tests/emulator_tests/configs/write_stalls_four_times_60s.yaml
    • Added a new configuration file for emulator tests to simulate write stalls with specific retry instructions.
  • tools/integration_tests/emulator_tests/util/test_helper.go
    • Added GetChunkRetryDeadlineFromFlags function to parse the chunk-retry-deadline-secs flag from command-line arguments.
  • tools/integration_tests/emulator_tests/write_stall/writes_stall_on_sync_test.go
    • Added TestChunkRetryDeadline with scenarios to test successful and failed writes based on the chunk-retry-deadline-secs setting during simulated stalls.
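
As an illustration of the flag-parsing helper mentioned in the changelog, here is a minimal sketch. It is not the PR's GetChunkRetryDeadlineFromFlags: the function name parseChunkRetryDeadlineSecs, the `--chunk-retry-deadline-secs=N` form with an equals sign, and the fallback to the 120-second default are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultChunkRetryDeadlineSecs mirrors the new default described in this PR.
const defaultChunkRetryDeadlineSecs = 120

// parseChunkRetryDeadlineSecs (hypothetical name) scans mount flags for
// "--chunk-retry-deadline-secs=N" and returns N. If the flag is absent or N
// is not an integer, it returns the default.
func parseChunkRetryDeadlineSecs(flags []string) int {
	const prefix = "--chunk-retry-deadline-secs="
	for _, f := range flags {
		if !strings.HasPrefix(f, prefix) {
			continue
		}
		if secs, err := strconv.Atoi(strings.TrimPrefix(f, prefix)); err == nil {
			return secs
		}
	}
	return defaultChunkRetryDeadlineSecs
}

func main() {
	flags := []string{"--implicit-dirs", "--chunk-retry-deadline-secs=30"}
	fmt.Println(parseChunkRetryDeadlineSecs(flags)) // prints "30"
}
```

A test can then derive its expected outcome (write succeeds or fails) from the parsed value rather than hard-coding it per flag set.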
Activity
  • Manual testing was performed to validate the changes.
  • Unit tests were executed and passed.
  • Integration tests were run as part of pre-submits to ensure system-level functionality.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully propagates the chunkRetryDeadline configuration throughout the codebase and increases its default value to enhance robustness against write stalls. The changes are extensive but logical, and the inclusion of new integration tests is commendable. My review identifies a couple of areas for improvement related to documentation and code comments to enhance clarity and maintainability.

// For resumable uploads, the Writer will terminate the request and attempt a retry
- // if the request to upload a particular chunk stalls for longer than this duration. Retries
- // may continue until the ChunkRetryDeadline(32s) is reached.
+ // if the request to upload a particular chunk stalls for longer than this duration.
Contributor

medium

The updated comment for ChunkTransferTimeoutSecs has lost some important context. The previous version mentioned that retries would continue until ChunkRetryDeadline is reached. This relationship is key to understanding how the two timeouts work together. Please consider re-adding this information for clarity.

Suggested change
- // if the request to upload a particular chunk stalls for longer than this duration.
+ // if the request to upload a particular chunk stalls for longer than this duration. Retries
+ // may continue until ChunkRetryDeadlineSecs is reached.

expectedSuccess: false,
},
}
// 4 stalls of 60s each causing 4 retry chunk stalls and 5th retry succeeds after 40s.
Contributor

medium

This comment is confusing as it doesn't seem to align with the test's configuration. The proxy is set to stall 4 times, but the comment mentions a 5th retry. To avoid misleading future developers, it would be best to remove this comment or update it to accurately describe the test scenario.

@github-actions

Hi @vadlakondaswetha, @abhishek10004, @Tulsishah, your feedback is needed to move this pull request forward. This automated reminder was triggered because there has been no activity for over 24 hours. Please provide your input when you have a moment. Thank you!

func (t *fileTest) Test_ReadWithMrdKernelReader_NotAuthoritative() {
// 1. Setup
- zonalBucket := gcsx.NewSyncerBucket(1, 10, ".gcsfuse_tmp/", fake.NewFakeBucket(&t.clock, "zonal_bucket", gcs.BucketType{Zonal: true}))
+ zonalBucket := gcsx.NewSyncerBucket(1 /* appendThreshold */, 120 /* chunkRetryDeadlineSecs */, 10 /* chunkTransferTimeoutSecs */, ".gcsfuse_tmp/", fake.NewFakeBucket(&t.clock, "zonal_bucket", gcs.BucketType{Zonal: true}))
Collaborator

The inline comments here look odd. Please check what the recommended way to do this is in Go.
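
One common alternative to inline /* ... */ argument comments is to pass named constants, so the call site documents itself. The sketch below is only an illustration of that idiom: syncerSettings and newSyncerSettings are hypothetical stand-ins, not gcsx.NewSyncerBucket, and the constant values are copied from the call under review.

```go
package main

import "fmt"

// Named constants make each positional argument self-describing at the call site.
const (
	appendThreshold          = 1
	chunkRetryDeadlineSecs   = 120
	chunkTransferTimeoutSecs = 10
	tmpObjectPrefix          = ".gcsfuse_tmp/"
)

// syncerSettings is a hypothetical stand-in that records the arguments.
type syncerSettings struct {
	AppendThreshold          int64
	ChunkRetryDeadlineSecs   int64
	ChunkTransferTimeoutSecs int64
	TmpObjectPrefix          string
}

// newSyncerSettings is a hypothetical constructor with the same argument
// order as the call under review.
func newSyncerSettings(threshold, retryDeadlineSecs, transferTimeoutSecs int64, prefix string) syncerSettings {
	return syncerSettings{
		AppendThreshold:          threshold,
		ChunkRetryDeadlineSecs:   retryDeadlineSecs,
		ChunkTransferTimeoutSecs: transferTimeoutSecs,
		TmpObjectPrefix:          prefix,
	}
}

func main() {
	s := newSyncerSettings(appendThreshold, chunkRetryDeadlineSecs, chunkTransferTimeoutSecs, tmpObjectPrefix)
	fmt.Printf("%+v\n", s)
}
```

This matches what several of the updated test files already do, according to the changelog, by defining a chunkRetryDeadlineSecs constant.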

@github-actions

Hi @abhishek10004, @Tulsishah, your feedback is needed to move this pull request forward. This automated reminder was triggered because there has been no activity for over 24 hours. Please provide your input when you have a moment. Thank you!
