@ldming ldming commented Oct 14, 2025

Problem Description

The addon controller had issues with stuck jobs that could prevent addon updates from being applied:

  1. No timeout mechanism: Jobs without activeDeadlineSeconds could run indefinitely if their pods failed to start (e.g., due to image pull errors)
  2. Missing generation tracking: When an addon was updated, jobs from the previous generation might still be running, but the controller did not clean them up
  3. Blocking behavior: New addon updates would wait indefinitely for stuck jobs to complete

Root Cause

When an addon is updated (e.g., by a configuration change), its generation increases. However, if existing jobs were stuck (due to ImagePullBackOff or other pod startup issues), the controller waited for them indefinitely, so the new update was never applied.

Solution

1. Job Timeout Configuration

  • Added activeDeadlineSeconds to all helm jobs (default: 5 minutes)
  • Configurable via environment variable KUBEBLOCKS_ADDON_JOB_TIMEOUT
  • Prevents jobs from running indefinitely
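
As a rough illustration of the timeout resolution described above, here is a minimal, stdlib-only sketch. It assumes the environment variable holds a positive integer number of seconds; the actual value format accepted by the controller is not stated in this PR. `jobSpec`, `jobTimeoutSeconds`, and `applyJobTimeout` are hypothetical names, and `jobSpec` is a simplified stand-in for the Kubernetes Job spec rather than the real type.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultJobTimeoutSeconds mirrors the 5 minute default from the PR,
// expressed in seconds because activeDeadlineSeconds is an integer
// number of seconds.
const defaultJobTimeoutSeconds int64 = 300

// jobSpec is a simplified, hypothetical stand-in for a Kubernetes Job spec.
type jobSpec struct {
	ActiveDeadlineSeconds *int64
}

// jobTimeoutSeconds parses raw as a positive integer number of seconds.
// ASSUMPTION: the env var format is integer seconds; the real controller
// may accept a different format. Empty, invalid, or non-positive values
// fall back to the default.
func jobTimeoutSeconds(raw string) int64 {
	if raw == "" {
		return defaultJobTimeoutSeconds
	}
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || v <= 0 {
		return defaultJobTimeoutSeconds
	}
	return v
}

// applyJobTimeout sets the deadline on the job spec so the job cannot
// stay active longer than the resolved timeout.
func applyJobTimeout(spec *jobSpec, raw string) {
	timeout := jobTimeoutSeconds(raw)
	spec.ActiveDeadlineSeconds = &timeout
}

func main() {
	spec := &jobSpec{}
	applyJobTimeout(spec, os.Getenv("KUBEBLOCKS_ADDON_JOB_TIMEOUT"))
	fmt.Println("activeDeadlineSeconds:", *spec.ActiveDeadlineSeconds)
}
```

Falling back to the default on invalid input, rather than failing, is a design choice made for this sketch only; it keeps a misconfigured environment from disabling the deadline altogether.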

2. Generation Tracking

  • Added generation annotation addon.kubeblocks.io/generation to jobs
  • Tracks which addon generation each job belongs to
  • Enables automatic cleanup of outdated jobs
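
The generation tracking above might look roughly like the following sketch. It assumes the annotation value is the addon generation formatted as a decimal string (Kubernetes annotation values are always strings, but the exact encoding used by the controller is not shown in this PR). `jobMeta` and `stampGeneration` are hypothetical names, and `jobMeta` is a simplified stand-in for Kubernetes object metadata.

```go
package main

import (
	"fmt"
	"strconv"
)

// addonGenerationAnnotation is the annotation key named in the PR.
const addonGenerationAnnotation = "addon.kubeblocks.io/generation"

// jobMeta is a simplified, hypothetical stand-in for a job's object metadata.
type jobMeta struct {
	Name        string
	Annotations map[string]string
}

// stampGeneration records which addon generation the job was created for.
// ASSUMPTION: the value is stored as a base-10 string.
func stampGeneration(job *jobMeta, addonGeneration int64) {
	if job.Annotations == nil {
		job.Annotations = map[string]string{}
	}
	job.Annotations[addonGenerationAnnotation] = strconv.FormatInt(addonGeneration, 10)
}

func main() {
	job := &jobMeta{Name: "install-example-addon"}
	stampGeneration(job, 3)
	fmt.Println(job.Name, job.Annotations)
}
```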

3. Outdated Job Detection and Cleanup

  • Added an isJobOutdated() function that checks whether a job belongs to an older addon generation
  • Outdated jobs are deleted automatically when the addon is updated
  • New jobs can then be created for the current generation
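
A minimal sketch of the outdated-job check follows. It is not the implementation from this PR: the signature of `isJobOutdated` here is assumed, `jobMeta` and `partitionJobs` are hypothetical, and the handling of jobs with a missing or unparsable annotation (treated as not outdated) is a conservative choice made for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
)

// addonGenerationAnnotation is the annotation key named in the PR.
const addonGenerationAnnotation = "addon.kubeblocks.io/generation"

// jobMeta is a simplified, hypothetical stand-in for a job's object metadata.
type jobMeta struct {
	Name        string
	Annotations map[string]string
}

// isJobOutdated reports whether the job was created for an addon generation
// older than the current one.
// ASSUMPTION: jobs without a parsable annotation are left alone.
func isJobOutdated(job jobMeta, addonGeneration int64) bool {
	raw, ok := job.Annotations[addonGenerationAnnotation]
	if !ok {
		return false
	}
	gen, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return false
	}
	return gen < addonGeneration
}

// partitionJobs splits jobs into those to keep and those to delete, so the
// caller can delete the outdated ones before creating a job for the
// current generation.
func partitionJobs(jobs []jobMeta, addonGeneration int64) (keep, remove []jobMeta) {
	for _, j := range jobs {
		if isJobOutdated(j, addonGeneration) {
			remove = append(remove, j)
		} else {
			keep = append(keep, j)
		}
	}
	return keep, remove
}

func main() {
	jobs := []jobMeta{
		{Name: "install-gen1", Annotations: map[string]string{addonGenerationAnnotation: "1"}},
		{Name: "install-gen2", Annotations: map[string]string{addonGenerationAnnotation: "2"}},
		{Name: "legacy-no-annotation"},
	}
	keep, remove := partitionJobs(jobs, 2)
	fmt.Println("keep:", len(keep), "delete:", len(remove))
}
```

In a real controller the deletion would go through the Kubernetes API; the sketch only shows the decision logic.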

Changes Made

  • controllers/extensions/addon_controller_stages.go:

    • Add job timeout configuration (5 minutes default)
    • Add generation annotation to all helm jobs
    • Implement outdated job detection and cleanup logic
    • Add isJobOutdated() helper function
  • controllers/extensions/const.go:

    • Add AddonGeneration constant for annotation key
  • controllers/extensions/addon_controller_test.go:

    • Add comprehensive test case for job cleanup scenarios
    • Verify generation tracking and timeout configuration
  • Configuration files:

    • Add support for KUBEBLOCKS_ADDON_JOB_TIMEOUT environment variable

Testing

  • Added the test case "should cleanup outdated jobs when addon is updated"
  • Verifies that outdated jobs are deleted when the addon generation changes
  • Ensures new jobs are created with the correct generation annotation and timeout

Benefits

  1. Prevents blocking: New addon updates can proceed without waiting for stuck jobs
  2. Resource cleanup: Outdated jobs are automatically cleaned up
  3. Configurable timeout: Administrators can adjust job timeout based on their environment
  4. Better reliability: Reduces the chance of addon updates getting stuck indefinitely

This change ensures that addon updates are more reliable and responsive, especially in environments where image pull issues or other pod startup problems might occur.

…dates

- Add job timeout (5 minutes default) to prevent indefinite hanging
- Implement generation tracking for jobs to enable cleanup of outdated jobs
- Add automatic cleanup of outdated jobs when addon generation changes
- Ensure new addon updates can proceed without waiting for stuck jobs
- Add comprehensive test coverage for job cleanup scenarios
- Use constants for annotation keys to improve maintainability

This resolves issues where addon updates were blocked by stuck jobs
due to image pull failures or other pod startup issues.
@ldming ldming requested a review from a team as a code owner October 14, 2025 01:03
@apecloud-bot

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.0

@github-actions github-actions bot added the size/L Denotes a PR that changes 100-499 lines. label Oct 14, 2025
@ldming ldming marked this pull request as draft October 14, 2025 02:18
@ldming ldming force-pushed the support/improve-addon-controller-when-the-job-hang branch from 63cae6b to adeb054 Compare October 14, 2025 02:39