Add Controller Revision (Implementation of KEP #238) #277

Merged
merged 27 commits into from
Dec 28, 2024

Conversation

Edwinhr716
Contributor

@Edwinhr716 Edwinhr716 commented Dec 6, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it

This PR adds controller revisions. They allow the controller to store previous iterations of the LWS object, which makes it possible to select the correct pod template spec when a replica is restarted during a rolling update.
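A minimal sketch of the idea, not the code in this PR: it assumes each replica remembers a revision key and that the stored history can be looked up by that key. `PodTemplate`, `RevisionHistory`, `Record`, and `TemplateFor` are hypothetical names, and the history is kept in memory here rather than in the cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// PodTemplate is a hypothetical stand-in for the pod template spec captured in a revision.
type PodTemplate struct {
	Image string
}

// RevisionHistory maps a revision key to the template stored for that revision.
// Assumption: an in-memory map is used purely for illustration.
type RevisionHistory struct {
	templates map[string]PodTemplate
}

// NewRevisionHistory returns an empty history.
func NewRevisionHistory() *RevisionHistory {
	return &RevisionHistory{templates: make(map[string]PodTemplate)}
}

// Record stores a template under its revision key.
func (h *RevisionHistory) Record(key string, tpl PodTemplate) {
	h.templates[key] = tpl
}

// ErrRevisionNotFound is returned when no template is stored for a key.
var ErrRevisionNotFound = errors.New("revision not found")

// TemplateFor returns the template that was recorded under the given revision key.
func (h *RevisionHistory) TemplateFor(key string) (PodTemplate, error) {
	tpl, ok := h.templates[key]
	if !ok {
		return PodTemplate{}, fmt.Errorf("%w: %q", ErrRevisionNotFound, key)
	}
	return tpl, nil
}

func main() {
	h := NewRevisionHistory()
	h.Record("rev-1", PodTemplate{Image: "server:v1"})
	h.Record("rev-2", PodTemplate{Image: "server:v2"})

	// A replica still on rev-1 restarts mid-update, so it is rebuilt
	// from the rev-1 template rather than the newest one.
	tpl, err := h.TemplateFor("rev-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(tpl.Image) // prints "server:v1"
}
```

Without the stored history, only the latest template would be available, and a restarted replica that has not yet been updated would come back on the new template.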

Which issue(s) this PR fixes

Fixes #238
Fixes #239
Fixes #240
Fixes #281
Fixes #291

Special notes for your reviewer

Does this PR introduce a user-facing change?


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 6, 2024
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g December 6, 2024 19:11
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 6, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 16, 2024
Contributor

@ahg-g ahg-g left a comment

First round, I didn't look at tests.

Contributor

@ahg-g ahg-g left a comment

more comments on simplifying the history pkg

@Edwinhr716
Contributor Author

Addressed the majority of the comments. I still need to add more tests and rebase before this is ready to merge.

@ahg-g
Contributor

ahg-g commented Dec 23, 2024

Great, I will look later today. Please ensure we have sufficient integration test coverage.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 23, 2024
@@ -387,6 +410,9 @@ func (r *LeaderWorkerSetReconciler) updateConditions(ctx context.Context, lws *l
		conditions = append(conditions, makeCondition(leaderworkerset.LeaderWorkerSetUpgradeInProgress))
	} else if updatedAndReadyCount == int(*lws.Spec.Replicas) {
		conditions = append(conditions, makeCondition(leaderworkerset.LeaderWorkerSetAvailable))
		if err := revisionutils.TruncateHistory(ctx, r.Client, lws, templateHash); err != nil {
Contributor

As a follow-up, we can store the current and updated revisions for debugging purposes. The current one can be set here, while the updated one can be set in the if block above.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 27, 2024
@Edwinhr716
Contributor Author

Finished addressing the comments. I agree that it makes sense to rename templateHash to revisionKey, and I made that change in the latest commit.

@ahg-g ahg-g changed the title from "[WIP] Add Controller Revision (Implementation of KEP #238)" to "Add Controller Revision (Implementation of KEP #238)" Dec 27, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 27, 2024
@Edwinhr716
Contributor Author

Edwinhr716 commented Dec 28, 2024

Addressed the comments. The tests are much flakier now that everything uses GetRevisionKey(), and I'm unsure why. I'll retest later to see if that helps.

FYI, I do want to add integration tests before merging, which is why I had the WIP tag in the title.

		return fmt.Errorf("StatefulSet did not have the expected container name")
	}
	return nil
}, timeout, interval).Should(gomega.Succeed())
Contributor

This test will be flaky. How do you guarantee that the update isn't faster than your check here?

Contributor Author

@Edwinhr716 Edwinhr716 Dec 28, 2024

There is no guarantee; that is a risk we take when checking the state mid-update, and other e2e tests have a similar issue. All I can do is use as many replicas as possible so that the update takes a long time to reach the "-0" leader StatefulSet. It hasn't failed after many runs so far, though.

Contributor

We can't have tests that don't offer guarantees. Those will flake, which means that when they fail, we don't know whether the cause is a bug or the race condition. The test you are trying to write here should be an integration test, where we control update progress.

Contributor Author

Right, but other update tests also don't offer guarantees. For example, in this one, https://github.com/kubernetes-sigs/lws/blob/main/test/e2e/e2e_test.go#L181, what guarantees that the update isn't done before the check is performed? I guess I'm just confused as to why we can do it there but not here, since both check the state mid-update.
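To make the distinction in this thread concrete, here is a minimal sketch of what "controlling update progress" can look like. It is not taken from the LWS test suite: `rollout`, `newRollout`, and `Step` are hypothetical, and the highest-index-first ordering is an assumption based on the remark that the update reaches the "-0" leader last.

```go
package main

import "fmt"

// rollout is a hypothetical stand-in for a rolling update whose progress is
// driven explicitly by the test instead of by a live controller.
type rollout struct {
	revisions []string // revision of each replica, index 0 first
	target    string   // revision being rolled out
	next      int      // index of the next replica to update; counts down
}

// newRollout creates a rollout where every replica starts on the "from" revision.
func newRollout(replicas int, from, to string) *rollout {
	revs := make([]string, replicas)
	for i := range revs {
		revs[i] = from
	}
	return &rollout{revisions: revs, target: to, next: replicas - 1}
}

// Step updates exactly one replica, highest index first, and reports whether
// a replica was updated by this call.
func (r *rollout) Step() bool {
	if r.next < 0 {
		return false
	}
	r.revisions[r.next] = r.target
	r.next--
	return true
}

func main() {
	r := newRollout(3, "rev-1", "rev-2")

	r.Step() // only replica 2 has been updated
	// The mid-update state stays fixed until Step is called again, so a
	// check made here cannot race with the update.
	fmt.Println(r.revisions) // prints "[rev-1 rev-1 rev-2]"

	for r.Step() {
	}
	fmt.Println(r.revisions) // prints "[rev-2 rev-2 rev-2]"
}
```

In an e2e test the controller advances on its own schedule, so the same mid-update assertion depends on timing; adding replicas only makes the race less likely, it does not remove it.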

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 28, 2024
@k8s-ci-robot
Contributor

k8s-ci-robot commented Dec 28, 2024

@Edwinhr716: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                    | Commit  | Details | Required | Rerun command
pull-lws-test-e2e-main-1-28  | 0ff8f71 | link    | true     | /test pull-lws-test-e2e-main-1-28

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


gomega.Eventually(func() (bool, error) {
	return initialLeaderPod.UID == midUpdateLeaderPod.UID, nil
}, timeout, interval).Should(gomega.Equal(false))
Contributor

Why is this in an Eventually clause? Nothing is changing inside it.
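The point can be illustrated with a small sketch that does not use gomega. `eventually` below is a hypothetical, simplified poll helper, and `pod` and `fetch` are stand-ins for a real object and a real read. Polling a closure that only compares two values captured earlier gives the same answer on every attempt, whereas re-reading inside the closure lets the poll observe a change.

```go
package main

import (
	"fmt"
	"time"
)

// eventually is a hypothetical, simplified poll helper: it calls check every
// interval until check returns true or the timeout elapses.
func eventually(check func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

// pod is a stand-in for a fetched object.
type pod struct{ UID string }

// demo returns the result of polling over captured values (stale) and of
// polling with a fresh read on every attempt (fresh).
func demo() (stale, fresh bool) {
	calls := 0
	// fetch simulates reading the pod; the pod is replaced after the third read.
	fetch := func() pod {
		calls++
		if calls > 3 {
			return pod{UID: "new"}
		}
		return pod{UID: "old"}
	}

	initial := fetch()
	snapshot := fetch() // captured once, before polling starts

	// Both values are fixed, so every poll attempt evaluates the same comparison.
	stale = eventually(func() bool {
		return initial.UID != snapshot.UID
	}, 50*time.Millisecond, 5*time.Millisecond)

	// Reading inside the closure gives each attempt a chance to see the new pod.
	fresh = eventually(func() bool {
		return initial.UID != fetch().UID
	}, 50*time.Millisecond, 5*time.Millisecond)

	return stale, fresh
}

func main() {
	stale, fresh := demo()
	fmt.Println(stale, fresh) // prints "false true"
}
```

When both values are already in hand, a direct assertion expresses the same check without the polling loop.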

@ahg-g
Contributor

ahg-g commented Dec 28, 2024

/lgtm
/approve

Will use #294 to track integration tests

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Edwinhr716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 28, 2024
@k8s-ci-robot k8s-ci-robot merged commit c03d4c9 into kubernetes-sigs:main Dec 28, 2024
8 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
3 participants