-
Notifications
You must be signed in to change notification settings - Fork 473
[FLINK-37515] Basic support for Blue/Green deployments #969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release-1.11
Are you sure you want to change the base?
[FLINK-37515] Basic support for Blue/Green deployments #969
Conversation
...tor-api/src/main/java/org/apache/flink/kubernetes/operator/api/bluegreen/DeploymentType.java
Show resolved
Hide resolved
...main/java/org/apache/flink/kubernetes/operator/api/status/FlinkBlueGreenDeploymentState.java
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
I am concerned about the number of TODOs. Things like dealing with errors and timing conditions should be in place in the initial drop of code - unless we can mark this capability as beta or the like. WDYT @gyfora ? |
@davidradl thanks for your review/comments. I've cleaned up all the outdated/invalid TODOs which pointed me 1 missing unit test, added. At this point the few remaining TODOs do not interfere with the main functionality and are not critical/open for discussion. |
… BASIC is supported).Tests to confirm the deployments are deleted accurately after the specified deletion delay.
👍 |
when this feature gonna be released? |
Hi @hansh0801, trying to release it ASAP but I also want to make sure the logic and contracts are as robust as possible to minimize changes later. Looks pretty stable in my current testing and I've reopened it for review. Thanks! |
cool ! and I'm also waiting for blue green phase 2 :) |
@hansh0801 indeed... Phase 2 will take a little bit longer due to its complexity but already working on it. |
...tor-api/src/main/java/org/apache/flink/kubernetes/operator/api/bluegreen/DeploymentType.java
Outdated
Show resolved
Hide resolved
...tor-api/src/main/java/org/apache/flink/kubernetes/operator/api/bluegreen/TransitionMode.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/kubernetes/operator/api/spec/FlinkDeploymentTemplateSpec.java
Outdated
Show resolved
Hide resolved
...ain/java/org/apache/flink/kubernetes/operator/api/status/FlinkBlueGreenDeploymentStatus.java
Outdated
Show resolved
Hide resolved
private long deploymentReadyTimestamp; | ||
|
||
/** Information about the TaskManagers for the scale subresource. */ | ||
private TaskManagerInfo taskManager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to support this? I feel that the scale logic using TM count is not really used feature for regular FlinkDeployments either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could remove it for now and only add it if there is a need
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
if (deploymentStatus == null) { | ||
deploymentStatus = new FlinkBlueGreenDeploymentStatus(); | ||
return patchStatusUpdateControl( | ||
bgDeployment, | ||
deploymentStatus, | ||
FlinkBlueGreenDeploymentState.INITIALIZING_BLUE, | ||
null) | ||
.rescheduleAfter(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BlueGreenDeployment could simply implement initStatus and this can be removed completely.
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
...java/org/apache/flink/kubernetes/operator/controller/FlinkBlueGreenDeploymentController.java
Outdated
Show resolved
Hide resolved
private boolean hasSpecChanged( | ||
FlinkBlueGreenDeploymentSpec newSpec, FlinkBlueGreenDeploymentStatus deploymentStatus) { | ||
|
||
String lastReconciledSpec = deploymentStatus.getLastReconciledSpec(); | ||
String newSpecSerialized = SpecUtils.serializeObject(newSpec, "spec"); | ||
|
||
// TODO: in FLIP-504 check here the TransitionMode has not been changed | ||
|
||
return !lastReconciledSpec.equals(newSpecSerialized); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to use the DiffBuilder/ existing utilities to get the type of the spec change here and make sure we don't trigger deployments on non-upgrade changes.
Things like operator config changes etc should not go through the Blue/Green flow I believe but simply applied on the currently active deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me get familiar with the DiffBuilder
and see how to incorporate it
boolean deleted = | ||
deletedStatus.size() == 1 | ||
&& deletedStatus.get(0).getKind().equals("FlinkDeployment"); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to wait here until the deletion is actually finished (the object is not there anymore?) that would require some additional logic here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not introduce waiting logic. I immediately reschedule a reconciliation afterwards which will re-execute this portion if the deployment is not yet gone.
TRANSITIONING_TO_BLUE, | ||
|
||
/** Identifies the system is transitioning from "Blue" to "Green". */ | ||
TRANSITIONING_TO_GREEN, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what state are we in during shutdown?
...tes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/utils/SpecUtils.java
Outdated
Show resolved
Hide resolved
return objectMapper.writeValueAsString(wrapper); | ||
} catch (JsonProcessingException e) { | ||
throw new RuntimeException( | ||
"Could not serialize " + wrapperKey + ", this indicates a bug...", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should be include object.toString() in the error also?
@@ -101,6 +103,7 @@ rules: | |||
- apiGroups: | |||
- flink.apache.org | |||
resources: | |||
- flinkbluegreendeployments/status | |||
- flinkdeployments/status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this maybe a basic question. Are flinkbluegreendeployments a type of flinkdeployments. Would we expect to see them as flinkdeployments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the original idea but we shifted to a has-is
relationship rather than a is-a
one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify this answer @davidradl . For every FlinkBlueGreenDeployment, there will also be 0-2 FlinkDeployment which are the currently running deployments.
Hi @schongloo @davidradl @gyfora , I hope you're doing well. I've been following this PR as blue-green deployments would solve a critical use case for our production needs. I noticed there have been no updates to this PR for a few weeks. Was wondering if there is anything blocking the release of this and if there is something myself or the OSS community could help with (testing, fixing any changes, etc). Thanks for all the work so far! 🙏 |
@maheepm-lyft , @schongloo is currently on holiday but once he is back we can try to merge this as soon as reasonably possible :) In the meantime you can help us by trying out the current version and by providing a code review for the PR, that can definitely speed things up. |
@schongloo can you please re-open this PR against the |
What is the purpose of the change
This pull request adds basic support for Blue/Green deployments as outlined by FLIP-503 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677648).
Brief change log
FlinkBlueGreenDeploymentController
controller, with the capability of managing the lifecycle of these deployments.FlinkBlueGreenDeploymentController
will manage this CRD and hide from the user the details of the actual Blue/Green (Active/StandBy) jobs.FlinkDeployment
controller.Verifying this change
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors
: noDocumentation
docs/content/docs