CloudFormation signaling #1581
The problem is even bigger if you use an ASG CreationPolicy: with no way to signal CloudFormation (usually done with cfn-signal), you need to either remove the policy or set MinSuccessfulInstancesPercent to zero.
Just one implementation idea:
Note
|
Thank you @gabegorelick for bringing this use case to our attention and @mello7tre for providing a design! We are taking a look at this (both the use case and proposal). |
Just one note: To signal CloudFormation we need to read We can acquire that information by looking at instance tags:
but to do this, instances need to have the IAM permission: and we cannot presume it.
Details
A CloudFormation AutoScalingGroup resource [ASG] can use two policies:
The first is used during a The second policy is used in two different cases:
The first uses the property The second uses the property
Just one failure signal is sufficient to consider the creation as failed and default
Recap
Rolling/Replacement Update of an AutoScalingGroup using an Update/Creation Policy.
If an instance does not signal success within the timeout, CloudFormation considers the instance a failure (the only problem should be that it needs to wait longer to know this).
Creation of an AutoScalingGroup using a Creation Policy
In both cases we can have the problem described by @gabegorelick :
as we have no way to signal failure. The only solution is to always set |
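The tag-based lookup mentioned at the top of this comment could look roughly like the Python sketch below. It assumes the instance carries the standard CloudFormation tags aws:cloudformation:stack-name and aws:cloudformation:logical-id, that boto3 is available (for example inside a host container), and that the instance role allows ec2:DescribeTags. The helper names are hypothetical and are not part of any Bottlerocket code.

```python
# Assumption: these are the tags CloudFormation propagates to ASG instances.
CFN_STACK_TAG = "aws:cloudformation:stack-name"
CFN_LOGICAL_ID_TAG = "aws:cloudformation:logical-id"


def stack_info_from_tags(tags):
    """Pick the stack name and logical resource ID out of EC2 tag records.

    `tags` is a list of {"Key": ..., "Value": ...} dicts, the shape returned
    by the EC2 DescribeTags API. Returns None if either tag is missing.
    """
    by_key = {tag["Key"]: tag["Value"] for tag in tags}
    try:
        return by_key[CFN_STACK_TAG], by_key[CFN_LOGICAL_ID_TAG]
    except KeyError:
        return None


def fetch_instance_tags(instance_id, region):
    """Read the instance's tags; needs ec2:DescribeTags on the instance role."""
    import boto3  # imported lazily so the parsing helper has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_tags(
        Filters=[{"Name": "resource-id", "Values": [instance_id]}]
    )
    return response["Tags"]
```

If the role lacks the permission, fetch_instance_tags raises an error from boto3, which is exactly the case the comment says we cannot rule out.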
Sure, but in practice what tends to happen is that if you messed up your instance configuration such that it can't call the CFN API (bad IAM permissions, not passing the correct parameters to SignalResource, etc), you'll just time out and CloudFormation will consider it failed. That seems like acceptable behavior for the malformed user data case.
AFAIK, no Amazon Linux instances do this. They all expect you to pass in the stack and resource names. See https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87, for example. I would expect Bottlerocket to behave similarly, and not do any fancy introspection to determine this info.
I'm not sure I understand your point, but IIRC In any event, this seems like a core CFN question and not specific to Bottlerocket. |
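For reference, passing the stack and resource names in explicitly and calling SignalResource could be sketched in Python as follows. This assumes boto3 is available and the instance role allows cloudformation:SignalResource. build_signal_request and send_signal are hypothetical helper names; signal_resource and its four parameters are the boto3 binding of the SignalResource API.

```python
def build_signal_request(stack_name, logical_resource_id, instance_id, success):
    """Build the keyword arguments for CloudFormation's SignalResource call."""
    return {
        "StackName": stack_name,
        "LogicalResourceId": logical_resource_id,
        # For an ASG, the instance ID is the usual unique ID per signal.
        "UniqueId": instance_id,
        "Status": "SUCCESS" if success else "FAILURE",
    }


def send_signal(request, region):
    """Send the signal; raises a boto3 error if permissions or names are wrong."""
    import boto3  # imported lazily so build_signal_request has no dependencies

    cfn = boto3.client("cloudformation", region_name=region)
    cfn.signal_resource(**request)
```

If send_signal fails because of bad permissions or wrong names, nothing reaches CloudFormation and the resource eventually times out, which is the behavior described above.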
Totally agree with you. Maybe I explained it badly, but my final words regarding the needed permissions:
were just saying: we cannot use this solution because it needs extra permissions.
Reading AWS documentation seems that
Partially. |
I ran some tests; my assumptions were wrong, but.. Whole But: In detail: When creating an ASG, CloudFormation waits until it receives Ex. (
Recap: The only difference between signaling a FAILURE and not signaling at all is the time needed to wait. So the Bottlerocket signaling implementation should focus on signaling success if all goes well, so that ASG creation can complete. |
Thank you @gabegorelick for opening this issue and @mello7tre for providing a design! We’re really glad to see so much excitement around this enhancement. @mello7tre’s design looks fairly straightforward to me. I can understand concern around failing quickly rather than waiting for a timeout, especially in the case of configuration applied via bootstrap containers rather than settings. One possible way for a bootstrap container to indicate that it has failed to complete configuration of the host (for example, formatting and mounting block devices) might be to add an additional Would either of you be interested in contributing to Bottlerocket and implementing this feature? We’d be happy to assist if you run into any roadblocks with it. |
Thanks @samuelkarp for the offer. Before I begin applying my implementation ideas I need to build a vanilla Bottlerocket; this has been on my todo list for some time, but I have had no time to do it. Second, as I said, I think the best approach would be to write a small Rust program for signaling. We need just a few lines of code, but the problem is that the Rust SDK does not currently support getting credentials from the instance role. I will continue to follow this issue, and if I can find some time I will begin building Bottlerocket and experimenting. |
Thanks for letting us know! We'll update this issue when we're able to start work on it, but in the meantime if anyone is interested in contributing here please let us know. |
Just an update: But I am not a Rust programmer; I am a cloud architect and devops. I had to use rusoto in place of the official alpha aws-rust-sdk because of 2 problems:
I am still doing some tests to check when signals are sent and to find out when during the boot process we are able to send a FAILURE signal. |
cfsignal needs a configured toml file, so it depends on settings-applier.service. It is able to send a failure signal for any other service starting from (included): I am ready to open a PR. |
Quick update from the SDK: this will go out in v0.0.19-alpha either late this week or early next. |
Thanks, i think you are talking about point 1 of:
Unfortunately the issue reported in point 2 has not been fixed, and judging by the comment in rust-lang/rust#82151 a solution is not on the way... As soon as point 2 is fixed, I will change the current cfsignal code (which still needs to be approved/merged) to use |
Really nice work on this, good to see some forward momentum! Just curious what's left before this is usable? |
Does this include sending a failure signal if we couldn't successfully join a Kubernetes or ECS cluster? That's the main thing I'm looking for in this feature. |
It should. In the past i have done some test making |
Do those services reliably fail when they can't join a cluster, or do they retry indefinitely? |
dunno about kube. |
Another use case: waiting for instances to register and be healthy with a load balancer. Similar to what's mentioned in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html:
|
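A rough Python sketch of this use case: poll the load balancer until the instance reports healthy, and only then send the success signal. It assumes an ELBv2 target group with instance targets, boto3 available on the instance, and a role that allows elasticloadbalancing:DescribeTargetHealth. All function names here are hypothetical; describe_target_health is the real boto3 call.

```python
import time


def is_target_healthy(response):
    """True if the DescribeTargetHealth response lists only healthy targets."""
    descriptions = response.get("TargetHealthDescriptions", [])
    return bool(descriptions) and all(
        d["TargetHealth"]["State"] == "healthy" for d in descriptions
    )


def wait_until(check, timeout_s, interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s seconds have passed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def instance_is_healthy(target_group_arn, instance_id, region):
    """Ask ELBv2 for the health of this one instance in the target group."""
    import boto3  # imported lazily so the helpers above have no dependencies

    elbv2 = boto3.client("elbv2", region_name=region)
    response = elbv2.describe_target_health(
        TargetGroupArn=target_group_arn, Targets=[{"Id": instance_id}]
    )
    return is_target_healthy(response)
```

The timeout passed to wait_until would normally be kept shorter than the CreationPolicy timeout, so that a FAILURE signal can still be sent before CloudFormation gives up on its own.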
Yes, but usually the instances of an ECS cluster do not register directly with a LB; it is the ECS services running on them that do this. |
True, that is the standard setup. In theory you could use instance targets without ECS managing it, although I don't know if anyone ever does that (I certainly have never). But for Kubernetes, it's definitely reasonable to have instance targets that directly register with the LB, e.g. to expose a NodePort service. |
Maybe in the future, if the PR is ever merged, a setting could be added to manually specify a |
For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working. |
One small hiccup: I think Bottlerocket restarts enabled host containers indefinitely, which is not what I want. |
One thing I believe you can do is to set bottlerocket/sources/host-ctr/cmd/host-ctr/main.go Lines 678 to 690 in 9749af9
So within your custom host container, you can run something like |
Would that disable the host container for all future instances, or just the enclosing instance? I need future instance rollouts to still run that host container. |
It would just be for that enclosing instance. Each instance has their own set of settings they configure and use. |
Created a new Rust program, cfsignal, to send a signal to a CloudFormation stack.

The program is a sort of cfn-signal https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html but, as cfn-signal needs Python, it cannot be used by Bottlerocket.

cfsignal reads its configuration from a cfsignal.toml file generated from user data, so it depends on settings-applier.service. It cannot send a signal for a failure happening before settings-applier.service and network-online.target are started. It is able to send a failure signal for any other service starting from (included): activate-multi-user.service

It uses the systemctl action is-system-running with the --wait option. This way we can know if any service, after the systemd boot process has finished, is in a failed state.

Requested changes:
* removed author
* signal parameter renamed to should_signal (more specific than should_send)
* added README.md
* removed commented out lines
* use imdsclient in place of ec2_instance_metadata
* refactor service_check.rs and renamed to system_check.rs

use weak dependency (WantedBy) for cfsignal.service
use tokio LTS, only with needed features
restart command
some code refactor
* use directly signal_resource as function
* code simplification in system_check.rs
* use standard boilerplate for main function

semaphore file and migration
* Use semaphore file to only run on first boot
* Add migration file for downgrading
* client.signal_resource collapsed
* Fix to packages/os/os.spec: toml file is not copied (introduced during rebase)

Readme changes
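The boot check described in this commit message can be illustrated with a short Python sketch. This is not the cfsignal source, which is written in Rust; it only mirrors the idea of waiting on systemctl is-system-running --wait and turning the result into a signal status. The function names are hypothetical, and treating every state other than "running" as a failure is an assumption made for this sketch.

```python
import subprocess


def status_from_system_state(state):
    """Map systemd's overall state to a CloudFormation signal status.

    Assumption: anything other than "running" (for example "degraded")
    counts as a failed boot.
    """
    return "SUCCESS" if state.strip() == "running" else "FAILURE"


def read_system_state():
    """Block until boot finishes, then return systemd's overall state."""
    result = subprocess.run(
        ["systemctl", "is-system-running", "--wait"],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit (e.g. "degraded") is expected, not an error
    )
    return result.stdout
```

Calling status_from_system_state(read_system_state()) gives the Status value to send. Because the check runs after boot has settled, it cannot report failures in units that start before the signaling service itself can run, which matches the limitation noted above.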
CloudFormation signal program (issue #1581)
What I'd like:
It would be nice if there was an easy way to call CloudFormation's `SignalResource` when booting a Bottlerocket instance. This is typically considered a best practice when creating an ASG in CloudFormation so that it can roll back to an earlier LaunchTemplate or LaunchConfig if the instances don't come online.

See, for example, the ECS CloudFormation reference architecture, which uses the `cfn-signal` CLI: https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87

In Bottlerocket's case, a typical boot issue I've encountered is passing malformed user data. In such a case, Bottlerocket's `early-boot-config.service` will fail. But if you don't signal CloudFormation, CloudFormation will still consider the deploy a success, potentially leaving you with no working instances.

Any alternatives you've considered:
Running `cfn-signal` in a bootstrap container would probably work. But it's not clear to me that bootstrap containers run late enough in the boot sequence to verify that all services are up.