CloudFormation signaling #1581

gabegorelick · 2021-05-18T19:35:12Z

What I'd like:
It would be nice if there was an easy way to call CloudFormation's SignalResource when booting a Bottlerocket instance. This is typically considered a best practice when creating an ASG in CloudFormation so that it can roll back to an earlier LaunchTemplate or LaunchConfig if the instances don't come online.

See, for example, the ECS CloudFormation reference architecture, which uses the cfn-signal CLI: https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87

In Bottlerocket's case, a typical boot issue I've encountered is passing malformed user data. In such a case, Bottlerocket's early-boot-config.service will fail. But if you don't signal CloudFormation, CloudFormation will still consider the deploy a success, potentially leaving you with no working instances.

Any alternatives you've considered:

Running cfn-signal in a bootstrap container would probably work. But it's not clear to me that bootstrap containers run late enough in the boot sequence to verify that all services are up.


mello7tre · 2021-07-19T13:57:26Z

Problem is even bigger if you use ASG CreationPolicy:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html

With no way to signal CF (which you would usually do with cfn-signal), you need to either remove the policy or set MinSuccessfulInstancesPercent to zero.

mello7tre · 2021-08-01T17:17:22Z

Just one implementation idea:

Create a Rust program for signaling CF (this has to wait until the Rust SDK implements instance role credentials; see "Instance metadata credentials support", smithy-lang/smithy-rs#466).
- The program should be configurable through environment variables.
Create a new settings section, like:

[settings.aws.cloudformation]
"signal" = true/false
"stack-name" = ""
"logical-resource-id" = ""

Create a service, cf-signal.service:

[Unit]
Description=Send signal to CloudFormation Stack
Wants=network-online.target
After=multi-user.target

[Service]
Type=simple
RemainAfterExit=true
EnvironmentFile=/etc/cf-signal.env
ExecStart=/bin/sh -c "STATUS=$(/usr/bin/systemctl --wait is-system-running) /usr/bin/cf-signal"

[Install]
WantedBy=multi-user.target

The environment file, cf-signal.env, should look like:

SIGNAL={{settings.aws.cloudformation.signal}}
STACK_NAME={{settings.aws.cloudformation.stack-name}}
LOGICAL_RESOURCE_ID={{settings.aws.cloudformation.logical-resource-id}}

Change release.spec to add an installation section for cf-signal.env.

Note

The systemctl --wait option ensures that execution is delayed until the boot process is complete.
cf-signal needs to use the STATUS variable to know whether boot has been successful (see man systemctl for details):
- running = success
- any other state = failure
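
The status rule in the note above can be sketched as a small function. This is only an illustration in Python, not the proposed Rust program: the names `signal_status` and `wait_for_boot` are hypothetical, and the only external command assumed is `systemctl is-system-running --wait`, which prints the overall system state once boot has finished.

```python
import subprocess


def signal_status(system_state: str) -> str:
    """Map the state printed by `systemctl is-system-running` to a signal value."""
    # Per the note above: "running" means success, any other state is a failure.
    return "SUCCESS" if system_state.strip() == "running" else "FAILURE"


def wait_for_boot() -> str:
    """Block until boot has finished, then return the reported system state."""
    # systemctl exits non-zero for states other than "running",
    # so the exit code is deliberately not checked here.
    result = subprocess.run(
        ["systemctl", "is-system-running", "--wait"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()
```

On an instance, `signal_status(wait_for_boot())` would give the value to pass to SignalResource; actually sending the signal is left out of this sketch.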

webern · 2021-08-02T22:58:32Z

Thank you @gabegorelick for bringing this use case to our attention and @mello7tre for providing a design! We are taking a look at this (both the use case and proposal).

mello7tre · 2021-08-03T13:24:17Z

Just one note:
Regarding @gabegorelick's specific issue of malformed user-data, the proposed solution cannot work, and there is probably no solution at all. It's a chicken-and-egg problem.

To signal CloudFormation we need to read user-data to know StackName and LogicalResourceId.

We could get that information by looking at the instance tags:

aws:cloudformation:logical-id
aws:cloudformation:stack-name

but to do this the instance needs to have the IAM permission:
ec2:DescribeTags

and we cannot presume it.

Details

A CloudFormation AutoScalingGroup (ASG) resource can use two policies:

UpdatePolicy
CreationPolicy

The first is used during a RollingUpdate, where ASG instances are replaced with updated ones.
Its MinSuccessfulInstancesPercent property specifies the percentage of instances that must signal success for the update to be considered successful.
If an instance does not signal success within the configured time period, that is treated as a failure signal.

The second Policy is used in two different cases:

Creation of a new resource by a replacement update.
Creation of a new resource.

The first case uses the MinSuccessfulInstancesPercent property, as UpdatePolicy does.

The second case uses the Count property to specify the number of success signals that must be received for the resource creation to be considered successful.
But according to the AWS documentation:

If the resource receives a failure signal or doesn't receive the specified number of signals before the timeout period expires, the resource creation fails and CloudFormation rolls the stack back.

So a single failure signal is enough for the creation to be considered failed, and the default Count value is 1.

Recap

Rolling/Replacement Update of an AutoScalingGroup using an Update/Creation Policy.

If an instance does not signal success within the timeout, CloudFormation considers the instance a failure (the only problem is that it needs to wait longer to find this out).

Creation of an AutoScalingGroup using a Creation Policy

If the Count property is 0 and we have malformed user-data.
If the Count property is lower than the ASG DesiredCapacity and a transient problem on just one instance prevents proper creation of the file with the information needed by the signal program, or multi-user.target is not activated.

In both cases we can have the problem described by @gabegorelick:

CloudFormation will still consider the deploy a success

as we have no way to signal failure.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.
This way CreationPolicy uses the same logic as the other policies (assuming MinSuccessfulInstancesPercent equal to 100).

gabegorelick · 2021-08-05T14:57:42Z

To signal CloudFormation we need to read user-data to know StackName and LogicalResourceId

Sure, but in practice what tends to happen is that if you messed up your instance configuration such that it can't call the CFN API (bad IAM permissions, not passing the correct parameters to SignalResource, etc), you'll just timeout and it will consider it failed. That seems like acceptable behavior for the malformed user data case.

We could get that information by looking at the instance tags

AFAIK, no Amazon Linux instances do this. They all expect you to pass in the stack and resource names. See https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87, for example. I would expect Bottlerocket to behave similarly, and not do any fancy introspection to determine this info.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.

I'm not sure I understand your point, but IIRC Count is the number of signals each instance must send before it's marked as successfully created. If user data is malformed, you'll get 0 signals and timeout. Whether CFN considers the ASG creation to be a failure at that point depends on MinSuccessfulInstancesPercent.

In any event, this seems like a core CFN question and not specific to Bottlerocket.

mello7tre · 2021-08-05T15:26:29Z

AFAIK, no Amazon Linux instances do this. They all expect you to pass in the stack and resource names. See https://github.com/aws-samples/ecs-refarch-cloudformation/blob/a257e226b33bd9d2a721e5afd9d7e8b66dbacfdc/infrastructure/ecs-cluster.yaml#L87, for example. I would expect Bottlerocket to behave similarly, and not do any fancy introspection to determine this info.

I totally agree with you. Maybe I explained it badly, but my final words regarding the needed permissions:

and we cannot presume it.

were just saying: we cannot use this solution because it needs extra permissions.

The only solution is to always set the CreationPolicy Count property equal to the ASG DesiredCapacity.

I'm not sure I understand your point, but IIRC Count is the number of signals each instance must send before it's marked as successfully created. If user data is malformed, you'll get 0 signals and timeout. Whether CFN considers the ASG creation to be a failure at that point depends on MinSuccessfulInstancesPercent.

Reading the AWS documentation, it seems that MinSuccessfulInstancesPercent is used only for an Auto Scaling replacement update when a WillReplace policy is used,
and not for the first creation of an ASG.
But it's not clear whether Count is used too (the only way to know is by experiment).
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html

In any event, this seems like a core CFN question and not specific to Bottlerocket.

Partially.
I think that if a solution is adopted, it should be clear which events it can cover.
That should also help in choosing when and how to trigger the cf-signal service.
I put it in multi-user.target, but it would probably be better to insert it earlier in the boot chain, right after the user-data is processed and the environment file is created, so that it can cover more failure events.

mello7tre · 2021-08-08T07:47:10Z

I ran some tests. My assumptions were wrong, but
the AWS documentation is misleading and partially wrong too.

The whole CreationPolicy is used for both new ASG creation and ASG update by replacement (which, after all, is just a new ASG creation followed by deletion of the old one).

But:
Count represents the number of signals that need to be received, whether success or failure (not only success).
Creation does not automatically fail because of a single FAILURE signal!
Lack of a success signal within Timeout is considered a FAILURE.

In detail:

When creating an ASG, CloudFormation waits until it receives Count signals (success or failure) or until the Timeout expires.
Once that happens, it processes the received signals, taking MinSuccessfulInstancesPercent into account, and decides whether creation has been successful.

Example (Timeout = 20m)

Count = 2 and MinSuccessfulInstancesPercent = 100
- We send 2 SUCCESS signals: no wait, and creation is successful.
- We send only 1 SUCCESS signal: we need to wait 20 min, and creation fails.
- We send 1 SUCCESS and 1 FAILURE: no wait, and creation fails.
Count = 2 and MinSuccessfulInstancesPercent = 50
- We send 2 SUCCESS signals: no wait, and creation is successful.
- We send only 1 SUCCESS signal: we need to wait 20 min, but creation is successful.
- We send 1 SUCCESS and 1 FAILURE: no wait, and creation is successful.
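
The rule behind these examples can be written down as a short function. This is a sketch of the behavior observed in the tests above, not documented CloudFormation semantics; `creation_outcome` and its parameter names are hypothetical, and it assumes Count is at least 1.

```python
def creation_outcome(count, min_successful_percent, successes, failures,
                     timeout_minutes=20):
    """Return (succeeded, minutes_waited) under the behavior observed above."""
    received = successes + failures
    # CloudFormation stops waiting once Count signals have arrived;
    # otherwise it waits until the Timeout expires.
    minutes_waited = 0 if received >= count else timeout_minutes
    # Signals that never arrived are treated like FAILURE signals,
    # so the success percentage is computed against Count.
    success_percent = 100 * successes / count
    return success_percent >= min_successful_percent, minutes_waited
```

For instance, `creation_outcome(2, 100, successes=1, failures=1)` gives `(False, 0)` and `creation_outcome(2, 50, successes=1, failures=0)` gives `(True, 20)`, matching the third and fifth bullets.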

Recap:

The only difference between signaling a FAILURE and not signaling at all is the time spent waiting.

So the Bottlerocket signaling implementation should focus on signaling success if all goes well, so that ASG creation can complete.
If we can cover all failure events too, so much the better, but if not (e.g. malformed user-data), the only difference is that ASG creation will fail after the Timeout expires.

samuelkarp · 2021-08-10T23:58:31Z

Thank you @gabegorelick for opening this issue and @mello7tre for providing a design! We’re really glad to see so much excitement around this enhancement.

@mello7tre’s design looks fairly straightforward to me. I can understand concern around failing quickly rather than waiting for a timeout, especially in the case of configuration applied via bootstrap containers rather than settings. One possible way for a bootstrap container to indicate that it has failed to complete configuration of the host (for example, formatting and mounting block devices) might be to add an additional settings.aws.cloudformation.success or settings.aws.cloudformation.signal-value setting. If the bootstrap container failed, it could flip this setting to false (or failure) to indicate that the FAILURE signal should be sent rather than the SUCCESS signal.

Would either of you be interested in contributing to Bottlerocket and implementing this feature? We’d be happy to assist if you run into any roadblocks with it.

mello7tre · 2021-08-12T07:46:21Z

Thanks @samuelkarp for the offer.
But at the moment I have some personal family problems that take up all my spare time.

Before I begin applying my implementation ideas I need to build a vanilla Bottlerocket; this has been on my todo list for some time, but I have had no time to do it.

Second, as I said, I think the best approach is to write a small Rust program for signaling. We need just a few lines of code, but the problem is that the Rust SDK does not currently support getting credentials from the instance role.

I will continue to follow this issue, and if I can find some time I will begin building Bottlerocket and experimenting.
But, at the moment, I am not the right person for quick development.

samuelkarp · 2021-08-12T23:02:37Z

Thanks for letting us know! We'll update this issue when we're able to start work on it, but in the meantime if anyone is interested in contributing here please let us know.

mello7tre · 2021-08-31T15:08:34Z

Just an update:
I had some time to build Bottlerocket and begin experimenting.
Looking at metricdog, I saw that you already execute systemctl commands, so the best approach is to call systemctl --wait is-system-running directly inside the cfsignal program.
I also removed the systemd unit's environment file and now use a cfsignal.toml that is configured by reading the user-data.

But I am not a Rust programmer; I am a cloud architect and devops.
At the moment I have a very basic running program "inspired" by the metricdog code, but when I open a PR a Rust expert should take a look at it and make the needed changes (and have pity on the code I wrote).

I had to use rusoto in place of the official alpha aws-rust-sdk because of 2 problems:

aws-rust-sdk does not yet support getting credentials from the instance role (metadata endpoint).
aws-rust-sdk needs hyper versions >= 0.14.3, and this conflicts with the Bottlerocket issue "Build does not work with hyper versions >= 0.14.3" #1471.

I am still running some tests to check when signals are sent and to find out at what point in the boot process we are able to send a FAILURE signal.
I will update you on this.

mello7tre · 2021-09-01T12:16:54Z

cfsignal needs a configured toml file, so it depends on settings-applier.service.
It cannot send a signal for a failure that happens before settings-applier.service and network-online.target have started.

It is able to send a failure signal for any other service starting from (included):
activate-multi-user.service

I am ready to open a PR.

rcoh · 2021-09-23T14:48:31Z

Quick update from the SDK: this will go out in v0.0.19-alpha either late this week or early next.

mello7tre · 2021-09-23T15:10:15Z

Thanks, I think you are talking about point 1 of:

aws-rust-sdk does not yet support getting credentials from the instance role (metadata endpoint).
aws-rust-sdk needs hyper versions >= 0.14.3, and this conflicts with the Bottlerocket issue "Build does not work with hyper versions >= 0.14.3" #1471.

Unfortunately the issue reported in point 2 has not been fixed, and judging by the comments on rust-lang/rust#82151 a solution is not on the way...

As soon as point 2 is fixed, I will change the current cfsignal code (which still needs to be approved/merged) to use aws-rust-sdk in place of rusoto.

jhuntwork · 2021-12-31T20:19:33Z

Really nice work on this, good to see some forward momentum! Just curious what's left before this is usable?

gabegorelick · 2022-02-01T23:04:18Z

It is able to send a failure signal for any other service starting from (included):
activate-multi-user.service

Does this include sending a failure signal if we couldn't successfully join a Kubernetes or ECS cluster? That's the main thing I'm looking for in this feature.

mello7tre · 2022-02-03T08:18:39Z

It should.
Both ECS and Kube are WantedBy multi-user.target and depend on configured.target.
The CfnSignal service is WantedBy preconfigured.target and depends on network-online.target and settings-applier.service.
If you look at:
https://github.com/bottlerocket-os/bottlerocket/tree/develop/sources/api
configured.target depends on settings-applier and represents the point at which the system is fully configured.
So in the worst case cfsignal and configured.target should be started at the same time.
But cfsignal should always be started before the services wanted by multi-user.target.

In the past I ran some tests making activate-multi-user.service fail, and cfsignal properly signaled to the ASG that the instance had failed.
So cfsignal signaling should work for every service started by systemd and wanted by multi-user.target.

gabegorelick · 2022-02-03T16:37:06Z

Both ECS and Kube are WantedBy multi-user.target and depend on configured.target

Do those services reliably fail when they can't join a cluster, or do they retry indefinitely?

mello7tre · 2022-02-04T09:04:03Z

I don't know about Kube.
For ECS I ran a test using a nonexistent cluster name in the ECS configuration, and the ECS service fails at the systemd level, so in that case cfsignal should work.
I do not think there is any other configuration that could make joining a cluster fail, apart from a nonexistent cluster...
(I tried putting nonexistent options in /etc/ecs/ecs.config and the ECS service/agent simply ignores them.)

gabegorelick · 2022-02-04T19:56:28Z

Another use case: waiting for instances to register and be healthy with a load balancer. Similar to what's mentioned in https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-creationpolicy.html:

To have instances wait for an Elastic Load Balancing health check before they signal success, add a health-check verification by using the cfn-init helper script. For an example, see the verify_instance_health command in the Auto Scaling rolling updates sample template.

mello7tre · 2022-02-07T08:36:59Z

Yes, but usually the instances of an ECS cluster do not register directly with a LB; it is the ECS services running on them that do this.

gabegorelick · 2022-02-07T15:43:05Z

Yes, but usually the instances of an ECS cluster do not register directly with a LB; it is the ECS services running on them that do this.

True, that is the standard setup. In theory you could use instance targets without ECS managing it, although I don't know if anyone ever does that (I certainly have never).

But for Kubernetes, it's definitely reasonable to have instance targets that directly register with the LB, e.g. to expose a NodePort service.

mello7tre · 2022-02-07T17:14:08Z

Maybe in the future, if the PR is ever merged, a setting could be added to manually specify a target-group-arn to query using elbv2 describe-target-health (I currently do this for other EC2 stacks using cfn-init).
(If I have understood your suggestion correctly...)
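
As a rough illustration of that idea, the check on the describe-target-health result might look like the sketch below. The function name `target_is_healthy` is hypothetical, and the dictionary layout (`TargetHealthDescriptions`, `Target.Id`, `TargetHealth.State`) is assumed to follow the ELBv2 DescribeTargetHealth response; fetching the response itself is left out.

```python
def target_is_healthy(response: dict, instance_id: str) -> bool:
    """Return True if instance_id is reported as healthy in the response."""
    for description in response.get("TargetHealthDescriptions", []):
        if description.get("Target", {}).get("Id") == instance_id:
            return description.get("TargetHealth", {}).get("State") == "healthy"
    # The instance is not registered with the target group (yet).
    return False
```

A signaling program could poll until this returns True, or until its own deadline passes, before sending SUCCESS.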

gabegorelick · 2022-02-07T19:21:34Z

Maybe in the future, if the PR is ever merged, a setting could be added to manually specify a target-group-arn to query using elbv2 describe-target-health (I currently do this for other EC2 stacks using cfn-init).

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

gabegorelick · 2022-02-08T17:58:08Z

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

One small hiccup: I think Bottlerocket restarts enabled host containers indefinitely, which is not what I want.

etungsten · 2022-02-09T17:44:20Z

For now, I've resorted to running a custom host container to accomplish this. So far it seems to be working.

One small hiccup: I think Bottlerocket restarts enabled host containers indefinitely, which is not what I want.

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.
All custom host containers mount in the Bottlerocket API socket and the apiclient, see:

bottlerocket/sources/host-ctr/cmd/host-ctr/main.go

Lines 678 to 690 in 9749af9

    
           // Mount in the API socket for the Bottlerocket API server, and the API 
        
           // client used to interact with it 
        
           { 
        
           	Options:     []string{"bind", "rw"}, 
        
           	Destination: "/run/api.sock", 
        
           	Source:      "/run/api.sock", 
        
           }, 
        
           // Mount in the apiclient to make API calls to the Bottlerocket API server 
        
           { 
        
           	Options:     []string{"bind", "ro"}, 
        
           	Destination: "/usr/local/bin/apiclient", 
        
           	Source:      "/usr/bin/apiclient", 
        
           },

So within your custom host container, you can run something like apiclient set settings.host-containers.<custom-host-container>.enabled=false to prevent the host-container from restarting again. Lemme know if that works for you!
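
For a host container whose entrypoint happens to be a Python script, that last step could be sketched as follows. This assumes Python is available in the custom image; the helper names are hypothetical, and the only command used is the `apiclient set` call quoted above.

```python
import subprocess


def disable_command(container_name: str) -> list[str]:
    """Build the apiclient call that disables the named host container."""
    setting = f"settings.host-containers.{container_name}.enabled=false"
    return ["apiclient", "set", setting]


def disable_self(container_name: str) -> None:
    """Run the command; raises CalledProcessError if apiclient reports an error."""
    subprocess.run(disable_command(container_name), check=True)
```

Calling `disable_self("my-signal-container")` as the final step (the container name here is a placeholder) should stop the host container from being restarted on that instance.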

gabegorelick · 2022-02-09T18:09:49Z

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.

Would that disable the host container for all future instances, or just the enclosing instance? I need future instance rollouts to still run that host container.

etungsten · 2022-02-09T18:15:27Z

One thing I believe you can do is to set settings.host-containers.<your-custom-container>.enabled to false from within your custom host container once it's done with its work.

Would that disable the host container for all future instances, or just the enclosing instance? I need future instance rollouts to still run that host container.

It would just be for that enclosing instance. Each instance has their own set of settings they configure and use.

Created a new rust program, cfsignal to send signal to CloudFormation Stack.
Program is a sort of cfn-signal https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html but as cfn-signal need python cannot be used by bottlerocket.

cfsignal read configuration from a cfsignal.toml file configured reading user-data, so it depends on settings-applier.service.
It cannot send a signal for a failure happening before settings-applier.service and network-online.target are started.

It is able to send a failure signal for any other service starting from (included):
activate-multi-user.service

It use systemctl action is-system-running with --wait option.
This way we can know if any service, after systemd boot process finished, is in a failure status.

Requested changes:
* removed author
* signal parameter renamed to should_signal (is more specific that should_send)
* added README.md
* removed commented out lines
* use imdsclient in place of ec2_instance_metadata
* refactor service_check.rs and renamed to system_check.rs

use weak dependency (WantedBy)for cfsignal.service

use tokio LTS, only with needed features

restart command

some code refactor
* use directly signal_resource as function
* code simplification in system_check.rs
* use standard boilerplate for main function

semaphore file and migration
* Use semaphore file to only run on first boot
* Add migration file for downgrading
* client.signal_resource collapsed
* Fix to packages/os/os.spec: toml file is not copyed (introduced during rebase)

Readme changes

CloudFormation signal program (issue #1581)

samuelkarp added status/needs-triage Pending triage or re-evaluation type/enhancement New feature or request labels May 18, 2021

jpculp added this to the oncall milestone Jul 8, 2021

jpculp added area/core Issues core to the OS (variant independent) priority/p1 status/research This issue is being researched and removed status/needs-triage Pending triage or re-evaluation labels Jul 8, 2021

jhaynes added status/notstarted and removed status/research This issue is being researched labels Jul 19, 2021

jhaynes modified the milestones: oncall, backlog Jul 19, 2021

samuelkarp modified the milestones: backlog, next Aug 5, 2021

mello7tre mentioned this issue Aug 31, 2021

prairiedog build fail caused by updated argh_derive version #1727

Closed

mello7tre mentioned this issue Sep 1, 2021

CloudFormation signal program (issue #1581) #1728

Merged

kdaula removed this from the next milestone Feb 4, 2022

etungsten added a commit that referenced this issue Mar 5, 2022

Merge pull request #1728 from mello7tre/feat/cloudformation-signal

92bdb47

CloudFormation signal program (issue #1581)

etungsten closed this as completed Mar 15, 2022

stockholmux removed priority/p1 labels Jun 10, 2022
