-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: add Fleet ID to error logs for CloudTrail correlation #8547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for karpenter-docs-prod canceled.
|
|
Preview deployment ready! Preview URL: https://pr-8547.d18coufmbnnaag.amplifyapp.com Built from commit |
|
@youwalther65 Thank you for pointing that out! You're absolutely right. I've already addressed the - c.unavailableOfferingsCache.MarkUnavailable(ctx, string(msg.Kind()), ec2types.InstanceType(instanceType), zone, karpv1.CapacityTypeSpot)
+ c.unavailableOfferingsCache.MarkUnavailable(ctx, string(msg.Kind()), ec2types.InstanceType(instanceType), zone, karpv1.CapacityTypeSpot, "") |
|
@youwalther65, @sumukha-radhakrishna Fleet ID is only meaningful for What do you think about this approach? Do you have any better ideas or suggestions on how we could handle this more elegantly? |
@DerekFrank What's your opinion on that? |
|
@youwalther65 Thank you for the suggestion! You're absolutely right. I've refactored the implementation to only log Fleet ID for This preserves the CloudTrail correlation capability while keeping the design cleaner and more maintainable. Please let me know if this aligns with what you had in mind! |
|
Thx @moko-poi Let's hear what @DerekFrank thinks about the different approaches, as he is one of the maintainers. |
|
Hey folks! I think we have two goals
I see two solutions:
I think I prefer the latter. What do you guys think? |
|
@DerekFrank Great suggestions, thx. |
DerekFrank
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment in conversation
|
@DerekFrank @youwalther65 |
|
Implemented option 2 as recommended
Ready for re-review! |
pkg/cache/unavailableofferings.go
Outdated
|
|
||
| // MarkUnavailable communicates recently observed temporary capacity shortages in the provided offerings | ||
| func (u *UnavailableOfferings) MarkUnavailable(ctx context.Context, unavailableReason string, instanceType ec2types.InstanceType, zone, capacityType string) { | ||
| func (u *UnavailableOfferings) MarkUnavailable(ctx context.Context, instanceType ec2types.InstanceType, zone, capacityType string, extraContext map[string]interface{}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a map[string]interface? Can it be a map[string][string]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep the name unavailableReason and just make it a map[string][string]? extraContext don't tell that much
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b088159
Thanks for the feedback! I've addressed all the points you raised
Changed extraContext map[string]interface{} to unavailableReason map[string]string as suggested.
pkg/cache/unavailableofferings.go
Outdated
|
|
||
| // Add extra context if provided and not empty | ||
| for k, v := range extraContext { | ||
| if v != nil && v != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see where a nil value can get in, is this just in case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @DerekFrank: We call it for Spot interruptions and InsufficientOnstnaceCapacity, always non-empty reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6462ccd
You're absolutely right! I've removed both the nil check and the empty string check.
| } | ||
| } | ||
|
|
||
| log.FromContext(ctx).WithValues(logValues...).V(1).Info("removing offering from offerings") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't need the nil check, I think you might be able to make this section cleaner by doing something like:
log.FromContext(ctx).WithValues(
"instance-type", instanceType,
"zone", zone,
"capacity-type", capacityType,
"ttl", UnavailableOfferingsTTL,
extraContext...).V(1).Info("removing offering from offerings")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DerekFrank
Thanks for the suggestion! I really like the cleaner approach you're proposing. However, there's a Go language limitation here that prevents us from using the spread operator directly with maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whats the limitation? I'm sure there's a cleaner way to do this.
You might need to change logValues to a []string{}, but I'm very surprised this doesn't work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DerekFrank I tried to do other approaches as well but those failed. Got errors like {"level":"DPANIC",..."logger":"controller","caller":"cache/unavailableofferings.go:93","message":"odd number of arguments passed as key-value pairs for logging" because of the way log.FromContext(ctx).WithValuesexpects key-value pairs.
I feel the approach from @moko-poi is valid.
Using a Fault Injection Simulator (FIS) experiment to simulate Spot interruption and a local, patched Karpenter controller with make run, I got the expected log line:
{"level":"DEBUG","time":"2025-10-24T09:10:37.127+0200","logger":"controller","caller":"cache/unavailableofferings.go:105","message":"removing offering from **offerings","controller":"interruption","namespace":"","name":"","reconcileID":"ae3b097e-b068-4d16-bb85-6a8857b5b675","queue":"karpenter-demo","messageKind":"spot_interrupted","NodeClaim":{"name":"default-psv6m"},"action":"CordonAndDrain","Node":{"name":"i-0d4<redacted>"},"instance-type":"c5a.xlarge","zone":"eu-west-1a","capacity-type":"spot","ttl":"3m0s","reason":"spot_interrupted"}
But there is one drawback with this approach though: For InsufficientInstanceCapacity the map contains two key (reason and fleet-id) and because it's a map the range loop might insert these keys in different order, which isn't nice for log parsing (one time the reason might come first, another time the fleet-id).
So we should insert the values in a predictable order. Something like:
// Add "reason" and "fleet-id" if provided
unavailableKeys := []string{"reason", "fleet-id"}
for _, key := range unavailableKeys {
_, ok := unavailableReason[key]
if ok {
logValues = append(logValues, key, unavailableReason[key])
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we just sort the keys prior to logging. Something like:
keys := lo.Keys(logValues)
slices.Sort(keys)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DerekFrank But then these logs will have a different order compared to other logs.
I'd rather prefer to keep the original order, which is "instance-type", "zone", "capacity-type", "ttl", "reason" and now in addition "fleet-id" in case of InsufficientInstance Capacity.
Or we go back to oneof the initial ideas and move the log.FromContext(ctx).WithValuecompletely out of MarkUnavailable function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new consolidated PR Alternate approach to solve #8482 based on ideas from moko-poi in PR #8547 #8684.
@DerekFrank @moko-poi Please take a look, thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep the order of log entries for backwards compatibility we could use:
logValues := []interface{}{
"reason", unavailableReason["reason"],
"instance-type", instanceType,
"zone", zone,
"capacity-type", capacityType,
"ttl", UnavailableOfferingsTTL,
}
// Add fleetID if provided
key := "fleet-id"
_, ok := unavailableReason[key]
if ok {
logValues = append(logValues, key, unavailableReason[key])
}
log.FromContext(ctx).WithValues(logValues...).V(1).Info("removing offering from offerings")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! That makes sense. Since WithValues expects even key/value pairs and Go doesn’t allow spreading a map, this version keeps the log stable and avoids DPANICs.
Let’s go with the fixed-order slice approach as in #8684 — it keeps the output consistent and backward compatible.
pkg/cache/unavailableofferings.go
Outdated
| instanceType := fleetErr.LaunchTemplateAndOverrides.Overrides.InstanceType | ||
| zone := aws.ToString(fleetErr.LaunchTemplateAndOverrides.Overrides.AvailabilityZone) | ||
| u.MarkUnavailable(ctx, lo.FromPtr(fleetErr.ErrorCode), instanceType, zone, capacityType) | ||
| extraContext := map[string]interface{}{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Could this be inlined into the call below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e45c7d8
Done! Inlined the map into the call.
pkg/cache/unavailableofferings.go
Outdated
| u.offeringCacheSeqNumMu.Unlock() | ||
| } | ||
|
|
||
| func (u *UnavailableOfferings) MarkUnavailableForFleetErr(ctx context.Context, fleetErr ec2types.CreateFleetError, capacityType string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c845300
You're absolutely right - this function wasn't needed.
pkg/providers/instance/instance.go
Outdated
| return ec2types.CreateFleetInstance{}, cloudprovider.NewCreateError(fmt.Errorf("creating fleet request, %w", err), reason, fmt.Sprintf("Error creating fleet request: %s", message)) | ||
| } | ||
| p.updateUnavailableOfferingsCache(ctx, createFleetOutput.Errors, capacityType, nodeClaim, instanceTypes) | ||
| fleetID := aws.ToString(createFleetOutput.FleetId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I prefer to inline these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! You're right, that variable was only used once.
|
Thanks for the work on this, I think we're very close to a fantastic solution! |
|
@DerekFrank @youwalther65 The code is now more direct, type-safe, and easier to understand. Thanks for pushing for these improvements - the end result is much cleaner! 🚀 |
Thank you @moko-poi ! |
|
5bffe50 |
Fixes #8482
Description
This change adds Fleet ID to error logs in the instance provider to enable easier correlation with AWS CloudTrail events during troubleshooting. When EC2 Fleet creation fails or instances become unavailable, the Fleet ID is now included in the log output alongside existing fields like instance-type, zone, and capacity-type.
Changes:
fleetIDparameter toMarkUnavailableandMarkUnavailableForFleetErrfunctionsfleet-idfield for unavailable offeringsThis enhancement improves operational visibility by allowing engineers to quickly cross-reference Karpenter error logs with corresponding AWS CloudTrail events, significantly reducing troubleshooting time for capacity-related issues.
How was this change tested?
pkg/cacheandpkg/providers/instance/filterDoes this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.