Conversation

@moko-poi

Fixes #8482

Description

This change adds Fleet ID to error logs in the instance provider to enable easier correlation with AWS CloudTrail events during troubleshooting. When EC2 Fleet creation fails or instances become unavailable, the Fleet ID is now included in the log output alongside existing fields like instance-type, zone, and capacity-type.

Changes:

  • Added fleetID parameter to MarkUnavailable and MarkUnavailableForFleetErr functions
  • Updated error logs to include fleet-id field for unavailable offerings
  • Modified function signatures in the unavailable offerings cache
  • Updated all test cases to accommodate the new function signatures

This enhancement improves operational visibility by allowing engineers to quickly cross-reference Karpenter error logs with corresponding AWS CloudTrail events, significantly reducing troubleshooting time for capacity-related issues.

How was this change tested?

  • Updated existing unit tests in pkg/cache and pkg/providers/instance/filter
  • Verified that all tests pass with the new function signatures
  • Confirmed that Fleet ID appears correctly in log output during test runs

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@moko-poi moko-poi requested a review from a team as a code owner September 30, 2025 16:40
@netlify

netlify bot commented Sep 30, 2025

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit 5bffe50
🔍 Latest deploy log https://app.netlify.com/projects/karpenter-docs-prod/deploys/68eb977030eb8000086205b2

@github-actions
Contributor

github-actions bot commented Oct 1, 2025

Preview deployment ready!

Preview URL: https://pr-8547.d18coufmbnnaag.amplifyapp.com

Built from commit 5bffe508654eed845882cc5ab3e016898e08d201

@youwalther65
Contributor

@moko-poi /pkg/controllers/interruption/controller.go also calls MarkUnavailable and needs to be updated as well.

@moko-poi
Author

moko-poi commented Oct 2, 2025

@youwalther65 Thank you for pointing that out! You're absolutely right.

I've already addressed the MarkUnavailable call in /pkg/controllers/interruption/controller.go as part of this PR. The change adds an empty string "" as the fleetID parameter since Fleet IDs don't exist in interruption contexts:

fa9a892

- c.unavailableOfferingsCache.MarkUnavailable(ctx, string(msg.Kind()), ec2types.InstanceType(instanceType), zone, karpv1.CapacityTypeSpot)
+ c.unavailableOfferingsCache.MarkUnavailable(ctx, string(msg.Kind()), ec2types.InstanceType(instanceType), zone, karpv1.CapacityTypeSpot, "")

@moko-poi
Author

moko-poi commented Oct 2, 2025

@youwalther65, @sumukha-radhakrishna
I have some concerns about adding the fleetID parameter to the general MarkUnavailable method.

Fleet ID is only meaningful for IsUnfulfillableCapacity errors from Fleet API calls, but we're forcing all callers (including interruption events like SpotInterruptionKind) to provide this parameter, even when it doesn't make sense in their context.

What do you think about this approach? Do you have any better ideas or suggestions on how we could handle this more elegantly?

@youwalther65
Contributor

@DerekFrank What's your opinion on that?
@moko-poi My initial thought was just to log the FleetID in case we get IsUnfulfillableCapacity i.e. extending the log we already have for "reason":"InsufficientInstanceCapacity" with "fleetID":"fleet-XXX" like I mentioned here, not necessarily putting the fleetID into the cache.

@moko-poi
Author

moko-poi commented Oct 2, 2025

@youwalther65 Thank you for the suggestion!

You're absolutely right. I've refactored the implementation to only log Fleet ID for IsUnfulfillableCapacity errors, rather than adding it to the cache method signatures.
8c0d4a8

This preserves the CloudTrail correlation capability while keeping the design cleaner and more maintainable. Please let me know if this aligns with what you had in mind!

@youwalther65
Contributor

youwalther65 commented Oct 2, 2025

Thx @moko-poi
The only "flaw" I see now is that we have duplicate log lines, one with fleetID and one without, because you still have to call p.unavailableOfferings.MarkUnavailable. This might be somewhat "expensive" (duplicate logging) and could confuse users.
So my favourite is the previous commit fa9a892, and we could just live with the empty string in pkg/controllers/interruption/controller.go.

Let's hear what @DerekFrank thinks about the different approaches, as he is one of the maintainers.

@DerekFrank
Contributor

Hey folks! I think we have two goals

  1. Let's not double-log lines
  2. Let's not log empty information when we don't have it

I see two solutions:

  1. If we're going to move the "removing offering from offerings" log line outside of MarkUnavailable(), let's move it out entirely and remove unavailableReason as well.
  2. If we want the log line to be in MarkUnavailable, let's remove unavailableReason and replace it with a map of arbitrary context that gets logged, and pass in unavailableReason and optionally fleetId

I think I prefer the latter. What do you guys think?

@youwalther65
Contributor

@DerekFrank Great suggestions, thx.
I completely agree with the goals.
In terms of which to prefer, I would suggest taking the logging approach that is consistent with other logging in the project.
I would also prefer the second option.

Contributor

@DerekFrank DerekFrank left a comment

See comment in conversation

@moko-poi
Author

moko-poi commented Oct 3, 2025

@DerekFrank @youwalther65
Thank you very much for your suggestions.
I also agree with option 2 and will proceed with the implementation in this direction.

@moko-poi moko-poi requested a review from DerekFrank October 4, 2025 12:11
@moko-poi
Author

moko-poi commented Oct 4, 2025

@DerekFrank @youwalther65

Implemented option 2 as recommended

  • Eliminated duplicate logging
  • Automatic filtering of empty Fleet ID values
  • Flexible extraContext map[string]interface{} approach

Ready for re-review!


// MarkUnavailable communicates recently observed temporary capacity shortages in the provided offerings
- func (u *UnavailableOfferings) MarkUnavailable(ctx context.Context, unavailableReason string, instanceType ec2types.InstanceType, zone, capacityType string) {
+ func (u *UnavailableOfferings) MarkUnavailable(ctx context.Context, instanceType ec2types.InstanceType, zone, capacityType string, extraContext map[string]interface{}) {
Contributor

Does this need to be a map[string]interface{}? Can it be a map[string]string?

Contributor

Can we keep the name unavailableReason and just make it a map[string]string? extraContext doesn't tell us that much.

Author

@moko-poi moko-poi Oct 12, 2025

b088159
Thanks for the feedback! I've addressed all the points you raised.
Changed extraContext map[string]interface{} to unavailableReason map[string]string as suggested.


// Add extra context if provided and not empty
for k, v := range extraContext {
if v != nil && v != "" {
Contributor

I don't see where a nil value can get in, is this just in case?

Contributor

Agree with @DerekFrank: we call it for Spot interruptions and InsufficientInstanceCapacity, always with a non-empty reason.

Author

6462ccd
You're absolutely right! I've removed both the nil check and the empty string check.

}
}

log.FromContext(ctx).WithValues(logValues...).V(1).Info("removing offering from offerings")
Contributor

If you don't need the nil check, I think you might be able to make this section cleaner by doing something like:

log.FromContext(ctx).WithValues(
	"instance-type", instanceType,
	"zone", zone,
	"capacity-type", capacityType,
	"ttl", UnavailableOfferingsTTL,
	extraContext...).V(1).Info("removing offering from offerings")

Author

@DerekFrank
Thanks for the suggestion! I really like the cleaner approach you're proposing. However, there's a Go language limitation here that prevents us from using the spread operator directly with maps.

Contributor

What's the limitation? I'm sure there's a cleaner way to do this.

You might need to change logValues to a []string{}, but I'm very surprised this doesn't work

Contributor

@youwalther65 youwalther65 Oct 24, 2025

@DerekFrank I tried other approaches as well, but those failed. I got errors like {"level":"DPANIC",..."logger":"controller","caller":"cache/unavailableofferings.go:93","message":"odd number of arguments passed as key-value pairs for logging" because log.FromContext(ctx).WithValues expects key-value pairs.

I feel the approach from @moko-poi is valid.
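
For readers following along, here is a minimal sketch of the limitation discussed above. Go's `...` can only expand a slice into a variadic parameter, so a map has to be flattened into a key/value slice first, and an odd-length slice is the kind of input that produces the "odd number of arguments" error. `logKV` and `flattenReason` are hypothetical stand-ins written for this illustration; they are not Karpenter or logr functions.

```go
package main

import "fmt"

// logKV is a hypothetical stand-in for a logger method that takes variadic
// key/value pairs, similar in shape to a WithValues-style call.
func logKV(keysAndValues ...interface{}) error {
	if len(keysAndValues)%2 != 0 {
		return fmt.Errorf("odd number of arguments passed as key-value pairs: %d", len(keysAndValues))
	}
	return nil
}

// flattenReason converts a map into an alternating key/value slice.
// Map iteration order is not specified in Go, so the pair order may vary
// between calls.
func flattenReason(reason map[string]string) []interface{} {
	kv := make([]interface{}, 0, len(reason)*2)
	for k, v := range reason {
		kv = append(kv, k, v)
	}
	return kv
}

func main() {
	reason := map[string]string{"reason": "InsufficientInstanceCapacity", "fleet-id": "fleet-XXX"}

	// logKV(reason...) does not compile: only a slice can be expanded with "...".
	// Explicit arguments also cannot be combined with an expanded slice in the
	// same call, so everything is appended into one slice first.
	kv := append([]interface{}{"zone", "eu-west-1a"}, flattenReason(reason)...)
	fmt.Println(len(kv), logKV(kv...))

	// An odd number of elements breaks the key/value pairing.
	fmt.Println(logKV("zone", "eu-west-1a", "reason"))
}
```

Building a single `[]interface{}` and expanding it once is the pattern the rest of this thread converges on.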

Using a Fault Injection Simulator (FIS) experiment to simulate Spot interruption and a local, patched Karpenter controller with make run, I got the expected log line:
{"level":"DEBUG","time":"2025-10-24T09:10:37.127+0200","logger":"controller","caller":"cache/unavailableofferings.go:105","message":"removing offering from offerings","controller":"interruption","namespace":"","name":"","reconcileID":"ae3b097e-b068-4d16-bb85-6a8857b5b675","queue":"karpenter-demo","messageKind":"spot_interrupted","NodeClaim":{"name":"default-psv6m"},"action":"CordonAndDrain","Node":{"name":"i-0d4<redacted>"},"instance-type":"c5a.xlarge","zone":"eu-west-1a","capacity-type":"spot","ttl":"3m0s","reason":"spot_interrupted"}

There is one drawback with this approach, though: for InsufficientInstanceCapacity the map contains two keys (reason and fleet-id), and because Go does not guarantee map iteration order, the range loop might insert these keys in a different order each time. That isn't nice for log parsing (one time the reason might come first, another time the fleet-id).
So we should insert the values in a predictable order. Something like:

	// Add "reason" and "fleet-id" if provided
	unavailableKeys := []string{"reason", "fleet-id"}
	for _, key := range unavailableKeys {
		if value, ok := unavailableReason[key]; ok {
			logValues = append(logValues, key, value)
		}
	}

Contributor

I'd prefer if we just sort the keys prior to logging. Something like:

keys := lo.Keys(logValues)
slices.Sort(keys)
...
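
A standard-library-only sketch of the sorted-key idea, assuming the keys are taken from the unavailableReason map (a slice such as logValues has no keys to extract). `sortedPairs` is a hypothetical helper name used only for this illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedPairs returns the map's entries as alternating key/value elements,
// ordered by key so that repeated calls produce the same sequence.
func sortedPairs(reason map[string]string) []interface{} {
	keys := make([]string, 0, len(reason))
	for k := range reason {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]interface{}, 0, len(keys)*2)
	for _, k := range keys {
		pairs = append(pairs, k, reason[k])
	}
	return pairs
}

func main() {
	reason := map[string]string{"reason": "InsufficientInstanceCapacity", "fleet-id": "fleet-XXX"}
	fmt.Println(sortedPairs(reason)...)
}
```

With alphabetical ordering, fleet-id sorts ahead of reason.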

Contributor

@DerekFrank But then these logs will have a different order compared to other logs.
I'd rather keep the original order, which is "instance-type", "zone", "capacity-type", "ttl", "reason", and now additionally "fleet-id" in case of InsufficientInstanceCapacity.
Or we go back to one of the initial ideas and move the log.FromContext(ctx).WithValues call completely out of the MarkUnavailable function.

Contributor

To keep the order of log entries for backwards compatibility we could use:

	logValues := []interface{}{
		"reason", unavailableReason["reason"],
		"instance-type", instanceType,
		"zone", zone,
		"capacity-type", capacityType,
		"ttl", UnavailableOfferingsTTL,
	}

	// Add fleet-id if provided
	if fleetID, ok := unavailableReason["fleet-id"]; ok {
		logValues = append(logValues, "fleet-id", fleetID)
	}

	log.FromContext(ctx).WithValues(logValues...).V(1).Info("removing offering from offerings")
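
A self-contained version of the fixed-order idea, for anyone who wants to experiment outside the Karpenter tree. `buildLogValues` is a hypothetical helper, plain strings stand in for ec2types.InstanceType, and the 3-minute TTL is an assumption based on the "ttl":"3m0s" value in the log line quoted earlier in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// buildLogValues is a hypothetical helper that assembles key/value pairs in a
// fixed order. "fleet-id" is appended only when the map contains that key.
func buildLogValues(reason map[string]string, instanceType, zone, capacityType string, ttl time.Duration) []interface{} {
	logValues := []interface{}{
		"reason", reason["reason"],
		"instance-type", instanceType,
		"zone", zone,
		"capacity-type", capacityType,
		"ttl", ttl,
	}
	if fleetID, ok := reason["fleet-id"]; ok {
		logValues = append(logValues, "fleet-id", fleetID)
	}
	return logValues
}

func main() {
	ttl := 3 * time.Minute // assumed value, for illustration only

	// Spot interruption: no fleet ID is available, so the key is left out.
	spot := buildLogValues(map[string]string{"reason": "spot_interrupted"},
		"c5a.xlarge", "eu-west-1a", "spot", ttl)

	// Fleet error: the fleet ID is appended after the fixed fields.
	ice := buildLogValues(map[string]string{"reason": "InsufficientInstanceCapacity", "fleet-id": "fleet-XXX"},
		"c5a.xlarge", "eu-west-1a", "spot", ttl)

	fmt.Println(len(spot), len(ice))
}
```

Because the slice is built by hand, its length is always even and the field order does not depend on map iteration.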

Author

Thanks! That makes sense. Since WithValues expects even key/value pairs and Go doesn’t allow spreading a map, this version keeps the log stable and avoids DPANICs.
Let’s go with the fixed-order slice approach as in #8684 — it keeps the output consistent and backward compatible.

instanceType := fleetErr.LaunchTemplateAndOverrides.Overrides.InstanceType
zone := aws.ToString(fleetErr.LaunchTemplateAndOverrides.Overrides.AvailabilityZone)
u.MarkUnavailable(ctx, lo.FromPtr(fleetErr.ErrorCode), instanceType, zone, capacityType)
extraContext := map[string]interface{}{
Contributor

nit: Could this be inlined into the call below?

Author

e45c7d8
Done! Inlined the map into the call.

u.offeringCacheSeqNumMu.Unlock()
}

func (u *UnavailableOfferings) MarkUnavailableForFleetErr(ctx context.Context, fleetErr ec2types.CreateFleetError, capacityType string) {
Contributor

Do we still need this function?

Author

c845300
You're absolutely right - this function wasn't needed.

return ec2types.CreateFleetInstance{}, cloudprovider.NewCreateError(fmt.Errorf("creating fleet request, %w", err), reason, fmt.Sprintf("Error creating fleet request: %s", message))
}
p.updateUnavailableOfferingsCache(ctx, createFleetOutput.Errors, capacityType, nodeClaim, instanceTypes)
fleetID := aws.ToString(createFleetOutput.FleetId)
Contributor

nit: I prefer to inline these

Author

Good point! You're right, that variable was only used once.

@DerekFrank
Contributor

Thanks for the work on this, I think we're very close to a fantastic solution!

@moko-poi
Author

@DerekFrank @youwalther65
Thank you both for the thorough and thoughtful review! I've addressed all the feedback points you raised.

The code is now more direct, type-safe, and easier to understand. Thanks for pushing for these improvements - the end result is much cleaner! 🚀

@youwalther65
Contributor

Thank you @moko-poi !
I did a quick check, and it looks like you still use extraContext := map[string]interface{} in controller.go and instance.go.
Could you use unavailableReason := map[string]string there as well?

@moko-poi
Author

@youwalther65

5bffe50
Fixed both locations.
Thanks for the review! 🙏

Development

Successfully merging this pull request may close these issues.

Log responseElement fleetId from CreateFleet API call to make debugging via AWS CloudTrail easier

3 participants