@youwalther65 youwalther65 commented Oct 27, 2025

Alternate approach to solve #8482 based on ideas from @moko-poi in PR #8547

Fixes #8482

Description
Adds the FleetID to Karpenter controller logs when the EC2 CreateFleet API call returns an InsufficientInstanceCapacity error.

Moved logging out of pkg/cache/unavailableofferings.go func MarkUnavailable.

After an offline discussion with @DerekFrank, I moved logging back to pkg/cache/unavailableofferings.go func MarkUnavailable, but the order of the log items is now kept backward compatible using the approach here.

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@youwalther65 youwalther65 requested a review from a team as a code owner October 27, 2025 09:04
@youwalther65 youwalther65 requested a review from rschalo October 27, 2025 09:04
netlify bot commented Oct 27, 2025

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit 114a892
🔍 Latest deploy log https://app.netlify.com/projects/karpenter-docs-prod/deploys/69033af31ddacc00087ea613
😎 Deploy Preview https://deploy-preview-8684--karpenter-docs-prod.netlify.app
@youwalther65 youwalther65 changed the title from "Alternate approach to solve #8482 based on ideas from moko-poi in PR #8547" to "Alternate approach to solve #8482 based on ideas from @moko-poi in PR #8547" Oct 27, 2025
"capacity-type", karpv1.CapacityTypeSpot,
"ttl", awscache.UnavailableOfferingsTTL,
"fleet-id", fleetID,
).V(1).Info("removing offering from offerings")
Contributor

I'm not sure I love having two logs. What's the reason for not using the previous mechanism?

Contributor Author

@youwalther65 youwalther65 Oct 28, 2025

@DerekFrank The primary motivation is to keep the order of the log items consistent, i.e. backward compatible with the order reason, instance-type, zone, capacity-type. Only in the case of "reason":"InsufficientInstanceCapacity" do we add fleet-id, like:

{"level":"DEBUG","time":"2025-08-11T11:22:26.753Z","logger":"controller","caller":"cache/unavailableofferings.go:73","message":"removing offering from offerings","commit":"434f54c","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":""},"namespace":"","name":"<redacted","reconcileID":"1569d46a-22ad-4d50-be67-f2e0392df3dd","reason":"InsufficientInstanceCapacity","instance-type":"r6a.32xlarge","zone":"eu-west-1b","capacity-type":"on-demand","ttl":"3m0s","fleet-id":"fleet-XXX"}

This is important because there are customers relying on the order, using regex patterns to query, tools like LogParserForKarpenter, or other custom log parsing.
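To illustrate the kind of consumer that depends on field order, here is a minimal sketch. The pattern and the `parseOffering` helper are hypothetical and not taken from LogParserForKarpenter or any real tool. They only show that a regex anchored on the existing key order keeps matching when fleet-id is appended at the end, and stops matching when reason moves.

```go
package main

import (
	"fmt"
	"regexp"
)

// orderSensitive is a hypothetical pattern of the kind a log consumer might use.
// It assumes "reason" is immediately followed by "instance-type", "zone" and
// "capacity-type", in exactly that order.
var orderSensitive = regexp.MustCompile(
	`"reason":"([^"]+)","instance-type":"([^"]+)","zone":"([^"]+)","capacity-type":"([^"]+)"`)

// parseOffering returns the four captured values, or ok=false when the keys
// do not appear in the expected order.
func parseOffering(line string) (fields [4]string, ok bool) {
	m := orderSensitive.FindStringSubmatch(line)
	if m == nil {
		return fields, false
	}
	copy(fields[:], m[1:])
	return fields, true
}

func main() {
	// Existing order, with fleet-id appended as the last item.
	appended := `{"reason":"InsufficientInstanceCapacity","instance-type":"r6a.32xlarge","zone":"eu-west-1b","capacity-type":"on-demand","ttl":"3m0s","fleet-id":"fleet-XXX"}`
	// Same data, but with reason moved to the end of the line.
	reordered := `{"instance-type":"r6a.32xlarge","zone":"eu-west-1b","capacity-type":"on-demand","ttl":"3m0s","fleet-id":"fleet-XXX","reason":"InsufficientInstanceCapacity"}`

	_, okAppended := parseOffering(appended)
	_, okReordered := parseOffering(reordered)
	fmt.Println("appended fleet-id still parses:", okAppended)
	fmt.Println("reordered reason still parses:", okReordered)
}
```

A consumer that decodes each line as JSON would not be affected by reordering; the concern applies to pattern-based parsing like the above.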

You rejected @moko-poi's approach of adding just the fleet-id as the last argument to func MarkUnavailable because, in the case of a spot interruption, there is no fleet-id and we would have an empty "fleet-id":"" attribute, which could confuse users.
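One way to avoid the empty attribute, sketched here under assumptions: build the key/value pairs in the existing order and append fleet-id only when it is non-empty. `offeringLogValues` is a hypothetical helper using only the standard library, it is not the code in this PR, and the spot reason string below is a placeholder.

```go
package main

import "fmt"

// offeringLogValues (hypothetical) returns alternating key/value pairs in the
// existing log order. "fleet-id" is appended last, and only when it is set,
// so events without a fleet never carry an empty attribute.
func offeringLogValues(reason, instanceType, zone, capacityType, ttl, fleetID string) []any {
	kv := []any{
		"reason", reason,
		"instance-type", instanceType,
		"zone", zone,
		"capacity-type", capacityType,
		"ttl", ttl,
	}
	if fleetID != "" {
		kv = append(kv, "fleet-id", fleetID)
	}
	return kv
}

func main() {
	// CreateFleet failure: a fleet ID is available.
	fmt.Println(offeringLogValues("InsufficientInstanceCapacity", "r6a.32xlarge", "eu-west-1b", "on-demand", "3m0s", "fleet-XXX"))
	// Spot interruption: no fleet ID, so the key is omitted entirely.
	// "PlaceholderSpotReason" is an assumed value for illustration only.
	fmt.Println(offeringLogValues("PlaceholderSpotReason", "r6a.32xlarge", "eu-west-1b", "spot", "3m0s", ""))
}
```

A slice of alternating keys and values like this is the shape that structured loggers accepting variadic key/value arguments typically expect.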

Using the second approach with a map unavailableReason with keys reason and fleet-id would move reason to the end of the log line, breaking backward compatibility, and we would have to add sorting for this map, which doesn't look nice either.

To be clear: this PR does not create two log lines for the same event. The call to log.FromContext just happens in two different locations, either in instance.go for an InsufficientInstanceCapacity event or in controller.go for Spot interruptions.
So my approach keeps func MarkUnavailable clean, with a signature that contains only the attributes stored in the offeringCache cache, because reason and fleet-id are not values stored in this cache. In addition, the logging is done where the corresponding event happens.

Contributor

I understand the reasoning behind having the log line be in two places. I think having the sorting is cleaner for backwards compatibility, especially if we intend to add more log information to this field in the future. We can discuss offline and get a path forward.

@moko-poi
Thanks @youwalther65 and @DerekFrank for following up on this!

I'm totally fine with continuing this work in #8684 — it keeps the intent of #8547 while making the log order deterministic and preserving backward compatibility.
The main goal from my side was always to surface the FleetID for CloudTrail correlation, without adding noise (like empty fleet-id fields) or changing the existing log structure.

Happy to close #8547 once this PR becomes the final implementation.

Development

Successfully merging this pull request may close these issues.

Log responseElement fleetId from CreateFleet API call to make debugging via AWS CloudTrail easier

3 participants