feat: Add ConfigMap watching for faster token re-issuance#313

Open
ssyno wants to merge 11 commits into main from feat/configmap-watching-for-faster-token-reissuance

Conversation


@ssyno ssyno commented Sep 3, 2025

This implementation adds a new ConfigMap controller that watches for changes to the teleport-operator ConfigMap and triggers immediate token re-issuance when configuration updates occur, improving response time from 5+ minutes to seconds.

Key features:

  • Event-driven ConfigMap watching with predicate filtering
  • Intelligent change detection with 4-level impact classification:
    • Critical (ProxyAddr): Forces reconnection + token regeneration
    • High (ManagementClusterName): Invalidates all tokens
    • Medium (TeleportVersion/AppName): Updates ConfigMaps
    • Low (AppVersion/AppCatalog): No immediate action required
  • Immediate cluster reconciliation triggering via annotations
  • Comprehensive unit tests covering all change scenarios
  • Backward compatible - no breaking changes
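
The four impact levels above could be modeled roughly like this; the type and constant names are illustrative and not necessarily the PR's actual identifiers:

```go
package main

import "fmt"

// ChangeImpact classifies how disruptive a config field change is.
// (Names are illustrative; the PR's actual types may differ.)
type ChangeImpact int

const (
	ImpactLow      ChangeImpact = iota // e.g. AppVersion/AppCatalog: no immediate action
	ImpactMedium                       // e.g. TeleportVersion/AppName: update ConfigMaps
	ImpactHigh                         // e.g. ManagementClusterName: invalidate all tokens
	ImpactCritical                     // e.g. ProxyAddr: reconnect + regenerate tokens
)

// ConfigChange records a single field transition and its impact.
type ConfigChange struct {
	Field    string
	OldValue string
	NewValue string
	Impact   ChangeImpact
}

// String renders the impact level for structured log fields.
func (i ChangeImpact) String() string {
	switch i {
	case ImpactCritical:
		return "critical"
	case ImpactHigh:
		return "high"
	case ImpactMedium:
		return "medium"
	default:
		return "low"
	}
}

func main() {
	c := ConfigChange{Field: "ProxyAddr", OldValue: "old.example.com:443", NewValue: "new.example.com:443", Impact: ImpactCritical}
	fmt.Printf("%s changed (%s -> %s), impact=%s\n", c.Field, c.OldValue, c.NewValue, c.Impact)
}
```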

Performance improvements:

  • ProxyAddr changes: 5+ minutes → ~seconds (>100x faster)
  • ManagementClusterName changes: 5+ minutes → ~seconds (>100x faster)
  • TeleportVersion changes: Next restart → ~seconds (near-instant)

Files added:

  • internal/controller/config_controller.go - Main implementation
  • internal/controller/config_controller_test.go - Unit tests

Files modified:

  • main.go - Added ConfigMap controller registration

What this PR does / why we need it

Checklist

  • Update changelog in CHANGELOG.md.

@ssyno ssyno requested a review from a team as a code owner September 3, 2025 12:59
Comment on lines +64 to +65
log.Info("ConfigMap deleted, but we continue with cached config")
return ctrl.Result{}, nil
Contributor

This controller does not currently continue with cached config. Why should the controller cache any config?

Collaborator Author

That's right, the log message is misleading since we don't actually cache config in this controller. I'll update it.

Comment on lines +56 to +59
// Only process the teleport-operator ConfigMap
if req.Name != key.TeleportOperatorConfigName {
return ctrl.Result{}, nil
}
Contributor

This is theoretically handled by the predicate function, right?

Collaborator Author

removing it.
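
The predicate the reviewer mentions filters events before Reconcile is ever called, which is why the in-handler name check is redundant. The equivalent check, sketched as a plain function (the constant value here is hypothetical; the real one lives in the key package as key.TeleportOperatorConfigName):

```go
package main

import "fmt"

// teleportOperatorConfigName mirrors key.TeleportOperatorConfigName
// (hypothetical value for illustration).
const teleportOperatorConfigName = "teleport-operator"

// isWatchedConfigMap models what the controller's predicate checks:
// only events for the teleport-operator ConfigMap reach Reconcile,
// so Reconcile itself does not need to repeat the name comparison.
func isWatchedConfigMap(name string) bool {
	return name == teleportOperatorConfigName
}

func main() {
	fmt.Println(isWatchedConfigMap("teleport-operator")) // true
	fmt.Println(isWatchedConfigMap("kube-root-ca.crt"))  // false
}
```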

Comment on lines +127 to +132
var changes []ConfigChange

if oldConfig == nil {
// First time seeing config, no changes to process
return changes
}
Contributor

IMO the logic should differentiate between "nothing changed between old and new" and "there was no old config". Wouldn't everything in a new config be a change if there was no old config?

Collaborator Author

During startup, when there's no old config, we don't want to treat the initial configuration as changes that trigger reconciliation actions. The system is initializing, and all components will naturally pick up the new config through the normal startup flow.
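
The distinction the reviewer raises could be sketched as follows: return early for a nil old config (initial startup), and return an empty slice when old and new configs are identical. Field names are illustrative; the real struct has more fields:

```go
package main

import "fmt"

// Config holds a subset of operator settings compared here (illustrative).
type Config struct {
	ProxyAddr             string
	ManagementClusterName string
}

type ConfigChange struct {
	Field, OldValue, NewValue string
}

// detectConfigChanges returns no changes when there is no previous config
// to compare against (initial startup), because components pick up the
// config via the normal startup flow. An identical old and new config
// also yields no changes, but via the field comparisons below.
func detectConfigChanges(oldConfig, newConfig *Config) []ConfigChange {
	if oldConfig == nil {
		// First time seeing config: nothing to diff against.
		return nil
	}
	var changes []ConfigChange
	if oldConfig.ProxyAddr != newConfig.ProxyAddr {
		changes = append(changes, ConfigChange{"ProxyAddr", oldConfig.ProxyAddr, newConfig.ProxyAddr})
	}
	if oldConfig.ManagementClusterName != newConfig.ManagementClusterName {
		changes = append(changes, ConfigChange{"ManagementClusterName", oldConfig.ManagementClusterName, newConfig.ManagementClusterName})
	}
	return changes
}

func main() {
	old := &Config{ProxyAddr: "a:443"}
	fmt.Println(len(detectConfigChanges(nil, old)))                         // 0: initial startup
	fmt.Println(len(detectConfigChanges(old, old)))                         // 0: nothing changed
	fmt.Println(len(detectConfigChanges(old, &Config{ProxyAddr: "b:443"}))) // 1: real change
}
```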

Comment on lines +235 to +242
// Log all changes for audit purposes
for _, change := range changes {
log.Info("Configuration change processed",
"field", change.Field,
"oldValue", change.OldValue,
"newValue", change.NewValue,
"impact", r.impactString(change.Impact))
}
Contributor

This audit is lost if any of the previous steps error out

Collaborator Author

I will move it to happen immediately after change detection

}

// handleConfigChanges processes detected configuration changes
func (r *ConfigReconciler) handleConfigChanges(ctx context.Context, log logr.Logger, changes []ConfigChange) error {
Contributor

AFAICT the only difference in any of the code paths from here are log lines and clearing the local object teleport identity.

Why is this whole impact system necessary if the outcome is always the same?

Collaborator Author

I am going to refactor this one

}

// Add annotation to trigger reconciliation
cluster.Annotations["teleport-operator.giantswarm.io/config-updated"] = timestamp
Contributor

The annotation string should be a const.

What were your thoughts on using a timestamp vs a config hash?

Collaborator Author

I am going to move this one to the key package.

Collaborator Author

A timestamp works well since we've already filtered for meaningful changes in detectConfigChanges(). If identical configs turn out to cause issues, we could switch to a config hash.
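
If a hash were preferred over a timestamp, a deterministic digest of the ConfigMap data would make the annotation write a no-op for unchanged config content. A minimal sketch, not the PR's implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// configHash computes a stable digest over ConfigMap data, so that writing
// it as an annotation value changes nothing when the config is identical.
func configHash(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for determinism
	h := sha256.New()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, data[k])
	}
	return fmt.Sprintf("%x", h.Sum(nil))[:16] // a short prefix suffices for change detection
}

func main() {
	a := configHash(map[string]string{"proxyAddr": "x:443"})
	b := configHash(map[string]string{"proxyAddr": "x:443"})
	c := configHash(map[string]string{"proxyAddr": "y:443"})
	fmt.Println(a == b, a == c) // true false
}
```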

return ctrl.NewControllerManagedBy(mgr).
For(&corev1.ConfigMap{}).
WithOptions(controller.Options{
MaxConcurrentReconciles: 1, // Process config changes sequentially
Contributor

Why does sequence / concurrency matter for this controller?

Collaborator Author

Config changes are rare events, so sequential processing gives predictable behavior without significant performance impact.

Contributor

I was confused because 1 is already the default for that setting, so by explicitly setting it I thought you might have already identified a race condition or something that limits concurrency. If that's not the case, I don't think you really need to set the value

Collaborator Author

Yeah, we can get rid of it, since 1 is already the default for MaxConcurrentReconciles.


stone-z commented Sep 9, 2025

To generalize my feedback:

The purpose of this controller is to force the existing cluster controller to re-reconcile Cluster CRs when a single ConfigMap changes.
So, the minimum functionality for this controller IIUC would be to 1. ignore CM deletions, and otherwise: 2. apply any "new" value as an annotation to the Cluster CRs (hash, timestamp, random number, etc.). The contents of the CM don't matter to this controller.

So, is the other logic necessary? Are there more use cases I missed that you're trying to support?
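
The minimal behavior described here, ignore deletions and stamp a fresh value onto each Cluster CR's annotations, reduces to a very small core. A sketch of the annotation update as a plain function (the annotation key is taken from this PR; the function name is hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// configUpdatedAnnotation is the key the cluster controller re-reconciles on,
// matching the annotation used in this PR.
const configUpdatedAnnotation = "teleport-operator.giantswarm.io/config-updated"

// bumpConfigAnnotation returns annotations carrying a new stamp. Any value
// that differs from the previous one (timestamp, hash, counter) is enough
// to make the cluster controller re-reconcile; the CM contents themselves
// never need to be inspected here.
func bumpConfigAnnotation(annotations map[string]string, stamp string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[configUpdatedAnnotation] = stamp
	return annotations
}

func main() {
	a := bumpConfigAnnotation(nil, time.Now().UTC().Format(time.RFC3339))
	fmt.Println(a[configUpdatedAnnotation] != "")
}
```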
