feat: Add ConfigMap watching for faster token re-issuance#313
feat: Add ConfigMap watching for faster token re-issuance#313
Conversation
This implementation adds a new ConfigMap controller that watches for changes to the teleport-operator ConfigMap and triggers immediate token re-issuance when configuration updates occur, improving response time from 5+ minutes to seconds. Key features: - Event-driven ConfigMap watching with predicate filtering - Intelligent change detection with 4-level impact classification: * Critical (ProxyAddr): Forces reconnection + token regeneration * High (ManagementClusterName): Invalidates all tokens * Medium (TeleportVersion/AppName): Updates ConfigMaps * Low (AppVersion/AppCatalog): No immediate action required - Immediate cluster reconciliation triggering via annotations - Comprehensive unit tests covering all change scenarios - Backward compatible - no breaking changes Performance improvements: - ProxyAddr changes: 5+ minutes → ~seconds (>100x faster) - ManagementClusterName changes: 5+ minutes → ~seconds (>100x faster) - TeleportVersion changes: Next restart → ~seconds (near-instant) Files added: - internal/controller/config_controller.go - Main implementation - internal/controller/config_controller_test.go - Unit tests Files modified: - main.go - Added ConfigMap controller registration
| log.Info("ConfigMap deleted, but we continue with cached config") | ||
| return ctrl.Result{}, nil |
There was a problem hiding this comment.
This controller does not currently continue with cached config. Why should the controller cache any config?
There was a problem hiding this comment.
thats right, the log message is misleading since we don't actually cache config in this controller. I'll update it.
| // Only process the teleport-operator ConfigMap | ||
| if req.Name != key.TeleportOperatorConfigName { | ||
| return ctrl.Result{}, nil | ||
| } |
There was a problem hiding this comment.
This is theoretically handled by the predicate function, right?
| var changes []ConfigChange | ||
|
|
||
| if oldConfig == nil { | ||
| // First time seeing config, no changes to process | ||
| return changes | ||
| } |
There was a problem hiding this comment.
IMO the logic should differentiate between "nothing changed between old and new" and "there was no old config". Wouldn't everything in a new config be a change if there was no old config?
There was a problem hiding this comment.
during startup when there's no old config, do we want to treat the initial configuration as changes that trigger reconciliation actions. The system is initializing and all components will naturally use the new config through the normal startup flow.
| // Log all changes for audit purposes | ||
| for _, change := range changes { | ||
| log.Info("Configuration change processed", | ||
| "field", change.Field, | ||
| "oldValue", change.OldValue, | ||
| "newValue", change.NewValue, | ||
| "impact", r.impactString(change.Impact)) | ||
| } |
There was a problem hiding this comment.
This audit is lost if any of the previous steps error out
There was a problem hiding this comment.
I will move it to happen immediately after change detection
| } | ||
|
|
||
| // handleConfigChanges processes detected configuration changes | ||
| func (r *ConfigReconciler) handleConfigChanges(ctx context.Context, log logr.Logger, changes []ConfigChange) error { |
There was a problem hiding this comment.
AFAICT the only difference in any of the code paths from here are log lines and clearing the local object teleport identity.
Why is this whole impact system necessary if the outcome is always the same?
There was a problem hiding this comment.
I am going to refactor this one
| } | ||
|
|
||
| // Add annotation to trigger reconciliation | ||
| cluster.Annotations["teleport-operator.giantswarm.io/config-updated"] = timestamp |
There was a problem hiding this comment.
The annotation string should be a const.
What were your thoughts on using a timestamp vs a config hash?
There was a problem hiding this comment.
am going to move this one to the key package
There was a problem hiding this comment.
timestamp works well since we've already filtered for meaningful changes in detectConfigChanges(). If we find that identical configs are causing issues, we could switch to a config hash
| return ctrl.NewControllerManagedBy(mgr). | ||
| For(&corev1.ConfigMap{}). | ||
| WithOptions(controller.Options{ | ||
| MaxConcurrentReconciles: 1, // Process config changes sequentially |
There was a problem hiding this comment.
Why does sequence / concurrency matter for this controller?
There was a problem hiding this comment.
config changes are rare events so they have predictable behavior without significant performance impact using sequential processing
There was a problem hiding this comment.
I was confused because 1 is already the default for that setting, so by explicitly setting it I thought you might have already identified a race condition or something that limits concurrency. If that's not the case, I don't think you really need to set the value
There was a problem hiding this comment.
yeah we can get rid of it, since 1 is already the default for MaxConcurrentReconciles
|
To generalize my feedback: The purpose of this controller is to force the existing cluster controller to re-reconcile Cluster CRs when a single ConfigMap changes. So, is the other logic necessary? Are there more use cases I missed that you're trying to support? |
This implementation adds a new ConfigMap controller that watches for changes to the teleport-operator ConfigMap and triggers immediate token re-issuance when configuration updates occur, improving response time from 5+ minutes to seconds.
Key features:
Performance improvements:
Files added:
Files modified:
What this PR does / why we need it
Checklist