- 
                Notifications
    You must be signed in to change notification settings 
- Fork 191
feat: Support hot reloading of MCP operation files #1959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Hi @melsonic, make sure to mark this ready for review when you think it's done. For it to be merged, we will need: 
 For this specifically, we'd need it to be off by default, and use the polling based file watcher only, not inotify. We are aware of inotify and that it is theoretically a better solution for cases like this, but in practice it falls short in many environments so the router instead uses the simpler polling approach. Thanks! | 
| 
 Hi @endigma, As of now, I’ve implemented this to be applied by default. I’ll add a new config flag to allow changing that behavior. 
 Let me know your thoughts! | 
| I think it should be possible to perhaps extend the polling based watcher to walk globs or directories, and use a map to keep track of seen files New unseen files or updates to seen files could trigger a reload | 
| Hi @endigma, I've refactored the code to use a polling-based watcher for detecting operation file addition/deletion changes. It's nearly ready. I have one question though. Should we add a separate HotReloadPollInterval config option for MCP operations (like we have for the main config), or reuse the existing PollInterval? | 
| Enabling the watcher, setting its polling interval should be specific, yes Let me know when you're fully done with it, as per the general merge requirements I posted above as well as you being happy with it, then I'll see about getting it into review internally! | 
| Hi @endigma, this PR is ready for review. | 
| Hi @melsonic, thanks for letting me know! I'll see about getting this into the review queue for next week. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll need a few more passes as well, but I have a few initial comments
| Hi @endigma, please review when you have time. | 
| """ WalkthroughThe changes introduce hot reloading support for MCP operation files in the router. This includes new configuration options, updates to the MCP server and operation loader to support directory watching with reload intervals, and tests to verify hot reload behavior. The schema, configuration files, and internal logic are updated accordingly. Changes
 Assessment against linked issues
 Assessment against linked issues: Out-of-scope changesNo out-of-scope changes were found. All code changes are directly related to supporting and testing hot reloading of MCP operation files as described in the linked issue. 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (8)
 ✅ Files skipped from review due to trivial changes (1)
 🚧 Files skipped from review as they are similar to previous changes (6)
 🔇 Additional comments (1)
 ✨ Finishing Touches
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit: 
 SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
 Other keywords and placeholders
 CodeRabbit Configuration File ( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
router/core/router.go (1)
851-852: Previous feedback addressed – option condensed into single constructor
mcpserver.WithHotReload(r.mcp.HotReloadConfig.Enabled, r.mcp.HotReloadConfig.Interval)implements the “one option (bool, time.Duration)” pattern requested earlier – thanks for the clean-up.
No functional or stylistic concerns.
🧹 Nitpick comments (1)
router/pkg/watcher/watcher.go (1)
84-99: Consider consistent error handling between initial load and runtimeDuring initial directory listing, errors cause the watcher to fail (line 87), but during runtime file listing, errors are only logged (line 121). This inconsistency might lead to different behavior between startup and runtime.
Consider making the error handling consistent:
dirFilePaths, err := listDirFilePaths() if err != nil { - ll.Error("failed to list directory files", zap.Error(err)) - return err + ll.Error("failed to list directory files during initialization", zap.Error(err)) + // Continue with empty list to allow graceful degradation + dirFilePaths = []string{} }Or alternatively, make both cases return errors for consistency.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- router-tests/mcp_hot_reload_test.go(1 hunks)
- router-tests/testenv/testenv.go(1 hunks)
- router/core/graph_server.go(1 hunks)
- router/core/router.go(1 hunks)
- router/pkg/config/config.go(2 hunks)
- router/pkg/config/config.schema.json(1 hunks)
- router/pkg/config/fixtures/full.yaml(1 hunks)
- router/pkg/config/testdata/config_defaults.json(1 hunks)
- router/pkg/config/testdata/config_full.json(1 hunks)
- router/pkg/mcpserver/operation_manager.go(2 hunks)
- router/pkg/mcpserver/server.go(7 hunks)
- router/pkg/schemaloader/loader.go(3 hunks)
- router/pkg/watcher/watcher.go(5 hunks)
🧰 Additional context used
🧠 Learnings (4)
router/pkg/config/testdata/config_full.json (1)
Learnt from: SkArchon
PR: wundergraph/cosmo#1929
File: router/internal/circuit/manager.go:16-25
Timestamp: 2025-06-30T20:39:02.376Z
Learning: In the Cosmo router project, parameter validation for circuit breaker configuration is handled at the JSON schema level rather than through runtime validation methods on structs. The config.schema.json file contains comprehensive validation constraints for circuit breaker parameters.
router/pkg/config/testdata/config_defaults.json (2)
Learnt from: SkArchon
PR: wundergraph/cosmo#1929
File: router/internal/circuit/manager.go:16-25
Timestamp: 2025-06-30T20:39:02.376Z
Learning: In the Cosmo router project, parameter validation for circuit breaker configuration is handled at the JSON schema level rather than through runtime validation methods on structs. The config.schema.json file contains comprehensive validation constraints for circuit breaker parameters.
Learnt from: endigma
PR: wundergraph/cosmo#2009
File: router/pkg/config/config.go:0-0
Timestamp: 2025-07-03T10:33:25.755Z
Learning: The CardinalityLimit field in the Metrics struct (router/pkg/config/config.go) is validated at the JSON schema level in config.schema.json with a minimum value constraint of 1, preventing zero or negative values without requiring runtime validation.
router/pkg/config/config.schema.json (1)
undefined
<retrieved_learning>
Learnt from: SkArchon
PR: #1929
File: router/internal/circuit/manager.go:16-25
Timestamp: 2025-06-30T20:39:02.376Z
Learning: In the Cosmo router project, parameter validation for circuit breaker configuration is handled at the JSON schema level rather than through runtime validation methods on structs. The config.schema.json file contains comprehensive validation constraints for circuit breaker parameters.
</retrieved_learning>
router/pkg/schemaloader/loader.go (1)
Learnt from: SkArchon
PR: wundergraph/cosmo#1929
File: router/internal/circuit/manager.go:16-25
Timestamp: 2025-06-30T20:39:02.376Z
Learning: In the Cosmo router project, parameter validation for circuit breaker configuration is handled at the JSON schema level rather than through runtime validation methods on structs. The config.schema.json file contains comprehensive validation constraints for circuit breaker parameters.
🧬 Code Graph Analysis (3)
router-tests/testenv/testenv.go (2)
router/core/router.go (1)
WithStorageProviders(1984-1988)router/pkg/config/config.go (2)
StorageProviders(721-726)
FileSystemStorageProvider(753-756)
router/core/router.go (1)
router/pkg/mcpserver/server.go (1)
WithHotReload(287-292)
router/pkg/mcpserver/operation_manager.go (1)
router/pkg/schemaloader/loader.go (1)
NewOperationLoader(42-47)
🔇 Additional comments (19)
router-tests/testenv/testenv.go (1)
1434-1446: LGTM: Conditional storage provider setup prevents configuration conflicts.The conditional check ensures that a default filesystem storage provider is only added when no MCP storage provider ID is already configured. This prevents overwriting existing provider configurations and avoids duplicate provider registrations, which is essential for the new hot reload testing scenarios.
The logic correctly groups the storage provider setup and provider ID assignment within the same conditional block, maintaining consistency.
router/pkg/config/fixtures/full.yaml (1)
52-54: LGTM: Hot reload configuration follows best practices.The new
hot_reload_configsection provides sensible defaults:
enabled: falseensures backward compatibility and follows the agreed approach of having hot reload disabled by default
interval: '10s'strikes a good balance between responsiveness and resource efficiency for the polling-based file watcherThe configuration structure is well-organized under the MCP section and properly formatted.
router/pkg/config/testdata/config_full.json (1)
160-164: Hot-reload configuration wired correctly – no issues spottedThe new
HotReloadConfigblock follows existing duration conventions (nanoseconds) and keeps the previousRouterURLfield syntactically valid by adding the comma.
Nothing further to flag here.router/pkg/config/testdata/config_defaults.json (1)
125-129: Defaults updated consistentlyThe default configuration now exposes the same
HotReloadConfigshape as the full fixture. The interval (10 s) is a sensible conservative default.
LGTM.router/pkg/config/config.go (2)
916-916: LGTM: Clean configuration field addition.The new
HotReloadConfigfield is well-integrated into theMCPConfigurationstruct, following established patterns for YAML serialization and environment variable naming.
928-932: LGTM: Well-designed configuration struct with sensible defaults.The
MCPOperationsHotReloadConfigstruct has thoughtful design choices:
Enabled: falseby default ensures safe feature rollout
10sinterval provides reasonable balance between responsiveness and system load- Clear field names and consistent environment variable naming
router/pkg/mcpserver/operation_manager.go (1)
35-41: LGTM!The method signature update properly propagates the hot reload parameters to the underlying loader.
router/pkg/schemaloader/loader.go (2)
142-162: Consider returning watcher creation errorsWhen the watcher creation fails, the error is only logged. This means hot reload won't work but the initial load succeeds. While this might be intentional for graceful degradation, it could lead to confusion when hot reload silently fails to work.
Consider whether watcher creation errors should fail the entire operation:
if err != nil { cancel() l.Logger.Error("Could not create watcher", zap.Error(err)) + return nil, fmt.Errorf("failed to create watcher for hot reload: %w", err) }Alternatively, if graceful degradation is desired, consider adding a warning-level log to make it more visible.
163-171: Well-implemented error handlingThe goroutine properly distinguishes between expected context cancellation and unexpected errors, with appropriate logging for each case.
router-tests/mcp_hot_reload_test.go (2)
18-206: Excellent test coverage!The tests comprehensively cover the hot reload functionality:
- Proper use of temporary directories for test isolation
- Good coverage of add/update/remove scenarios
- Appropriate use of
EventuallyWithTfor async operations- Well-structured test assertions
208-282: Good goroutine leak detection!The test properly verifies that all goroutines are cleaned up during MCP shutdown, which addresses the concern raised in previous reviews about goroutine cleanup.
router/pkg/watcher/watcher.go (1)
139-158: Well-implemented change detection logicThe implementation correctly handles:
- New file detection by tracking seen files
- Update detection via modification time comparison
- Deletion detection by comparing file counts and cleaning up the seen files map
router/pkg/mcpserver/server.go (7)
71-74: Well-structured hot reload configuration options.The new configuration fields are properly documented and follow the existing code style. The field names are clear and descriptive.
95-97: Good integration of hot reload fields into the server struct.The new fields are appropriately placed and follow the existing naming conventions. The
reloadOperationsChanchannel enables asynchronous communication for reload events.
223-225: Proper initialization of hot reload fields.The initialization correctly sets the hot reload configuration from options and creates the reload channel. The channel is properly initialized as an unbuffered channel.
286-292: Well-implemented option function for hot reload configuration.The
WithHotReloadfunction follows the established pattern for option functions and correctly sets both the hot reload flag and interval.
342-342: Context parameter addition improves cancellation support.Adding the context parameter to the
Reloadmethod signature enables proper cancellation and timeout handling for reload operations.
745-747: Public method exposes reload channel for external coordination.The
ReloadOperationsChannelmethod provides a clean interface for external components to access the reload channel, enabling proper coordination of hot reload events.
351-351: LoadOperationsFromDirectory context & parameters are correctly forwarded
- router/pkg/mcpserver/server.go:351 calls
s.operationsManager.LoadOperationsFromDirectory(ctx, s.operationsDir, s.reloadOperationsChan, s.hotReload, s.hotReloadInterval)- router/pkg/mcpserver/operation_manager.go:35 signature
func (om *OperationsManager) LoadOperationsFromDirectory(ctx context.Context, operationsDir string, reloadOperationsChan chan bool, hotReload bool, hotReloadInterval time.Duration) error- router/pkg/schemaloader/loader.go signature matches and also accepts the same parameters
No changes needed here—context and hot-reload settings are properly propagated.
| if mErr := s.mcpServer.Reload(ctx, executor.ClientSchema); mErr != nil { | ||
| return | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle reload errors without terminating the goroutine
If Reload returns an error, the goroutine exits and no further hot reloads will be processed. This could leave the system unable to recover from transient reload failures.
Consider logging the error and continuing to listen for reload signals:
 case <-s.mcpServer.ReloadOperationsChannel():
     s.logger.Log(zap.InfoLevel, "Reloading mcp server!")
     if mErr := s.mcpServer.Reload(ctx, executor.ClientSchema); mErr != nil {
-        return
+        s.logger.Error("Failed to reload MCP server", zap.Error(mErr))
     }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if mErr := s.mcpServer.Reload(ctx, executor.ClientSchema); mErr != nil { | |
| return | |
| } | |
| case <-s.mcpServer.ReloadOperationsChannel(): | |
| s.logger.Log(zap.InfoLevel, "Reloading mcp server!") | |
| if mErr := s.mcpServer.Reload(ctx, executor.ClientSchema); mErr != nil { | |
| s.logger.Error("Failed to reload MCP server", zap.Error(mErr)) | |
| } | 
🤖 Prompt for AI Agents
In router/core/graph_server.go around lines 1088 to 1090, the current code
returns immediately if s.mcpServer.Reload returns an error, causing the
goroutine to exit and stop processing further reloads. Modify the code to log
the error instead of returning, allowing the goroutine to continue running and
handle subsequent reload signals without termination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @endigma, Please let me know if it is the expected behaviour to further retry mcpServer reload operations even after reload failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
| Hi @endigma, I have updated the cosmo watcher to watch CRUD operations in a directory with a file filter. In this scenario, the filefilter would help us to tell the watcher to only watch graphql files. Thank you! | 
| Hi @endigma, please review this MR once you have time. | 
| Hi @melsonic , you can just re-request when its ready, can you resolve the conflict? I'll put this back in the queue for review. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, some high level review:
- You should add unit tests for the new watcher behaviour in watcher_test.go
- We don't need to support DirectoryandPathsoperating modes at the time same, we can useDirectoryas a way to build a list of paths or use a provided one, and we can rebuild the list of tracked files on every loop for the directory mode.
        
          
                router-tests/mcp_hot_reload_test.go
              
                Outdated
          
        
      | t.Run("List Updated User Operations On Addition and Removal", func(t *testing.T) { | ||
|  | ||
| operationsDir := t.TempDir() | ||
| storageProviderId := "mcp_hot_reload_test_id" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storageProviderId should be ID and it seems the same across all subtests so can we move it out to test scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can.
        
          
                router-tests/mcp_hot_reload_test.go
              
                Outdated
          
        
      |  | ||
| initialToolsCount := len(resp.Tools) | ||
|  | ||
| filePath := operationsDir + "/main.graphql" | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using filepath.Join is best for this sort of thing, and can it be given a more descriptive name? This applies to all subtests.
        
          
                router-tests/mcp_hot_reload_test.go
              
                Outdated
          
        
      | assert.EventuallyWithT(t, func(t *assert.CollectT) { | ||
|  | ||
| resp, err = xEnv.MCPClient.ListTools(xEnv.Context, toolsRequest) | ||
| assert.NoError(t, err) | ||
| assert.Len(t, resp.Tools, initialToolsCount) | ||
|  | ||
| // verity getEmployeeNotes operation tool is properly removed | ||
| require.NotContains(t, resp.Tools, mcp.Tool{ | ||
| Name: "execute_operation_get_employee_notes", | ||
| Description: "Executes the GraphQL operation 'getEmployeeNotes' of type query.", | ||
| InputSchema: mcp.ToolInputSchema{ | ||
| Type: "object", | ||
| Properties: map[string]interface{}{"id": map[string]interface{}{"type": "integer"}}, | ||
| Required: []string{"id"}, | ||
| }, | ||
| RawInputSchema: json.RawMessage(nil), | ||
| Annotations: mcp.ToolAnnotation{ | ||
| Title: "Execute operation getEmployeeNotes", | ||
| ReadOnlyHint: mcp.ToBoolPtr(true), | ||
| IdempotentHint: mcp.ToBoolPtr(true), | ||
| OpenWorldHint: mcp.ToBoolPtr(true), | ||
| }, | ||
| }) | ||
|  | ||
| }, 15*time.Second, 250*time.Millisecond) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using assert.Eventually... functions, we recently introduced an option to the watcher to directly provide a non-timing based tick source. You can check watcher_test.go for how to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll check and update the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of mcp hot reload, we don't have direct access to the watcher config options, the watcher is defined in the loader.go file. So, in this particular scenario it is not possible to use non-timing based tick source.
        
          
                router/pkg/watcher/watcher.go
              
                Outdated
          
        
      | listDirFilePaths := func() ([]string, error) { | ||
| var files []string | ||
| if options.Directory.DirPath != "" { | ||
| err := filepath.WalkDir(options.Directory.DirPath, func(path string, d fs.DirEntry, err error) error { | ||
| if err != nil { | ||
| return err | ||
| } | ||
| // Skip directories | ||
| if d.IsDir() { | ||
| return nil | ||
| } | ||
| // Skip if filter rejects the file | ||
| if options.Directory.Filter != nil && options.Directory.Filter(path) { | ||
| return nil | ||
| } | ||
| files = append(files, path) | ||
| return nil | ||
| }) | ||
| if err != nil { | ||
| return []string{}, fmt.Errorf("error walking directory %s: %w", options.Directory.DirPath, err) | ||
| } | ||
| } | ||
| return files, nil | ||
| } | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be extracted and tested separately? Also, "filter" semantics usually return true when something should remain in the list, not be removed from it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll create a util function outside New with unit test.
        
          
                router/pkg/watcher/watcher.go
              
                Outdated
          
        
      | if len(options.Paths) == 0 { | ||
| return nil, errors.New("path must be provided") | ||
| if len(options.Paths) == 0 && options.Directory.DirPath == "" { | ||
| return nil, errors.New("paths or directory must be provided") | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should error if both Paths and Directory are provided, and I don't believe filter is necessary at the moment. This will allow you to deduplicate all the loops below and integrate Directories as just a source for a list of paths instead of separate handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying the required design. It would definitely prevent duplicate runs from out function.
        
          
                router/pkg/watcher/watcher.go
              
                Outdated
          
        
      | } | ||
|  | ||
| prevModTimes := make(map[string]time.Time) | ||
| seenDirFilePaths := make(map[string]struct{}) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What purpose does this serve? You maintain it as an index of the currently known file paths but don't use it for anything that prevModTimes can't be used for from what I can tell. We can already detect deleted files by looping over the prevModTimes map and comparing it with os.Stat result of the expected file. We already handle this by resetting the time so should it be created again it will be detected as an update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into it.
| Hi @endigma, I have updated the PR incorporating your comments. However, I do have these following questions. 
 | 
        
          
                router/pkg/watcher/watcher.go
              
                Outdated
          
        
      | } | ||
| } | ||
|  | ||
| if options.Directory.DirPath != "" { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reloading the updated directory files after going through the previously listed files. The will allow us to remove deleted files from prevModTimes in one loop. However there is a catch, in case of a new file addition, instead of 2 ticks, it will consume 3 ticks to run the provided Callback func.
On the other hand, in case we list the updated directory files before going through options.Paths, we'll have to run another loop on prevModTimes to remove keys for files, that are deleted from the directory.
Currently I have implemented the first approach, let me know if the team prefers the 2nd option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would personally choose option 2, looping over a map and updating it is not terribly slow and hot reloading is nowhere near a hot path. More intuitive or consistent reload behavior is more important than raw speed of a single tick handler
| Hi @melsonic, I think having 0 paths to check is fine, provided we defensively prevent anything bad from happening in this case. This would allow gracefully handling a multi-tick wide total directory swap or something like a human speed drag-out -> drag-in operation. Generally with watcher code we prefer resilience over all, e.g not failing out when we hit an invalid state and instead just falling back to a previous working state until some iteration is usable. In this case I believe having mcp enabled but with 0 operations should not be a failure case, but I'm not sure at the moment | 
| And yes, watcher constructor should fail if the arguments themselves are both empty | 
| Hi @endigma, I am ready with my changes, please review it once you have time. | 
| Hi @endigma, Please review. P.S: If you are busy with some other high priority task, please take your time. Commenting just in case you missed my previous review notification. | 
| Hi @melsonic, this is low priority internally at the moment, but rest assured it's in the queue | 
| No problem @endigma, just wanted to know if it's in the queue or not. | 
| This PR was marked stale due to lack of activity. It will be closed in 14 days. | 
| Unstale | 
| This PR was marked stale due to lack of activity. It will be closed in 14 days. | 
| unstale | 
| Hi @melsonic, we've decided internally to adopt your work as the basis for our own fix for the original issue, as such, we'll be closing this PR for now. Thanks for your contribution! | 
cosmo.mp4
Motivation and Context
Checklist
/fixes #1886
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation