Snapshot frame #6


Merged
merged 10 commits into from
Mar 13, 2025

Conversation

DiegoTavares (Owner) commented Mar 7, 2025

All frames create a snapshot right after being spawned, with the intention of allowing a future recovery in
case rqd restarts. During startup, rqd reads all snapshots in the configured directory
(config.runner.snapshots_path) and attempts to reconnect to all snapshotted frames.

When rqd restarts, all currently running frames become parented by init/systemd, so the
recovery process is no longer able to bind to them and follow their execution directly.

To achieve similar behavior in recovery mode, running_frame polls its PID, waiting for it to
finish, and recovers the exit status and signal from a file. To support this, all frame commands are
now appended with logic that saves their exit status to a file.
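The exit-file idea described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the function name and file path are assumptions.

```shell
exit_file="/tmp/frame.exit"   # hypothetical exit-status file path

# Run the given command, persist its exit status to the exit file,
# and propagate the same status to the caller.
run_with_exit_file() {
    "$@"
    code=$?
    echo "$code" > "$exit_file"
    return $code
}

run_with_exit_file true
```

Because the status survives in the file, a restarted rqd can read it even after the original parent process is gone.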

Summary by CodeRabbit

  • New Features

    • Updated service configuration with a new gRPC port and dedicated snapshot storage.
    • Enhanced the command-line interface by requiring explicit server address parameters with sensible defaults and adding a frame launch command that supports environment variable customization.
    • Introduced improved server operations and frame management, including snapshot recovery and robust process handling.
    • Added serialization capabilities for several data structures, enhancing data interchange support.
    • Introduced a cache for managing currently running frames, improving performance and resource management.
    • Added a new method for managing CPU reservations and releases, enhancing resource management capabilities.
    • Introduced a new FrameManager for managing frame lifecycles and improved error handling during frame operations.
  • Refactor

    • Streamlined dependency updates and resource validation logic.
    • Consolidated logging and process management for better clarity and performance.
    • Improved the organization of module imports and public interfaces.
    • Updated error handling mechanisms to provide more detailed feedback.
  • Bug Fixes

    • Enhanced error handling during CPU core reservations and releases, ensuring more robust operation.

Move frame control logic out of rqd_servant into the manager
When rqd restarts, all currently running frames become parented by
init/systemd, so the recovery process is no longer able
to bind to them and follow their execution directly.

To achieve similar behaviour in recovery mode, running_frame
polls its PID, waiting for it to finish, and recovers the exit status
and signal from a file. To support this, all frame commands are
now appended with logic that saves their exit status to a file.
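A minimal sketch of the polling side of this, assuming a Linux /proc filesystem; the function names are illustrative, not the PR's API:

```rust
use std::{fs, thread, time::Duration};

// Poll until a detached PID disappears from /proc (i.e. the process has
// exited and been reaped), since we can no longer waitpid() on it.
fn wait_for_pid(pid: u32) {
    while fs::metadata(format!("/proc/{pid}")).is_ok() {
        thread::sleep(Duration::from_millis(200));
    }
}

// Read the exit code the wrapper persisted to the exit file.
fn read_exit_code(exit_file: &str) -> Option<i32> {
    fs::read_to_string(exit_file).ok()?.trim().parse().ok()
}
```

Note that `/proc/<pid>` lingers while a process is a zombie, so this only works once something has reaped the child (init/systemd does this for orphans).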

coderabbitai bot commented Mar 7, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

The changes update various Cargo.toml files to specify a workspace resolver and adjust dependencies, update configuration values, and restructure command-line handling across multiple crates. In the dummy-cuebot crate, new structures and modules are introduced to handle server and client commands with enhanced argument parsing and environment variable management. The opencue-proto crate now supports Serde serialization via added dependencies and derive macros. In the rqd crate, a new frame management system is integrated: a dedicated FrameManager replaces direct cache usage, with modifications to frame command construction, logging, snapshot recovery, and overall servant initialization.

Changes

File(s) Change Summary
Cargo.toml (workspace, crates/dummy-cuebot, crates/opencue-proto, crates/rqd) Workspace Cargo.toml: Added resolver = "3". Dummy-cuebot: Removed itertools, added users. Opencue-proto: Added serde (with derive) and serde_derive. rqd: Added bincode; updated uuid (added serde feature) and nix dependencies.
config/rqd.fake_linux.yaml Updated grpc.rqd_port from 4778 to 8444; added new runner.snapshots_path entry.
crates/dummy-cuebot/src/{main.rs, report_servant.rs, rqd_client.rs} Updated CLI argument parsing (made fields required with defaults); introduced new structs (LaunchFrameCmd, EnvVars); added DummyCuebotServer with a start_server method; and implemented DummyRqdClient with connection building and frame launching logic.
crates/opencue-proto/{build.rs, src/lib.rs} Added serde serialization derives to multiple types in build.rs; introduced a new public resource_id method on RunFrame in src/lib.rs.
crates/rqd/src/{config/config.rs, frame/, main.rs, servant/, system/*} Updated RunnerConfig by adding snapshots_path and changing run_as_user default; introduced new module RunningFrameCache and a robust FrameManager for frame lifecycle (including spawning, snapshot recovery, and logging updates); updated command builders (added exit_file support); and refactored servant modules to use FrameManager over the legacy cache.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as DummyCuebotCli
    participant Client as DummyRqdClient
    participant Manager as FrameManager
    participant Server as RqdServant

    CLI->>Client: Parse & prepare launch command
    Client->>Manager: spawn(run_frame) request
    Manager->>Server: Initiate frame launch in a new thread
    Server-->>Manager: Acknowledge launch
    Manager-->>Client: Return launch status
sequenceDiagram
    participant Manager as FrameManager
    participant SnapshotDir as Snapshots Directory
    participant Cache as RunningFrameCache

    Manager->>SnapshotDir: Read snapshot files
    SnapshotDir-->>Manager: Provide valid snapshots
    Manager->>Cache: Recover & insert running frames
    Cache-->>Manager: Confirm recovery

Poem

I'm a coding rabbit, hopping through the code,
New keys and commands light an upgraded road.
Frames now dance under a manager’s keen eye,
Snapshots recovered as moments flutter by.
With each small update, I cheer and I play,
Bouncing in bytes to a brighter new day! 🐰✨

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d1fb300 and 2a8b94e.

📒 Files selected for processing (1)
  • crates/rqd/src/frame/running_frame.rs (21 hunks)


coderabbitai bot left a comment

Actionable comments posted: 3

🔭 Outside diff range comments (1)
crates/rqd/src/servant/running_frame_servant.rs (1)

24-37: ⚠️ Potential issue

Incomplete implementation of RunningFrame trait methods.

The implementations of the kill and status methods are still marked as todo!(). These methods need to be implemented for the snapshot recovery functionality to work correctly.

These methods are critical for the frame management functionality described in the PR objectives. Please implement them before merging this PR to ensure the snapshot mechanism works properly.

🧹 Nitpick comments (12)
crates/rqd/src/main.rs (1)

89-94: Helpful system capability documentation

These comments provide valuable information about the Linux capabilities required for process group management, which is essential for the frame recovery mechanism described in the PR objectives.

Consider moving these comments to a more appropriate location such as documentation files or as code documentation closer to where these capabilities are used.

crates/dummy-cuebot/src/rqd_client.rs (2)

16-27: Consider parallelizing or pooling gRPC connections for better concurrency.

Currently, all requests share a single Arc<Mutex<RqdInterfaceClient<Channel>>>, which can become a bottleneck if multiple frames are launched concurrently. You may consider using a connection pool or creating ephemeral clients per request to improve scalability.


43-44: Make the fallback username configurable.

Using daemon as the default username is acceptable in many cases, but it might be more flexible to let users configure this fallback, especially if environment variables are unavailable.

crates/rqd/src/frame/cache.rs (1)

25-55: Be mindful of performance when replicating large caches.

into_running_frame_vec() creates a full clone of the cached data, which can be expensive for large workloads. If this method is called frequently or with big data sets, consider an incremental, streaming, or lazy approach to reduce overhead.
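One low-cost alternative, sketched here with a stand-in frame type rather than the PR's actual RunningFrame, is to hand out Arc handles instead of deep copies:

```rust
use std::sync::Arc;

// Stand-in for the cached frame type; the real RunningFrame is richer.
struct RunningFrame {
    frame_id: String,
}

// Cloning an Arc only bumps a reference count, so this snapshot of the
// cache shares the frame data instead of deep-copying each entry.
fn snapshot_cache(cache: &[Arc<RunningFrame>]) -> Vec<Arc<RunningFrame>> {
    cache.iter().cloned().collect()
}
```

Whether this applies depends on whether callers need an owned copy or just read access to the frames.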

crates/rqd/src/frame/manager.rs (3)

38-42: Consider using an async runtime for concurrency rather than manual threads.

std::thread::spawn is effective for isolating frame logic, but using tokio::spawn or spawn_blocking might integrate better with async patterns, especially if you need to monitor or await tasks later.


171-176: Ensure panics are appropriately surfaced or rethrown.

Catching and logging panics helps prevent crashing the entire process, but you might still wish to surface them to higher-level contexts, e.g., to mark frames as “failed” or trigger further cleanup.
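One way to surface such panics with the standard library alone (a sketch, not the PR's code) is to join the frame thread and map a panic into an error the caller can act on:

```rust
use std::thread;

// Join the frame thread so a panic is surfaced to the caller, which can
// then mark the frame as failed instead of only logging the panic.
fn spawn_and_report<F>(frame_logic: F) -> Result<(), String>
where
    F: FnOnce() + Send + 'static,
{
    let handle = thread::spawn(frame_logic);
    // join() returns Err when the spawned thread panicked.
    handle.join().map_err(|_| "frame thread panicked".to_string())
}
```

In a fire-and-forget design the join would happen on a monitor task rather than the spawning call site, but the Err-on-panic contract of JoinHandle::join is the same.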


245-260: Expand error representation if needed.

Current error variants cover essential scenarios, but you may need more nuanced errors for advanced debugging or partial failure cases. Extending FrameManagerError can improve traceability and user feedback.

crates/rqd/src/servant/rqd_servant.rs (1)

41-43: Panic on setup failure.
Panicking immediately may be acceptable for a core daemon, but consider returning an error to allow graceful shutdown or retries if partial context is available.

crates/rqd/src/frame/running_frame.rs (4)

34-46: RunningFrame serialized & fields made public.
Serialization is needed for snapshots, but exposing many fields publicly can risk misuse. Consider documented getters/setters if further encapsulation is needed.


423-425: Docker run method is todo!().
Implementation is pending.

Would you like help drafting the Docker-based runner logic or opening an issue to track it?


432-459: spawn_logger with a separate thread monitoring raw outputs.
This approach works but you might consider an internal buffer or advanced log aggregator if performance or line ordering is critical.


686-705: snapshot_path creation.
Throws an error if PID is missing. This is correct but watch out for race conditions if PID is updated after partial initialization.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea8f8f3 and b964e46.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • Cargo.toml (1 hunks)
  • config/rqd.fake_linux.yaml (2 hunks)
  • crates/dummy-cuebot/Cargo.toml (1 hunks)
  • crates/dummy-cuebot/src/main.rs (3 hunks)
  • crates/dummy-cuebot/src/report_servant.rs (2 hunks)
  • crates/dummy-cuebot/src/rqd_client.rs (1 hunks)
  • crates/opencue-proto/Cargo.toml (1 hunks)
  • crates/opencue-proto/build.rs (1 hunks)
  • crates/opencue-proto/src/lib.rs (1 hunks)
  • crates/rqd/Cargo.toml (2 hunks)
  • crates/rqd/src/config/config.rs (3 hunks)
  • crates/rqd/src/frame/cache.rs (1 hunks)
  • crates/rqd/src/frame/frame_cmd.rs (3 hunks)
  • crates/rqd/src/frame/logging.rs (1 hunks)
  • crates/rqd/src/frame/manager.rs (1 hunks)
  • crates/rqd/src/frame/mod.rs (1 hunks)
  • crates/rqd/src/frame/running_frame.rs (20 hunks)
  • crates/rqd/src/main.rs (3 hunks)
  • crates/rqd/src/servant/mod.rs (2 hunks)
  • crates/rqd/src/servant/rqd_servant.rs (3 hunks)
  • crates/rqd/src/servant/running_frame_servant.rs (1 hunks)
  • crates/rqd/src/system/machine.rs (1 hunks)
  • crates/rqd/src/system/mod.rs (0 hunks)
💤 Files with no reviewable changes (1)
  • crates/rqd/src/system/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (74)
Cargo.toml (1)

7-7:

✅ Verification successful

Verify the resolver version

The Cargo.toml specifies resolver = "3", but as far as I'm aware, Cargo only supports resolver versions "1" and "2" in the current stable releases. Similarly, edition = "2024" on line 6 refers to a future Rust edition that hasn't been released yet (current is 2021).


🌐 Web query:

What are the valid Cargo resolver versions?

💡 Result:

Cargo's resolver versions determine how dependencies and features are handled during the build process. There are three resolver versions available, each with distinct behaviors and compatibility requirements:

Valid Resolver Versions

  1. Resolver "1" (default for Rust editions 2015/2018):

    • Merges features across dependencies
    • Original behavior with simpler feature unification
    • MSRV: Works with Rust 1.51+ [1][2]
  2. Resolver "2" (default for Rust 2021 edition):

    • Avoids feature unification in specific cases:
      • Platform-specific dependencies for non-target builds
      • Separation between normal/built-dependencies and dev-dependencies
    • Requires explicit workspace declaration in virtual workspaces [1][2][4]
  3. Resolver "3" (default for Rust 2024 edition, requires Rust 1.84+):

    • Changes default behavior for resolver.incompatible-rust-versions
    • Falls back to MSRV-compatible versions when newer dependencies require a Rust version exceeding your MSRV [1][3]

Key Implementation Details

  • Specified via resolver = "N" in Cargo.toml or workspace definition
  • Workspace-level setting affecting all members
  • Later editions automatically opt-in to newer resolvers (e.g., 2021 → resolver 2)
  • MSRV considerations become more systematic with resolver 3 [1][3]

The resolver version impacts dependency resolution strategies and feature handling, with newer versions providing more precise control at the cost of potential duplicated builds when features diverge.

Citations:


Verified: Cargo resolver and edition versions are valid

The Cargo.toml entries using resolver = "3" and edition = "2024" are now supported. Resolver version 3 is valid for Rust 2024 (requiring Rust 1.84+), so these settings are appropriate if your toolchain meets that prerequisite. Please confirm your environment is updated accordingly.
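For reference, a workspace manifest using these settings might look like the following fragment; the member list mirrors the crates named in this PR, and the exact layout of the real Cargo.toml may differ:

```toml
# Requires a toolchain with Rust 1.84+ (resolver "3" / edition 2024).
[workspace]
resolver = "3"
members = ["crates/rqd", "crates/opencue-proto", "crates/dummy-cuebot"]

[workspace.package]
edition = "2024"
```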

crates/dummy-cuebot/Cargo.toml (2)

23-23: Possible annotation error

According to the AI summary, itertools was supposed to be removed, but it appears as an added/modified line in the annotation. Please verify if this dependency should be added or removed.


31-31: Addition of users crate aligns with PR objectives

The addition of the users crate (which provides access to Unix users and groups information) makes sense for the frame snapshot feature, particularly when handling processes that become parented by init/systemd after rqd restarts.

crates/opencue-proto/src/lib.rs (1)

75-77: LGTM! Consistent implementation for resource_id

This implementation follows the same pattern as the existing ID methods, making for a consistent API. The method properly handles conversion errors by falling back to Uuid::nil().

crates/rqd/src/frame/mod.rs (1)

1-5: Well-structured module organization

The module structure provides a clean separation of concerns for frame management, with public modules (cache, manager, running_frame) exposing the necessary API while keeping implementation details internal (frame_cmd, logging). This organization aligns well with the PR objectives of implementing a snapshot mechanism for frames.

crates/opencue-proto/Cargo.toml (1)

21-22: Good addition of serialization dependencies

Adding Serde support is appropriate for implementing the frame snapshot functionality mentioned in the PR objectives. These dependencies will enable serialization and deserialization of protocol buffer types, which is essential for storing frame state to disk and recovering it when rqd restarts.

config/rqd.fake_linux.yaml (1)

13-14:

✅ Verification successful

New snapshot configuration path looks good

The new snapshots path configuration directly supports the PR objective of implementing a snapshot mechanism for frames. Using $HOME/.rqd/snapshots follows standard conventions for user-specific application data.

Make sure the code handles directory creation if it doesn't exist.


🏁 Script executed:

#!/bin/bash
# Check if the code creates the snapshot directory if it doesn't exist
rg "create_dir_all|mkdir" --type rust

Length of output: 226


Snapshot Directory Creation Verified – Approved

The new snapshots configuration using $HOME/.rqd/snapshots correctly follows user-data conventions and supports the PR objective. Our review confirms that the code handles directory creation, as evidenced by the use of fs::create_dir_all in both:

  • crates/rqd/src/frame/logging.rs
  • crates/rqd/src/servant/rqd_servant.rs

No further action is required.

crates/rqd/src/frame/logging.rs (1)

67-67: Code simplification improvement

Good refactoring of the loop condition. This change simplifies the code by eliminating a temporary variable and directly formatting the path string in the condition check, while maintaining the same functionality.

crates/rqd/src/main.rs (5)

7-7: Good module organization

The import change properly reflects the new architecture that uses a dedicated FrameManager to handle frame operations, which aligns with the PR's objectives for frame snapshot and recovery.


68-72: Well-structured FrameManager implementation

The FrameManager implementation is well-organized, encapsulating the configuration, frame cache, and machine monitor components. This structure aligns with the PR objective of improving frame management and recovery.


84-84: API update maintains consistency

The update to use frame_manager instead of running_frame_cache in the servant call maintains architectural consistency by centralizing frame management through the FrameManager.


43-43:

✅ Verification successful

Verify RunningFrameCache initialization

The initialization of running_frame_cache no longer wraps it in an Arc directly. Make sure that the RunningFrameCache::init() method returns an Arc internally, or that this change is consistent with how it's used throughout the codebase.


🏁 Script executed:

#!/bin/bash
# Check the definition of RunningFrameCache::init
rg "impl RunningFrameCache" -A 20

Length of output: 1544


Action: Confirm RunningFrameCache Initialization

Upon verifying the implementation of RunningFrameCache::init(), it has been confirmed that it returns an Arc<Self> (located in crates/rqd/src/frame/cache.rs). This update is consistent with its usage in main.rs, and no further changes are required.


80-81:

✅ Verification successful

Important snapshot recovery feature

This is a key implementation of the PR objective for frame recovery. The recover_snapshots call reads snapshots from the configured path and attempts to reconnect to frames after rqd restarts.

Verify error handling for snapshot recovery failures.


🏁 Script executed:

#!/bin/bash
# Check how errors are handled in the recover_snapshots implementation
rg "fn recover_snapshots" -A 20

Length of output: 1581


Snapshot Recovery Error Handling Verified

The implementation in crates/rqd/src/frame/manager.rs properly handles errors during snapshot recovery. The recover_snapshots function maps errors from reading the snapshot directory to a descriptive error message, logs a warning, and returns an error using miette!. This confirms that failures in reading snapshots are correctly managed, aligning with the PR's objectives for robust frame recovery.
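The directory-scan half of that recovery flow can be sketched with the standard library alone; `list_snapshots` is an illustrative name, and the real `recover_snapshots` additionally deserializes each file and rebuilds the frame:

```rust
use std::{fs, io, path::PathBuf};

// List candidate snapshot files under the configured snapshots_path,
// surfacing the io::Error (e.g. missing directory) to the caller.
fn list_snapshots(snapshots_path: &str) -> io::Result<Vec<PathBuf>> {
    fs::read_dir(snapshots_path)?
        .map(|entry| entry.map(|e| e.path()))
        .collect()
}
```

Collecting an iterator of `io::Result<PathBuf>` into `io::Result<Vec<PathBuf>>` short-circuits on the first error, which matches the "fail the whole recovery with a descriptive error" behaviour described above.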

crates/rqd/src/system/machine.rs (1)

17-20: LGTM! Import reorganization looks good.

The restructuring of imports reflects a shift from direct dependency on running_frame::RunningFrameCache to a more modular approach with frame::cache::RunningFrameCache. This aligns with the snapshot frame implementation mentioned in the PR objectives.

Also applies to: 23-23

crates/rqd/Cargo.toml (3)

18-18: New dependency added for serialization.

Adding bincode makes sense for serializing frame snapshots as mentioned in the PR objectives.


36-36: LGTM! Added serde feature for UUID.

Enabling the serde feature for UUID is necessary to properly serialize/deserialize frames with UUIDs in the snapshot system.


44-44: LGTM! Process and signal features added.

The addition of process and signal features to the nix crate aligns perfectly with the PR's goal of monitoring process IDs and capturing exit signals during frame recovery.

crates/rqd/src/servant/mod.rs (1)

2-2: Migrated from RunningFrameCache to FrameManager.

This is a significant architectural improvement that aligns with the PR objectives. The shift from a simple cache to a dedicated manager enables the new snapshot mechanism for frames and supports the recovery process when rqd restarts.

Also applies to: 20-24, 28-29

crates/dummy-cuebot/src/report_servant.rs (2)

1-13: LGTM! Import reorganization and additions for server functionality.

The imports have been restructured to better support the server functionality introduced in this PR.


58-71: Well-implemented server functionality for testing.

The addition of DummyCuebotServer with the start_server method provides a clean interface for testing the frame snapshot and recovery functionality described in the PR objectives.

The server implementation correctly:

  • Binds to all interfaces (0.0.0.0)
  • Configures the RqdReportInterfaceServer
  • Handles errors appropriately with diagnostic information
crates/rqd/src/frame/frame_cmd.rs (4)

6-7: New field added to support exit code capturing.

This field is part of the snapshot mechanism implementation mentioned in the PR objectives, allowing the frame to save its exit status to a file for later recovery.


15-18: LGTM - FrameCmdBuilder initialization updated.

Properly initializes the new exit_file field to None by default.


21-41: Enhanced build method with exit code capture functionality.

The build method now properly handles the exit file path by creating a command that saves the exit code to a file. This is a key part of the snapshot mechanism that enables frame recovery after rqd restarts.

The implementation captures the exit code of the command and writes it to the specified file before exiting with the same code, which will allow the recovery process to read the exit status accurately.


99-102: Added builder method for exit file path.

Follows the builder pattern consistently with the rest of the class.
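Based on that description, the generated wrapper likely resembles the following sketch; the exact shell syntax emitted by FrameCmdBuilder may differ, and `wrap_with_exit_file` is an assumed name:

```rust
// Assumed shape of the exit-file wrapping: run the command, capture its
// exit code, write it to the exit file, then re-exit with the same code.
fn wrap_with_exit_file(cmd: &str, exit_file: &str) -> String {
    format!("{cmd}; code=$?; echo $code > {exit_file}; exit $code")
}
```

Re-exiting with the captured code keeps the wrapper transparent to anything still watching the process directly, while the file serves the recovery path.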

crates/opencue-proto/build.rs (1)

48-66: Added Serde serialization support for protocol buffer types.

These changes enable serialization/deserialization for various types that will be used in the snapshot mechanism. This is necessary for saving frame state to disk and recovering it later.

The types being serialized are all related to process statistics and frame running state, which aligns with the PR objective of implementing a snapshot mechanism for frames.

crates/rqd/src/config/config.rs (3)

86-86: Added Serialize derive for RunnerConfig.

This enables the RunnerConfig struct to be serialized, which is needed for the snapshot feature.


99-99: Added snapshots_path configuration option.

This new field supports the snapshot mechanism by allowing configuration of where frame snapshots will be stored.


119-119:

❓ Verification inconclusive

Changed run_as_user default to false.

This change may affect how frames are executed. Ensure this change is intentional and consistent with the overall design.


🏁 Script executed:

#!/bin/bash
# Check if there are any direct uses of this configuration value in the codebase
rg "run_as_user" --type rust

Length of output: 428


Attention: Verify Impact of run_as_user Default Change

The new default (false) in crates/rqd/src/config/config.rs has been confirmed, and its usage appears consistent across the codebase:

  • In crates/rqd/src/frame/running_frame.rs, the flag is checked and explicitly set to false.
  • In crates/dummy-cuebot/src/main.rs, it’s also referenced for configuration.

Please verify that this behavior aligns with the overall design and that no component relies on a different default. Double-check that any logic previously expecting a different default is updated accordingly, and ensure the relevant tests cover these execution paths.

crates/rqd/src/servant/running_frame_servant.rs (3)

9-9: Updated import to use FrameManager instead of RunningFrameCache.

This change is part of the architectural shift from a cache-based approach to a more centralized frame management system.


13-13: Replaced RunningFrameCache with FrameManager.

The servant now uses the new FrameManager, aligning with the PR objective of implementing a snapshot mechanism for frames.


17-18: Updated initialization to use FrameManager.

The initialization is properly updated to use the new FrameManager structure.

crates/dummy-cuebot/src/rqd_client.rs (1)

65-73: Looks good.

Locking the client, building the request, and error handling with into_diagnostic and wrap_err are well-structured. This design cleanly surfaces any connectivity issues to the caller.

crates/rqd/src/frame/cache.rs (1)

16-21: Cache initialization looks good.

Using DashMap::with_capacity(30) provides a sensible initial capacity. The code is clear, and the concurrency approach with DashMap is appropriate.

crates/rqd/src/servant/rqd_servant.rs (8)

1-3: Imports look consistent and necessary.
These additions for fs, path, and Arc align with the usage in this file.


6-6: New import for FrameManager.
This import reflects the refactor to manage frames through a dedicated manager.


10-10: Additional gRPC-related imports.
The code pulls in request/response types and the rqd_interface_server::RqdInterface trait. This is consistent with implementing the rqd gRPC interface.

Also applies to: 22-22


24-24: async_trait import from tonic.
This is standard for asynchronous service trait implementations.


32-32: Added frame_manager field to RqdServant.
Storing Arc<FrameManager> is a clean way to coordinate frame lifecycle.


39-39: Constructor parameter for frame_manager.
Accepting Arc<FrameManager> centralizes frame management at init time, improving modularity.


51-58: setup function ensures snapshot path.
Creating the directory if missing is straightforward. Error handling is already bubbled up. Code looks fine.


95-95: Spawning a frame via frame_manager.
Delegating frame spawning to the manager helps maintain a single source of truth for frame lifecycle. Good approach.

crates/dummy-cuebot/src/main.rs (9)

1-1: Additional imports and module references.
The new imports (HashMap, FromStr, miette, report_servant, rqd_client) are aligned with the updated functionality for server-client CLI.

Also applies to: 3-3, 4-4, 8-8


23-30: ReportServerCmd now enforces a default port.
Using an explicit default value adds clarity and avoids handling Option<u16>. No issues.


35-41: hostname field with a default of "localhost".
Requiring a non-optional hostname is a solid approach.


43-49: Rqd client port with a default of 4344.
Similar pattern to ReportServerCmd. Straightforward.


61-67: New LaunchFrameCmd struct.
Captures command, environment, and user-run flag. This expands functionality elegantly.


69-70: EnvVars wrapper struct.
Encapsulating environment variables in a dedicated type can help keep logic organized.


72-89: FromStr for EnvVars.
Parsing comma-separated key=value pairs is handled neatly. The error messages are also appropriately descriptive.
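A minimal version of that parsing, assuming the same comma-separated KEY=VALUE format; the PR's actual FromStr impl may differ in error detail:

```rust
use std::collections::HashMap;

// Parse "A=1,B=2" into a map; entries without '=' are rejected.
fn parse_env_vars(s: &str) -> Result<HashMap<String, String>, String> {
    s.split(',')
        .filter(|p| !p.is_empty())
        .map(|pair| {
            pair.split_once('=')
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .ok_or_else(|| format!("invalid env var entry: {pair}"))
        })
        .collect()
}
```

Collecting an iterator of `Result<(K, V), E>` into `Result<HashMap<K, V>, E>` stops at the first malformed pair, which gives the descriptive per-entry errors the comment praises.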


95-96: Starting the DummyCuebot server.
The call to DummyCuebotServer::start_server is straightforward. No issues spotted.


98-100: Constructing client and launching frame.
Building DummyRqdClient and passing environment variables from EnvVars meets the new structure’s requirements and is well-structured.

Also applies to: 101-117

crates/rqd/src/frame/running_frame.rs (24)

6-6: Added Read to imports.
Used for reading data from files or streams. Fits the new exit-file reading logic.


16-16: ExitStatusExt and mpsc introduced.
ExitStatusExt to retrieve signals on Unix, and mpsc for logging thread communication. Both are valid.


19-25: Tracing, command builder, and serialization imports.
These additions (tracing, FrameCmdBuilder, Serialize, Deserialize, sysinfo) are central to the new snapshot and logging functionalities.


30-30: Importing local logging components.
FrameLogger and FrameLoggerBuilder unify how logs are streamed.


50-53: Exit file path & custom (de)serialization hooks.
Storing exit status externally and skipping the thread handle in snapshots are sensible for the new recovery logic.


56-56: RunningState gains exit_signal and custom skip logic.
Including the exit signal helps differentiate normal vs. signal-based termination. The skip logic for the thread handle is well-executed.

Also applies to: 60-66


72-72: Default for exit_signal.
Ensures the field is initialized even when no signal occurred. Straightforward.


77-97: FrameStats now cloned and serialized.
This helps capture state in snapshots. The usage of sysinfo::ChildrenProcStats fits well with the broader resource monitoring approach.


140-166: Initialization with resource_id and exit file path.
Appending resource_id to logs ensures uniqueness. Verify that resource_id is always set.


188-203: Doc comments for update_launch_thread_handle.
Clear explanation, ensuring the concurrency aspect is understood.


204-219: update_exit_code_and_signal method.
Setting both exit code and signal closes the gap for signal-based terminations.


280-325: run method handles recovery vs. normal flow.
Snapshot clearing is prudent. Consider concurrency if run is called multiple times.


328-421: run_inner method with updated exit code logic.
Snapshot after spawn ensures we can recover if RQD crashes mid-execution. Overall flow is coherent.


461-505: recover_inner for resuming frames.
Reading the exit file and waiting for the process are good strategies.


516-517: pid() accessor.
Provides the optional process ID. The Mutex locking here is straightforward and thread-safe.


539-576: read_exit_file to parse the final exit code.
Error handling is robust; reusing the 128+signal pattern is consistent with Unix conventions.
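
The 128+signal convention referenced here can be sketched in isolation (hypothetical `decode_exit` helper, assuming the exit file stores the raw shell-style status):

```rust
// Hypothetical helper illustrating the Unix shell convention: statuses
// >= 128 encode termination by signal as 128 + signal number.
fn decode_exit(raw: i32) -> (i32, Option<i32>) {
    if raw >= 128 {
        (raw, Some(raw - 128)) // e.g. 143 -> SIGTERM (15)
    } else {
        (raw, None) // plain exit code, no signal
    }
}

fn main() {
    assert_eq!(decode_exit(0), (0, None));
    assert_eq!(decode_exit(143), (143, Some(15))); // 128 + SIGTERM
    println!("ok");
}
```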


593-623: wait function polls for process termination.
Using kill(pid, None) plus a 500ms sleep is typical on Unix. Fine for basic usage.
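
A std-only stand-in for the `kill(pid, None)` probe can be sketched against Linux's `/proc` (an assumption: zombies are treated as exited here so the sketch works even when the caller has not yet reaped the child, whereas the PR relies on init reaping reparented frames):

```rust
use std::{fs, thread, time::Duration};

// On Linux, a PID is still running if /proc/<pid>/stat exists and its
// state field is not 'Z' (zombie). This is a sketch, not the PR's code.
fn is_alive(pid: u32) -> bool {
    match fs::read_to_string(format!("/proc/{}/stat", pid)) {
        Ok(stat) => {
            // The state field follows the parenthesized command name.
            let after_comm = stat.rsplit(')').next().unwrap_or("");
            after_comm.trim_start().chars().next() != Some('Z')
        }
        Err(_) => false, // no /proc entry: the process is gone
    }
}

// Poll every 500ms, mirroring the loop described above, with a max poll
// count so the sketch cannot spin forever.
fn wait_for_exit(pid: u32, max_polls: u32) -> bool {
    for _ in 0..max_polls {
        if !is_alive(pid) {
            return true;
        }
        thread::sleep(Duration::from_millis(500));
    }
    false
}

fn main() {
    let mut child = std::process::Command::new("sleep").arg("0.2").spawn().unwrap();
    assert!(wait_for_exit(child.id(), 10));
    child.wait().unwrap(); // reap the zombie
    println!("ok");
}
```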


707-719: snapshot uses bincode for persistence.
Simple approach. If partial or concurrent writes are possible, consider a more atomic solution.
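
One common shape for that "more atomic solution" is write-to-temp-then-rename, sketched here under the assumption that the temp file and the snapshot live on the same filesystem (where `rename(2)` is atomic):

```rust
use std::{fs, io::Write, path::Path};

// Sketch of an atomic snapshot write: serialize into a sibling temp file,
// fsync it, then rename over the real path. Readers never observe a
// partially written snapshot file.
fn write_snapshot_atomic(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("snapshot.tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush to disk before the rename
    }
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("frame-demo.snapshot");
    write_snapshot_atomic(&path, b"snapshot-bytes")?;
    assert_eq!(fs::read(&path)?, b"snapshot-bytes");
    fs::remove_file(&path)?;
    println!("ok");
    Ok(())
}
```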


721-724: clear_snapshot removes snapshot file.
Method is direct. The caller handles ignoring errors if the snapshot doesn’t exist.


748-780: from_snapshot rehydrates RunningFrame.
Properly ensures the process is still alive and updates with the new config.


782-790: is_process_running using sysinfo.
Partial refresh of a single PID is a good optimization.


849-859: Custom serialization for Arc<Mutex<RunningState>>.
Skipping the thread handle is a clever workaround to avoid non-serializable data. Well handled.

Also applies to: 861-877, 879-889


900-901: Additional internal imports in tests.
Refers to RunnerConfig and FrameLogger. Harmless enhancements.

Also applies to: 902-902


965-967: Augmented test assertions.
These changes check both exit code and signal, ensuring coverage of normal and failing scenarios.

Also applies to: 971-974, 985-988, 1006-1007, 1010-1011, 1012-1015, 1030-1031, 1033-1037, 1050-1051, 1052-1056, 1073-1074, 1080-1084, 1100-1101, 1108-1112, 1126-1127, 1128-1133


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
crates/dummy-cuebot/src/main.rs (3)

69-89: Environment variables parsing has limitations.

The current EnvVars implementation doesn't support environment variables whose values contain =, which is a valid use case (e.g., PATH=/usr/bin:/usr/local/bin).

Consider using a more robust parsing approach that can handle values containing equals signs:

-        for pair in s.split(',') {
-            let parts: Vec<&str> = pair.split('=').collect();
-            if parts.len() != 2 {
-                return Err(format!("Invalid key-value pair: {}", pair));
-            }
-
-            map.insert(parts[0].trim().to_string(), parts[1].trim().to_string());
+        for pair in s.split(',') {
+            // Find the first equals sign
+            if let Some(idx) = pair.find('=') {
+                let key = pair[..idx].trim().to_string();
+                let value = pair[idx+1..].trim().to_string();
+                map.insert(key, value);
+            } else {
+                return Err(format!("Invalid key-value pair (missing '='): {}", pair));
+            }

95-96: Unnecessary clone on primitive type.

The clone() on a primitive u16 type is unnecessary as it implements Copy.

-                DummyCuebotServer::start_server(report_server_cmd.port.clone()).await
+                DummyCuebotServer::start_server(report_server_cmd.port).await

98-118: Optimize client initialization and refactor for readability.

Several improvements could be made to this section:

  1. Avoid unnecessary cloning
  2. Extract the environment variable handling for better readability
  3. Refactor conditional UID extraction for clarity
-                let client =
-                    DummyRqdClient::build(rqd_client_cmd.hostname.clone(), rqd_client_cmd.port)
-                        .await?;
-                match &rqd_client_cmd.api_method {
-                    ApiMethod::LaunchFrame(launch_frame_cmd) => {
-                        let uid = launch_frame_cmd
-                            .run_as_user
-                            .then(|| users::get_current_uid());
-                        client
-                            .launch_frame(
-                                launch_frame_cmd.cmd.clone(),
-                                launch_frame_cmd
-                                    .env
-                                    .clone()
-                                    .unwrap_or(EnvVars(HashMap::new()))
-                                    .0,
-                                uid,
-                            )
-                            .await
-                    }
+                let client = DummyRqdClient::build(&rqd_client_cmd.hostname, rqd_client_cmd.port)
+                    .await?;
+                match &rqd_client_cmd.api_method {
+                    ApiMethod::LaunchFrame(launch_frame_cmd) => {
+                        // Extract UID if running as user
+                        let uid = if launch_frame_cmd.run_as_user {
+                            Some(users::get_current_uid())
+                        } else {
+                            None
+                        };
+                        
+                        // Prepare environment variables
+                        let env_vars = launch_frame_cmd.env
+                            .as_ref()
+                            .map_or_else(
+                                || HashMap::new(),
+                                |env| env.0.clone()
+                            );
+                            
+                        client
+                            .launch_frame(&launch_frame_cmd.cmd, env_vars, uid)
+                            .await
+                    }

Note: The refactoring assumes the DummyRqdClient::build and launch_frame methods can be updated to accept references where appropriate.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b964e46 and efaff54.

📒 Files selected for processing (4)
  • config/rqd.fake_linux.yaml (2 hunks)
  • crates/dummy-cuebot/src/main.rs (3 hunks)
  • crates/rqd/src/config/config.rs (4 hunks)
  • crates/rqd/src/frame/manager.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • config/rqd.fake_linux.yaml
  • crates/rqd/src/config/config.rs
  • crates/rqd/src/frame/manager.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (4)
crates/dummy-cuebot/src/main.rs (4)

1-1: Updated imports to support new functionality.

The addition of HashMap and FromStr is appropriate for the new environment variables functionality.


23-31: Good simplification of ReportServerCmd port handling.

Changing from Option<u16> to required u16 with a default value removes unnecessary option handling and makes the code more straightforward.


34-53: Improved type safety and simplicity in RqdClientCmd.

Converting port from Option<String> to u16 with a default value improves type safety and removes unnecessary option handling. The hostname field change similarly simplifies the code.


61-67: New LaunchFrameCmd structure provides clear configuration options.

The new structure cleanly encapsulates all the parameters needed for launching a frame, with appropriate documentation on the command line options.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
crates/rqd/src/frame/running_frame.rs (1)

140-186: 🛠️ Refactor suggestion

Consider ensuring snapshot path creation
Lines 168-169 comment “Ensure snapshot path exists,” but no implementation is provided. If the directory is missing, snapshot creation will fail.

Example fix:

 use std::fs;
 ...
 // Ensure snapshot path exists
 fs::create_dir_all(&config.snapshots_path).into_diagnostic()?;
🧹 Nitpick comments (7)
crates/rqd/src/frame/manager.rs (4)

117-164: Consider asynchronous I/O for reading snapshot files.

Performing blocking I/O calls to read large snapshot directories in a potentially async environment can cause performance bottlenecks. Using tokio::fs or an equivalent approach might improve scalability.


166-185: Mixing thread spawn with async environment.

This manual thread creation and capturing panics is acceptable when heavy blocking tasks are required. If more integration with async is desired, consider switching to tokio::spawn_blocking for consistency and centralized runtime management.


190-190: Fix the minor typo in error message.

"Not lauching" should be "Not launching" for better clarity.

-                "Not lauching, frame is already running on this host {}",
+                "Not launching, frame is already running on this host {}",

225-243: Optional logging of hardware state.

Consider logging the current hardware state for easier debugging if frame spawn is blocked because of the host status.

crates/rqd/src/frame/running_frame.rs (3)

34-54: Public fields may reduce encapsulation
Exposing so many fields with pub can reduce encapsulation. If external usage does not strictly require direct field access, consider using getters/setters or restricting visibility.


425-427: run_on_docker is unimplemented
This placeholder remains a todo!(). Let me know if you’d like assistance implementing or if this is intended for a future PR.


429-432: Windows run_inner is unimplemented
Same as docker: currently stubbed out with todo!(). Happy to help provide an implementation or open a new issue if needed.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efaff54 and d27c3a0.

📒 Files selected for processing (5)
  • crates/rqd/src/config/config.rs (7 hunks)
  • crates/rqd/src/config/error.rs (1 hunks)
  • crates/rqd/src/frame/manager.rs (1 hunks)
  • crates/rqd/src/frame/running_frame.rs (21 hunks)
  • crates/rqd/src/servant/rqd_servant.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/rqd/src/config/config.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (33)
crates/rqd/src/config/error.rs (1)

13-15: Well-defined error variant for path issues.

Adding the InvalidPath(String) variant effectively distinguishes misconfiguration scenarios. This helps ensure better error messaging and handling when paths are invalid.

crates/rqd/src/servant/rqd_servant.rs (4)

1-2: No immediate concerns with new imports.

These imports (std::fs and std::path::Path) might be used for I/O operations. They appear safe to include.


32-32: Use of Arc<FrameManager> looks correct.

Replacing the old cache reference with a FrameManager reference centralizes frame handling. This aligns with the design direction of encapsulating frame logic in one place.


39-39: Clear function signature reflecting new field.

Exchanging the running_frame_cache parameter with frame_manager in init is consistent with the new architecture. Ensure that all usage sites were updated accordingly.


83-83: Async spawn call for running frames is well-placed.

Invoking self.frame_manager.spawn(run_frame).await? centralizes frame launch logic. Ensure any resulting error is properly surfaced in logs or user-facing messages.

crates/rqd/src/frame/manager.rs (1)

246-270: Error definitions look comprehensive.

Your FrameManagerError variants neatly map to gRPC statuses. This design clarifies failure cases and ensures consistent handling across the codebase.

crates/rqd/src/frame/running_frame.rs (27)

6-6: New imports look appropriate
All newly introduced imports are consistent with the subsequent code usage. No immediate concerns regarding security, correctness, or performance.

Also applies to: 16-16, 19-19, 21-21, 23-23, 24-24, 26-26, 27-27, 30-30


56-74: RunningState structure changes
The addition of an exit_signal field and usage of #[serde(skip_serializing)] for the thread handle look correct. No glaring concurrency or correctness issues identified.


77-97: FrameStats updates
The structure alterations appear to be straightforward expansions of resource-tracking fields. No issues found.


190-205: Storing launch thread handle
Using a dedicated method to store the join handle is a clear approach for lifecycle management. No further concerns.


206-221: Update exit code and signal
This design neatly centralizes exit state updates. Looks good as implemented.


271-327: run method introduces recovery logic
The logic distinguishing regular run vs. recover mode looks coherent. Error handling is properly logged.


330-423: run_inner method
Spawning the command, capturing logs, and snapshotting post-spawn all align well. Error handling is improved with informative diagnostics.


434-461: spawn_logger concurrency
Creating a background thread to pipe stdout/stderr is well-structured. Properly receiving a signal to terminate ensures it won’t block indefinitely.


463-507: recover_inner logic
Good approach for reattaching logs and waiting until the process completes. Gracefully falls back to assuming a signal-based exit when the exit file cannot be read.


509-519: pid retrieval
Straightforward locking to retrieve internal PID. No concurrency pitfalls spotted here.


521-578: read_exit_file handling
Opening and parsing the file with robust error messages is good. Using exit_code < 128 vs. ≥ 128 logic to detect signals is correct.


580-625: wait method
Polling with kill(..., None) is a reasonable fallback. The 500ms sleep avoids busy looping extremes. Implementation appears sound.


688-720: snapshot & snapshot_path
The serialization approach looks good. As noted earlier in init, ensure the directory is created before writing snapshots to disk.


722-725: clear_snapshot
Simply removing the file after successful run or recovery is concise and safe.


783-790: is_process_running
Using the sysinfo crate to verify an active process is a convenient approach.


849-853: none_value helper
Simple function returning None. No issues.


854-878: serialize_running_state
Skipping the thread handle during serialization is correct to avoid undefined behavior. Nicely handled fallback on poison.


880-889: deserialize_running_state
Wrapping the deserialized RunningState in Arc<Mutex<...>> is nicely done.


901-903: Test imports
Introducing config and logging modules for testing looks consistent with test usage.


967-979: test_run_logs_stdout_stderr
Checks for exit code and verifies stdout/stderr appear in the logger. Straightforward test coverage.


984-992: test_run_failed
Verifies non-zero exit code for a failing command. Nicely done.


996-1019: test_run_multiline_stdout
Ensures multiple lines are captured in order. Also confirms exit file content.


1024-1041: test_run_env_variables
Appropriately checks that environment variables are respected.


1054-1061: test_run_command_not_found
Verifies non-zero exit code for a missing command. Good negative test scenario.


1077-1088: test_run_sleep_command
Ensures a minimal sleep command runs the expected duration. Straightforward.


1093-1116: test_run_interleaved_stdout_stderr
Correctly checks interleaved logs for correctness.


1118-1137: test_run_with_special_characters
Good coverage for environment variables containing special symbols.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
crates/rqd/src/system/linux.rs (1)

541-599: 🛠️ Refactor suggestion

Robust filtering but watch out for partial reservations.
The reserve_cores logic filters available cores effectively, but note that partial reservations might stay allocated if an error interrupts after partially reserving some cores. A transaction-like approach or final rollback mechanism could avoid orphaned reservations.

🧹 Nitpick comments (14)
crates/rqd/src/frame/manager.rs (4)

124-171: Validate parallel recoveries or concurrency.
The recover_snapshots method effectively iterates over snapshots and spawns frames. A failure on one snapshot currently doesn't halt the whole process, which is good; however, if recover_snapshots can be invoked multiple times concurrently, verify that it remains safe under that load.


173-191: Consider tokio spawning for asynchronous scheduling.
spawn_running_frame uses a blocking thread. This is perfectly fine for certain workloads, but if you plan to integrate with a larger async ecosystem, consider using tokio::task::spawn_blocking or an async runtime approach to keep the system responsive.


232-250: Optional: store previous hardware states for troubleshooting.
The machine validation is concise. If debugging frequent or intermittent hardware state transitions, consider logging or storing previous states to help track changes over time.


253-265: Expand error coverage if needed.
The FrameManagerError covers core scenarios well. If you anticipate partial resource allocation or cleaning issues, consider adding specialized variants to detail those cases.

crates/rqd/src/system/linux.rs (1)

459-475: Validate concurrency if multiple threads call reserve_core.
This function correctly reduces available_cores and updates reserved data structures. Ensure external concurrency is well-managed if multiple calls might interfere.

crates/rqd/src/frame/running_frame.rs (9)

36-53: Consider encapsulating new public fields.
Exposing all fields as pub could increase coupling and reduce the maintainability of your code, especially for fields like exit_file_path which might need validation or sanitization. If external read/write access is necessary, consider using getter/setter methods or implementing pub(crate) where possible.


146-162: Potential risk in path generation.
Appending resource_id to paths is functional, but watch for collisions or unexpected special characters from user input. Consider sanitizing or validating the resource ID if it can come from untrusted sources.


271-327: Check error handling flow in run().
When recover_mode is true, any spawn-related errors bubble up, but the code logs the error and returns. Confirm this aligns with higher-level expectations (e.g., does the system handle partial recoveries gracefully?).


330-423: Comprehensive implementation for run_inner.

  1. Good approach to log final exit code.
  2. Consider ensuring that snapshot() success/failure is also logged more distinctly for easier debugging.
  3. Potential improvement: timeouts on child.wait() to avoid indefinite blocking if the process hangs.

425-427: run_on_docker remains unimplemented.
If Docker support is planned, consider returning an explicit error or removing the stub if it's not on the roadmap.


429-432: Windows branch is unimplemented.
Same recommendation as Docker: either implement or remove the placeholder.


434-462: spawn_logger concurrency check
Ensure the log thread terminates properly even if the channel signal is delayed or never sent (e.g., in panic scenarios). Right now it loops until the signal arrives, which is correct in the normal path but can leak the thread and its resources if the signal never comes.
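
One way to bound such a loop, sketched with std's mpsc (a hypothetical shape, not the PR's actual logger), is `recv_timeout` plus a periodic liveness check:

```rust
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical sketch of a bounded logger loop: instead of blocking on
// recv() forever, wake up every 500ms and let a caller-provided check
// decide whether the watched process is still worth logging for.
fn logger_loop(done_rx: mpsc::Receiver<()>, mut still_needed: impl FnMut() -> bool) {
    loop {
        match done_rx.recv_timeout(Duration::from_millis(500)) {
            // Explicit stop signal, or the sender side was dropped.
            Ok(()) | Err(mpsc::RecvTimeoutError::Disconnected) => break,
            Err(mpsc::RecvTimeoutError::Timeout) => {
                if !still_needed() {
                    break; // e.g. the frame process died and never signaled
                }
            }
        }
    }
}

fn main() {
    // Sender dropped immediately: the loop exits on Disconnected.
    let (tx, rx) = mpsc::channel::<()>();
    drop(tx);
    logger_loop(rx, || true);
    println!("ok");
}
```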


580-625: Polling-based wait might block indefinitely.
If a process fails to exit, thread::sleep(..) loop runs forever. Consider a timeout or user-configurable max wait period.


688-726: Snapshot logic improvement.

  1. snapshot_path(): rely on the same directory creation approach to prevent I/O errors.
  2. snapshot(): consider versioning or controlling snapshot size.
  3. clear_snapshot(): good to remove ephemeral data, but confirm concurrency safety if multiple frames share the same path naming scheme.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d27c3a0 and 0adb563.

📒 Files selected for processing (4)
  • crates/rqd/src/frame/manager.rs (1 hunks)
  • crates/rqd/src/frame/running_frame.rs (21 hunks)
  • crates/rqd/src/system/linux.rs (13 hunks)
  • crates/rqd/src/system/machine.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (36)
crates/rqd/src/frame/manager.rs (3)

14-18: Struct looks clean and succinct.
The FrameManager struct neatly declares the required fields for configuration, frame caching, and host interaction. This design is clear and maintainable.


193-230: Input checks are well-handled.
This method provides clear validations for frames, including non-root enforcement and fractional core rejections. The structure is straightforward and helps avoid invalid states early on.


267-276: Error-to-status translation looks correct.
Mapping custom errors to gRPC status codes is comprehensive and aligns with typical usage of tonic::Status.

crates/rqd/src/system/linux.rs (6)

2-59: Minor import and struct updates.
Changes in import statements and supporting code look fine, reflecting new types and references without apparent issues.

Also applies to: 867-873


883-894: Test approach is straightforward.
test_reserve_cores_success verifies correct allocation. The usage is concise and ensures basic coverage.


897-907: Good negative test coverage.
test_reserve_cores_insufficient_resources properly confirms that the function rejects requests surpassing available resources.


910-919: Full allocation confirmed.
test_reserve_cores_all_available ensures the function handles exact capacity requests. Coverage is appropriate.


922-942: Affinity distribution validated.
test_reserve_cores_affinity checks that selected cores are grouped. This is helpful to confirm the intended scheduling logic.


945-964: Sequential reservations tested thoroughly.
test_reserve_cores_sequential_reservations ensures multiple calls to reserve_cores behave as expected. This is a solid scenario test.

crates/rqd/src/system/machine.rs (5)

4-4: Import additions look good.
The added imports for tracing, UUID, and time utilities align with the new resource reservation logic and logging needs.

Also applies to: 16-17, 21-21, 25-25


181-201: Machine trait implementation: hardware reporting is concise.
The hardware_state method returning an optional state from RenderHost is a neat approach. No major issues are apparent.


220-226: Frame ID usage ensures resource tracking.
reserve_cpus now accepts frame_id, helping tie reservations to specific frames. This is a solid improvement for debugging and accountability.


227-241: Release loop handles partial errors gracefully.
Releasing each core individually logs errors without blocking subsequent releases. This approach helps avoid leftover leaks whenever partial releases fail.


354-382: CoreReservation encapsulation is beneficial.
Storing reserver_id and maintaining a start_time can facilitate advanced scheduling or cleanup tasks. The methods insert and remove are straightforward.

crates/rqd/src/frame/running_frame.rs (22)

6-6: No concerns with newly added imports.
They correctly bring in needed libraries (e.g., sysinfo, serde, miette) without extraneous dependencies.

Also applies to: 16-16, 19-19, 21-21, 23-24, 26-27, 30-30


60-63: Confirm concurrency handling for launch_thread_handle.
Marking the launch_thread_handle as skipped in serialization is appropriate. However, ensure that concurrency scenarios are handled (e.g., if the handle is never joined, leftover threads might continue running).


77-97: Looks good for FrameStats serialization.
Making FrameStats serializable is useful for snapshotting. Verify that large output from GPU or memory stats does not bloat snapshot files.


140-140: Validate resource_id usage.
The new resource_id field is read from the request but not validated. Ensure the user input is trusted or safely sanitized before using it in file paths.


190-198: Good documentation for thread handle.
Clear docstring describing usage of the launch thread handle. No functional issues spotted.


206-214: Update exit code & signal with caution.
Ensure this is called only when the process has truly exited, and confirm concurrency is avoided if multiple threads might call it.


220-220: No issue with setting the exit signal.
Storing the signal can be valuable for debugging abrupt terminations.


262-262: No concerns with static fallback PATH.
This is a reasonable fallback for non-Windows systems.


463-507: Recovery logic is robust, but watch for partial states.

  1. If the process unexpectedly died before RQD restarted, we handle exit code from the file or default to (1, Some(143)). Consider more granular checks if the process died with a different signal.
  2. Make sure logs reflect partial or incomplete states for better debugging.

509-519: PID accessor is straightforward.
No major issues. Just ensure it's not used for active process checks without verifying the process is alive (i.e. is_process_running).


521-579: Robust read_exit_file.

Observations

  • Parsing exit codes >=128 as signal-based is standard.
  • Watch for negative or extremely large values in the file that might indicate corruption.

727-788: [Duplicate of previous PID reuse concern].
PID reuse remains a potential bug if the OS cycles through PIDs quickly. The code checks only whether the PID is alive, not whether it’s the same process. Consider storing start time or another unique token in the snapshot to confirm the process identity.
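
The start-time token suggested here is available on Linux as field 22 of `/proc/<pid>/stat`; a sketch of the check, assuming the snapshot would store this value alongside the PID:

```rust
use std::fs;

// Sketch: read a process's starttime (clock ticks since boot, field 22
// of /proc/<pid>/stat on Linux). Stored in the snapshot next to the PID,
// it distinguishes the original frame process from a recycled PID.
fn proc_start_time(pid: u32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    // Skip past the parenthesized command name; the remaining fields
    // start at field 3 (state), so starttime (field 22) is the 20th here.
    let after_comm = stat.rsplit(')').next()?;
    after_comm.split_whitespace().nth(19)?.parse().ok()
}

fn same_process(pid: u32, recorded_start: u64) -> bool {
    proc_start_time(pid) == Some(recorded_start)
}

fn main() {
    let me = std::process::id();
    let start = proc_start_time(me).expect("own /proc entry should exist");
    assert!(same_process(me, start));
    assert!(!same_process(me, start.wrapping_add(1)));
    println!("ok");
}
```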


855-899: Custom serialization for Arc<Mutex<RunningState>>.
Skipping the thread handle ensures safe (de)serialization. Confirm that ignoring the thread handle on restore is always acceptable (i.e., no critical logic is lost).


907-910: New test imports appear valid.
No issues; they match the newly added tests.


977-984: Enhanced check for exit status in test_run_logs_stdout_stderr.
Validates both normal exit and correct logging. Good coverage.


998-998: test_run_failed checks exit code (1).
Verifies a failing command returns 1. This is an important negative test.


1016-1025: test_run_multiline_stdout:
Nice check of multiple lines plus verifying the exit code file.


1042-1047: test_run_env_variables:
Asserting environment variables are echoed. Confirms the environment setup logic.


1060-1066: test_run_command_not_found:
Confirms a non-zero exit code for invalid commands. Good negative-path test.


1083-1093: test_run_sleep_command:
Verifies timing, ensures the command is indeed run. Nice coverage.


1110-1122: test_run_interleaved_stdout_stderr:
Checks interleaved output is captured in logs. Great stress test for concurrency in logging.


1136-1143: test_run_with_special_characters:
Useful test to confirm environment variable handling. No concerns.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
crates/rqd/src/frame/running_frame.rs (4)

269-325: run() method logic is sound.
The option to recover or launch a fresh frame, and then clear snapshots if the frame was successfully spawned, is well-structured. A minor enhancement could be to snapshot again post-completion for debugging, but this is not critical.


328-421: Consider adding a configurable timeout in run_inner().
While the current loop design is sufficient for many scenarios, introducing a way to automatically terminate or time out a job can prevent indefinite resource usage. Otherwise, this logic—updating the PID, spawning the child, saving snapshots, logging, and retrieving the exit status—looks correct.


423-425: run_on_docker is unimplemented.
If you’d like help completing this Docker integration or want to track it, feel free to open an issue.


427-431: Windows runner is unimplemented.
Similar to Docker, if you plan on supporting Windows, consider opening a follow-up issue or PR to flesh out this method.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0adb563 and 2daa810.

📒 Files selected for processing (1)
  • crates/rqd/src/frame/running_frame.rs (21 hunks)
🔇 Additional comments (25)
crates/rqd/src/frame/running_frame.rs (25)

6-6: All newly added import statements look good.
These additions (e.g., ExitStatusExt, mpsc, tracing macros, Serde traits, etc.) are necessary for the enhanced process-management and serialization logic.

Also applies to: 16-16, 19-19, 21-21, 23-23, 24-24, 26-26, 27-27, 30-30


56-75: Serialization of RunningState looks correct.
The addition of exit_signal: Option<i32> and skipping the thread handle during serialization is a neat approach. This ensures that any runtime-specific data (e.g. JoinHandle) is not inadvertently persisted.


77-97: FrameStats serialization is fine.
Exposing GPU/CPU stats fields for post-run analysis or recovery makes sense. The changes look consistent and safe.


140-183: Ensure output directories exist for log and exit files.
These lines introduce exit_file_path (and also refine raw_stdout_path/raw_stderr_path) but do not guarantee that the request.log_dir directory is present. This can lead to runtime errors if the directory is missing. Consider calling std::fs::create_dir_all(...) before file creation.


188-203: Addition of update_launch_thread_handle is well-documented.
Storing the JoinHandle for monitoring or later joins is a clean approach. No concerns here.


204-219: Exit code & signal updates are correctly implemented.
The docstring is also clear on usage. This straightforward approach properly encapsulates termination details.


432-459: Logger thread spawning approach looks good.
Using a channel to signal completion and safely remove raw log files prevents leftover artifacts.


461-505: Recovery logic is straightforward and consistent.
Falling back to (1, Some(143)) (SIGTERM) if reading the exit file fails is a decent default; it signals the frame was likely killed.


507-517: pid() accessor implementation is fine.
The lock-based retrieval ensures thread-safe access to the stored PID.


519-576: Validate assumption for exit codes ≥128 as signals.
POSIX shells typically encode a fatal signal N as exit code 128 + N. However, some processes may return codes above 128 for unrelated reasons. If that possibility is relevant, consider clarifying or adjusting the logic.
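A minimal sketch of the heuristic under discussion (this is not the reviewed implementation, just the 128 + N convention made explicit): 143 decodes to SIGTERM (15) and 137 to SIGKILL (9), but as the comment notes, the decoding is a convention, not a guarantee.

```rust
// Treat exit statuses in the conventional signal range (129..=192)
// as "killed by signal (status - 128)"; everything else is a plain
// exit code. This is a heuristic: a process may legitimately exit
// with a code above 128.
fn decode_exit(status: i32) -> (i32, Option<i32>) {
    if status > 128 && status <= 128 + 64 {
        (status, Some(status - 128))
    } else {
        (status, None)
    }
}

fn main() {
    assert_eq!(decode_exit(0), (0, None));
    assert_eq!(decode_exit(1), (1, None));
    assert_eq!(decode_exit(143), (143, Some(15))); // SIGTERM
    assert_eq!(decode_exit(137), (137, Some(9))); // SIGKILL
    println!("signal decoding heuristic behaves as expected");
}
```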


593-623: Polling with kill(nix_pid, None) is acceptable.
This loop with a 500ms sleep is a standard approach for checking process liveness. No issues found.
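A std-only stand-in for the same polling loop, assuming Linux: checking `/proc/<pid>` approximates `kill(pid, 0)` (which the reviewed code issues through nix) since the directory exists exactly while the process, or its zombie, is around. Function names here are illustrative.

```rust
use std::path::Path;
use std::thread;
use std::time::Duration;

// Linux-only liveness check via procfs, approximating kill(pid, 0).
fn is_alive(pid: u32) -> bool {
    Path::new(&format!("/proc/{pid}")).exists()
}

// Poll until the process disappears or the attempt budget runs out,
// mirroring the reviewed 500ms sleep loop.
fn wait_for_exit(pid: u32, max_attempts: u32) -> bool {
    for _ in 0..max_attempts {
        if !is_alive(pid) {
            return true;
        }
        thread::sleep(Duration::from_millis(500));
    }
    false
}

fn main() {
    let own_pid = std::process::id();
    assert!(is_alive(own_pid)); // we are certainly running
    assert!(!wait_for_exit(own_pid, 1)); // and won't exit while polling ourselves
    println!("pid {own_pid} is alive");
}
```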


686-723: Directory creation for snapshots remains unaddressed.
The snapshot path is created via File::create but never ensures that self.config.snapshots_path exists. As previously mentioned, use create_dir_all(&self.config.snapshots_path) or an equivalent solution.


725-785: PID uniqueness caveat is documented.
You have a TODO for verifying PID reuse in from_snapshot, which acknowledges the past review comment. Checking the process start time before reattaching can mitigate incorrect reattachments when a PID has been recycled.


787-794: is_process_running logic is straightforward.
Relying on sysinfo to query the presence of the PID is a clean approach.


853-895: Custom (de)serialization for Arc<Mutex<RunningState>> is well-handled.
Skipping the thread handle during serialization prevents runtime handle leaks. Good use of #[serde(skip_...)].


905-908: New test imports are correct.
They support the refined test suite for reading exit files, environment variables, etc.


919-921: Config setup in tests is fine.
This ensures that RunnerConfig can be customized for each test scenario.


975-982: Test verifies reading exit file for successful runs.
These assertions confirm correctness of the new exit-file logic.


989-998: test_run_failed covers non-zero exit codes.
Testing a failure scenario ensures robust exit-code handling.


1003-1023: test_run_multiline_stdout is thorough.
Verifying multiple lines and matching the exit file’s status covers typical console output behaviors.


1038-1045: test_run_env_variables checks environment correctness.
Ensuring environment variable injection is validated helps confidence in production usage.


1058-1063: test_run_command_not_found ensures non-zero exit code.
Good negative test scenario to confirm error handling.


1069-1092: test_run_sleep_command verifies timing logic.
Assessing command duration is a nice addition. No concerns.


1108-1120: test_run_interleaved_stdout_stderr ensures correct ordering.
Confirming both streams show up as expected is a great coverage scenario.


1134-1141: test_run_with_special_characters expands coverage.
Verifying unusual characters in environment variables helps prevent unexpected parsing failures.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
crates/rqd/src/frame/manager.rs (1)

21-123: 🛠️ Refactor suggestion

Refined spawn logic.

  1. Good check for nimby locks and hardware state.
  2. Clean user creation flow with meaningful error messages.
  3. Resource rollback is only partially handled: GPU-allocation failures are caught, but a CPU reservation made before line 62 is not released if a later step fails, risking a resource leak.
 // Pseudocode to unify all resource rollbacks if something fails:
 let cpu_list = // ...
 let gpu_list = match ... {
     Err(err) => {
+        if let Some(procs) = &cpu_list {
+            self.machine.release_cpus(procs).await;
+        }
         return Err(err);
     }
     Ok(gpu_list) => gpu_list
 };
 ...
 // For any other subsequent error, also revert CPU/GPU reservations.
🧹 Nitpick comments (5)
crates/rqd/src/main.rs (2)

81-84: Consider stricter error handling for snapshot recovery.
Currently only logs a warning if snapshot recovery fails. If frame snapshots are critical for reliability, re-consider whether to propagate an error or handle partial failures in more detail.


92-97: Helpful documentation on required capabilities.
This comment block clarifies needed capabilities for launching processes in separate groups. If there's a narrower set of capabilities for your environment, consider updating them to follow the principle of least privilege.

crates/rqd/src/frame/manager.rs (1)

191-210: Threaded frame spawning.

  1. The separate thread is ideal for blocking work.
  2. Consider whether to use tokio::task::spawn_blocking to integrate with the runtime, enabling more flexible scheduling.
crates/rqd/src/system/linux.rs (1)

459-475: Improved core reservation logic.

  1. The function checks available_cores before insertion.
  2. Decrements cpu_stat.available_cores.
  3. Recommend adding logs on successful reservations for better traceability.
 fn reserve_core(...){
     if self.cpu_stat.available_cores <= 0 {
         ...
     }
+    tracing::debug!("Reserving core {core_id} on physical socket {phys_id}");

     ...
 }
crates/rqd/src/system/machine.rs (1)

379-407: CoreReservation struct.

  1. Storing a start_time can help with reporting or time-based cleanup.
  2. Using HashSet<u32> ensures no duplication.
  3. Consider adding an optional “count” or “capacity” if future memory usage is a concern for large clusters.
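A rough sketch of a reservation record along the lines the review describes; the field and key names are assumptions (in rqd the owner key would be a `Uuid`), but the shape shows why `HashSet<u32>` plus a `start_time` covers both deduplication and time-based cleanup.

```rust
use std::collections::{HashMap, HashSet};
use std::time::Instant;

// Hypothetical CoreReservation: a set of core ids owned by one
// resource, stamped with a start time for reporting or cleanup.
struct CoreReservation {
    core_ids: HashSet<u32>,
    start_time: Instant,
}

fn main() {
    let mut reservations: HashMap<&str, CoreReservation> = HashMap::new();
    reservations.insert(
        "frame-a", // in rqd this key would be a Uuid resource id
        CoreReservation {
            core_ids: HashSet::from([0, 1, 2]),
            start_time: Instant::now(),
        },
    );

    let r = &reservations["frame-a"];
    // HashSet guarantees each core id appears at most once.
    assert_eq!(r.core_ids.len(), 3);
    assert!(r.start_time.elapsed().as_secs() < 60);
    println!("reserved cores: {:?}", r.core_ids);
}
```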
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2daa810 and d1fb300.

📒 Files selected for processing (4)
  • crates/rqd/src/frame/manager.rs (1 hunks)
  • crates/rqd/src/main.rs (3 hunks)
  • crates/rqd/src/system/linux.rs (10 hunks)
  • crates/rqd/src/system/machine.rs (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: build
🔇 Additional comments (25)
crates/rqd/src/main.rs (6)

7-7: Looks good importing FrameManager and RunningFrameCache.
No issues found.


11-12: Consistent logging approach.
Importing MachineMonitor and warn aligns well with the tracing usage throughout the file.


16-16: Organizing frames into a dedicated module.
Declaring mod frame; keeps frame-related logic clearly separated.


44-44: Initialization clarity.
RunningFrameCache::init() does not appear to return an error, so no further error handling is needed. Good to keep it simple.


67-74: Initialization of FrameManager.
This is a concise approach for creating and storing the manager in an Arc. Ensure that shared references (e.g., mm_clone) won't lead to circular references if frames can indirectly hold references back to machine_monitor.

Please confirm no cyclical strong references exist by checking your references in the FrameManager and MachineMonitor. If uncertain, you can run a codebase search for Arc::clone usage outside the lines shown. Let me know if you’d like me to generate a script to verify this.


87-87: Clean handover to gRPC servant.
Passing frame_manager into servant::serve unifies your new frame-management approach with the existing servant structure.

crates/rqd/src/frame/manager.rs (5)

15-19: Clear struct design for FrameManager.
The fields config, frame_cache, and machine are well-defined. Ensure that external concurrency or state updates on these shared references won’t introduce race conditions in the manager methods.

If you want to confirm that concurrency usage is safe, consider running an automated concurrency analysis. Let me know if you’d like a script to do so.


125-190: Robust snapshot recovery.

  1. The code properly filters and recovers valid .bin snapshots.
  2. If an error arises (e.g., CPU reservation), it logs and accumulates them in errors.
  3. The function continues to recover the remaining frames, which is typically desirable for partial success.

211-248: Validation checks.

  1. Thorough checks for duplication, root UID, and fractional cores.
  2. The code uses early returns to bail out quickly, which increases readability.

250-268: Machine state validations.
The hardware state and nimby checks are consistent with the rest of the logic. Nimby-locked frames are prevented unless ignore_nimby is explicitly set.


271-295: Well-defined error enum.
Mapping each variant to a relevant tonic::Status ensures consistent gRPC error responses.

crates/rqd/src/system/linux.rs (5)

11-11: Importing miette and re-exporting errors.
Brings better diagnostic info to system-level operations.


18-20: Extended usage from super::machine.
Now referencing CoreReservation and other specialized structs supports your expanded CPU reservation logic.


477-513: Efficient core availability calculation.

  1. Excellent usage of filter_map and sorted_by to list the largest available sockets first.
  2. Consider adding logs that mention how many cores remain unreserved per socket for debugging.
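A toy version of that availability calculation, with made-up socket layout and counts (the real code works over richer CPU-stat structures): sockets with no free cores are filtered out, and the rest are sorted largest-first, matching the reviewed `filter_map`/`sorted_by` usage.

```rust
// Given reserved-core counts per socket, list sockets with free
// cores, largest first.
fn sockets_by_availability(
    cores_per_socket: u32,
    reserved: &[(u32, u32)], // (socket_id, reserved_count)
) -> Vec<(u32, u32)> {
    let mut available: Vec<(u32, u32)> = reserved
        .iter()
        .filter_map(|&(socket, used)| {
            let free = cores_per_socket.saturating_sub(used);
            (free > 0).then_some((socket, free))
        })
        .collect();
    // Largest available socket first.
    available.sort_by(|a, b| b.1.cmp(&a.1));
    available
}

fn main() {
    let out = sockets_by_availability(8, &[(0, 8), (1, 3), (2, 6)]);
    // Socket 0 is full and dropped; socket 1 (5 free) outranks socket 2 (2 free).
    assert_eq!(out, vec![(1, 5), (2, 2)]);
    println!("{out:?}");
}
```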

579-613: Bulk core reservation.

  1. The check at line 580 quickly returns if there aren’t enough resources, preventing partial reservations.
  2. The approach to flatten available cores and pick the first count is straightforward, but it might not always produce an optimal distribution. That is typically acceptable unless there's a performance reason to prefer a different strategy.

615-640: Selective reservation by ID.

  1. Reusing calculate_available_cores() is consistent.
  2. The loop attempts to reserve each requested core, ignoring nonexistent or already-reserved cores.
  3. The final Ok(result) is straightforward.
crates/rqd/src/system/machine.rs (9)

4-4: Introduced time::Instant.
This will help measure how long cores remain allocated.


16-17: Extended tracing imports and Uuid.
Enables granular logging and resource-specific identification.


21-21: Using frame::cache::RunningFrameCache.
Allows the MachineMonitor direct access to the list of running frames, bridging LRU or other caching strategies.


181-189: MachineMonitor startup logic.
These lines revolve around initial host_state retrieval and broadcast. The design effectively sets up the initial conditions for new frames and delegates repeated checks to the main loop.


223-228: Machine trait reserve_cores adaptation.
The signature now takes the new arguments (num_cores, resource_id), bridging the monitor to the system controller.


230-239: reserve_cores_by_id bridging.
Mirrors the same approach for selective CPU allocation. Solid for reattaching to partially tracked frames.


241-254: Iterative release of cores.

  1. Each core is released individually, ignoring NotFoundError with a warning.
  2. This approach ensures partial release success even if some cores were not recognized.
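The partial-release behavior can be sketched like this (names and the warning mechanism are illustrative; the real code logs a NotFoundError through tracing): each core is released individually, and unknown cores are skipped with a warning rather than aborting the whole release.

```rust
use std::collections::HashSet;

// Release each core individually; warn on unrecognized cores and
// continue, so recognized cores are still freed. Returns how many
// cores were actually released.
fn release_cores(reserved: &mut HashSet<u32>, to_release: &[u32]) -> usize {
    let mut released = 0;
    for core in to_release {
        if reserved.remove(core) {
            released += 1;
        } else {
            // Equivalent of logging a NotFoundError warning and moving on.
            eprintln!("warn: core {core} was not reserved, skipping");
        }
    }
    released
}

fn main() {
    let mut reserved: HashSet<u32> = HashSet::from([0, 1, 2]);
    // Core 7 was never reserved; the release still succeeds partially.
    let released = release_cores(&mut reserved, &[1, 7, 2]);
    assert_eq!(released, 2);
    assert_eq!(reserved, HashSet::from([0]));
    println!("released {released} cores, {} still reserved", reserved.len());
}
```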

343-355: Trait-defined core reservation.
Using Uuid as the resource identifier is a clean way to track the entity that owns each reservation. This extension updates the contract for all implementers.


375-375: Switched to CoreReservation in CpuStat.
Organizing reserved cores with unique owners is beneficial for better debugging and future expansions (like timestamps).

@DiegoTavares DiegoTavares merged commit b9e7461 into main Mar 13, 2025
1 of 2 checks passed