Snapshot frame #6
Move frame control logic from rqd_servant to manager
When rqd restarts, all currently running frames become parented by init/systemd, so the recovery process is no longer able to bind to them and follow their execution directly. To achieve similar behaviour in recovery mode, running_frame polls its PID, waits for it to finish, and recovers the exit status and signal from a file. To make that possible, all frame commands are now appended with logic that saves their exit status to a file.
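The exit-status wrapping described above can be sketched as follows. This is a hedged illustration, not rqd's actual FrameCmdBuilder logic: the helper name, the shell snippet, and the file path are all hypothetical.

```rust
use std::process::Command;

// Hypothetical sketch of the wrapping idea: append shell logic to a frame
// command so its exit code is persisted to a file before the shell exits.
// `wrap_with_exit_file` is an illustrative helper, not rqd's real API.
fn wrap_with_exit_file(cmd: &str, exit_file: &str) -> String {
    format!("{cmd}; code=$?; echo $code > {exit_file}; exit $code")
}

fn main() {
    let exit_file = std::env::temp_dir().join("frame_exit_demo");
    let exit_file = exit_file.to_str().unwrap().to_string();
    // `sh -c 'exit 3'` stands in for a frame command that fails with code 3.
    let wrapped = wrap_with_exit_file("sh -c 'exit 3'", &exit_file);
    let status = Command::new("sh").arg("-c").arg(&wrapped).status().unwrap();
    let saved = std::fs::read_to_string(&exit_file).unwrap();
    // The wrapper both propagates and persists the exit code.
    assert_eq!(status.code(), Some(3));
    assert_eq!(saved.trim(), "3");
    println!("saved exit status: {}", saved.trim());
}
```

Because the persisted file survives a daemon restart, a recovering process only needs the file path (kept in the frame's snapshot) to learn how the frame ended.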
Caution: Review failed. The pull request is closed.

Walkthrough
The changes update various Cargo.toml files to specify a workspace resolver and adjust dependencies, update configuration values, and restructure command-line handling across multiple crates. In the dummy-cuebot crate, new structures and modules are introduced to handle server and client commands with enhanced argument parsing and environment variable management. The opencue-proto crate now supports Serde serialization via added dependencies and derive macros. In the rqd crate, a new frame management system is integrated: a dedicated FrameManager replaces direct cache usage, with modifications to frame command construction, logging, snapshot recovery, and overall servant initialization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as DummyCuebotCli
    participant Client as DummyRqdClient
    participant Manager as FrameManager
    participant Server as RqdServant
    CLI->>Client: Parse & prepare launch command
    Client->>Manager: spawn(run_frame) request
    Manager->>Server: Initiate frame launch in a new thread
    Server-->>Manager: Acknowledge launch
    Manager-->>Client: Return launch status
```

```mermaid
sequenceDiagram
    participant Manager as FrameManager
    participant SnapshotDir as Snapshots Directory
    participant Cache as RunningFrameCache
    Manager->>SnapshotDir: Read snapshot files
    SnapshotDir-->>Manager: Provide valid snapshots
    Manager->>Cache: Recover & insert running frames
    Cache-->>Manager: Confirm recovery
```
Actionable comments posted: 3
🔭 Outside diff range comments (1)
crates/rqd/src/servant/running_frame_servant.rs (1)
24-37
: ⚠️ Potential issue: Incomplete implementation of RunningFrame trait methods.
The `kill` and `status` methods are still marked as `todo!()`. These methods need to be implemented for the snapshot recovery functionality to work correctly. They are critical for the frame management functionality described in the PR objectives; please implement them before merging this PR to ensure the snapshot mechanism works properly.
🧹 Nitpick comments (12)
crates/rqd/src/main.rs (1)
89-94
: Helpful system capability documentation. These comments provide valuable information about the Linux capabilities required for process group management, which is essential for the frame recovery mechanism described in the PR objectives.
Consider moving these comments to a more appropriate location, such as documentation files or code documentation closer to where these capabilities are used.
crates/dummy-cuebot/src/rqd_client.rs (2)
16-27
: Consider parallelizing or pooling gRPC connections for better concurrency. Currently, all requests share a single `Arc<Mutex<RqdInterfaceClient<Channel>>>`, which can become a bottleneck if multiple frames are launched concurrently. You may consider using a connection pool or creating ephemeral clients per request to improve scalability.
43-44
: Make the fallback username configurable. Using `daemon` as the default username is acceptable in many cases, but it might be more flexible to let users configure this fallback, especially if environment variables are unavailable.
crates/rqd/src/frame/cache.rs (1)
25-55
: Be mindful of performance when replicating large caches. `into_running_frame_vec()` creates a full clone of the cached data, which can be expensive for large workloads. If this method is called frequently or with big data sets, consider an incremental, streaming, or lazy approach to reduce overhead.
crates/rqd/src/frame/manager.rs (3)
38-42
: Consider using an async runtime for concurrency rather than manual threads. `std::thread::spawn` is effective for isolating frame logic, but using `tokio::spawn` or `spawn_blocking` might integrate better with async patterns, especially if you need to monitor or await tasks later.
171-176
: Ensure panics are appropriately surfaced or rethrown. Catching and logging panics helps prevent crashing the entire process, but you might still wish to surface them to higher-level contexts, e.g., to mark frames as "failed" or trigger further cleanup.
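One way to surface such panics is a channel back to the spawner, sketched here with std only. The helper name `spawn_frame_task` is hypothetical and the real manager presumably uses its own error types:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

// Run frame logic on a dedicated thread, catch any panic, and forward the
// outcome over a channel so the caller can mark the frame as failed instead
// of the panic being only logged.
fn spawn_frame_task<F>(task: F) -> mpsc::Receiver<Result<(), String>>
where
    F: FnOnce() + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let outcome = catch_unwind(AssertUnwindSafe(task)).map_err(|e| {
            // Recover the panic message when it was a &str payload.
            match e.downcast_ref::<&str>() {
                Some(msg) => msg.to_string(),
                None => "frame task panicked".to_string(),
            }
        });
        let _ = tx.send(outcome);
    });
    rx
}

fn main() {
    let rx = spawn_frame_task(|| panic!("simulated frame failure"));
    let outcome = rx.recv().unwrap();
    assert!(outcome.is_err());
    println!("frame outcome: {:?}", outcome);
}
```

The receiver end can then translate an `Err` into whatever "frame failed" state the manager tracks.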
245-260
: Expand error representation if needed. Current error variants cover essential scenarios, but you may need more nuanced errors for advanced debugging or partial failure cases. Extending `FrameManagerError` can improve traceability and user feedback.
crates/rqd/src/servant/rqd_servant.rs (1)
41-43
: Panic on setup failure.
Panicking immediately may be acceptable for a core daemon, but consider returning an error to allow graceful shutdown or retries if partial context is available.
crates/rqd/src/frame/running_frame.rs (4)
34-46
: `RunningFrame` serialized & fields made public. Serialization is needed for snapshots, but exposing many fields publicly can risk misuse. Consider documented getters/setters if further encapsulation is needed.
423-425
: Docker run method is `todo!()`. Implementation is pending. Would you like help drafting the Docker-based runner logic or opening an issue to track it?
432-459
: `spawn_logger` with a separate thread monitoring raw outputs. This approach works, but you might consider an internal buffer or advanced log aggregator if performance or line ordering is critical.
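The separate-thread logger pattern can be sketched with std only. This is a hedged illustration: `spawn_logger` here is a hypothetical simplification, not rqd's actual implementation, which presumably reads a child process's pipes rather than an in-memory reader.

```rust
use std::io::{BufRead, BufReader, Read};
use std::sync::mpsc;
use std::thread;

// Read a raw output stream line by line on a dedicated thread and forward
// each line over a channel, so the receiving side controls ordering and
// where the lines are ultimately written.
fn spawn_logger<R: Read + Send + 'static>(out: R) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for line in BufReader::new(out).lines().flatten() {
            if tx.send(line).is_err() {
                break; // receiver dropped; stop logging
            }
        }
    });
    rx
}

fn main() {
    // An in-memory reader stands in for a child's stdout pipe.
    let data = std::io::Cursor::new(b"frame started\nframe done\n".to_vec());
    let rx = spawn_logger(data);
    let lines: Vec<String> = rx.into_iter().collect();
    assert_eq!(lines, vec!["frame started", "frame done"]);
}
```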
686-705
: `snapshot_path` creation. Throws an error if the PID is missing. This is correct, but watch out for race conditions if the PID is updated after partial initialization.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (23)
- Cargo.toml (1 hunks)
- config/rqd.fake_linux.yaml (2 hunks)
- crates/dummy-cuebot/Cargo.toml (1 hunks)
- crates/dummy-cuebot/src/main.rs (3 hunks)
- crates/dummy-cuebot/src/report_servant.rs (2 hunks)
- crates/dummy-cuebot/src/rqd_client.rs (1 hunks)
- crates/opencue-proto/Cargo.toml (1 hunks)
- crates/opencue-proto/build.rs (1 hunks)
- crates/opencue-proto/src/lib.rs (1 hunks)
- crates/rqd/Cargo.toml (2 hunks)
- crates/rqd/src/config/config.rs (3 hunks)
- crates/rqd/src/frame/cache.rs (1 hunks)
- crates/rqd/src/frame/frame_cmd.rs (3 hunks)
- crates/rqd/src/frame/logging.rs (1 hunks)
- crates/rqd/src/frame/manager.rs (1 hunks)
- crates/rqd/src/frame/mod.rs (1 hunks)
- crates/rqd/src/frame/running_frame.rs (20 hunks)
- crates/rqd/src/main.rs (3 hunks)
- crates/rqd/src/servant/mod.rs (2 hunks)
- crates/rqd/src/servant/rqd_servant.rs (3 hunks)
- crates/rqd/src/servant/running_frame_servant.rs (1 hunks)
- crates/rqd/src/system/machine.rs (1 hunks)
- crates/rqd/src/system/mod.rs (0 hunks)
💤 Files with no reviewable changes (1)
- crates/rqd/src/system/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (74)
Cargo.toml (1)
7-7
:✅ Verification successful
Verify the resolver version
The Cargo.toml specifies `resolver = "3"`, but as far as I'm aware, Cargo only supports resolver versions "1" and "2" in the current stable releases. Similarly, `edition = "2024"` on line 6 refers to a future Rust edition that hasn't been released yet (current is 2021).
🌐 Web query:
What are the valid Cargo resolver versions?
💡 Result:
Cargo's resolver versions determine how dependencies and features are handled during the build process. There are three resolver versions available, each with distinct behaviors and compatibility requirements:
Valid Resolver Versions
Resolver "1" (default for Rust editions 2015/2018):
- Merges features across dependencies
- Original behavior with simpler feature unification
- MSRV: Works with Rust 1.51+ [1][2]
Resolver "2" (default for Rust 2021 edition):
- Avoids feature unification in specific cases:
- Platform-specific dependencies for non-target builds
- Separation between normal/built-dependencies and dev-dependencies
- Requires explicit workspace declaration in virtual workspaces [1][2][4]
Resolver "3" (default for Rust 2024 edition, requires Rust 1.84+):
- Changes default behavior for `resolver.incompatible-rust-versions`
- Falls back to MSRV-compatible versions when newer dependencies require a Rust version exceeding your MSRV [1][3]
Key Implementation Details
- Specified via `resolver = "N"` in `Cargo.toml` or the workspace definition
- Workspace-level setting affecting all members
- Later editions automatically opt-in to newer resolvers (e.g., 2021 → resolver 2)
- MSRV considerations become more systematic with resolver 3 [1][3]
The resolver version impacts dependency resolution strategies and feature handling, with newer versions providing more precise control at the cost of potential duplicated builds when features diverge.
Citations:
- 1: https://doc.rust-lang.org/cargo/reference/resolver.html
- 2: https://doc.rust-lang.org/edition-guide/rust-2021/default-cargo-resolver.html
- 3: https://rust-lang.github.io/rfcs/3537-msrv-resolver.html
- 4: https://rustwiki.org/en/cargo/reference/features.html
- 5: Support different versions of the same dependency within a workspace rust-lang/cargo#13594
- 6: https://crates.io/crates/resolver
Verified: Cargo resolver and edition versions are valid
The Cargo.toml entries using `resolver = "3"` and `edition = "2024"` are now supported. Resolver version 3 is valid for Rust 2024 (requiring Rust 1.84+), so these settings are appropriate if your toolchain meets that prerequisite. Please confirm your environment is updated accordingly.
crates/dummy-cuebot/Cargo.toml (2)
23-23
: Possible annotation error. According to the AI summary, `itertools` was supposed to be removed, but it appears as an added/modified line in the annotation. Please verify whether this dependency should be added or removed.
31-31
: Addition of the `users` crate aligns with PR objectives. The `users` crate (which provides access to Unix users and groups information) makes sense for the frame snapshot feature, particularly when handling processes that become parented by init/systemd after rqd restarts.
crates/opencue-proto/src/lib.rs (1)
75-77
: LGTM! Consistent implementation for resource_id. This implementation follows the same pattern as the existing ID methods, making for a consistent API. The method properly handles conversion errors by falling back to `Uuid::nil()`.
crates/rqd/src/frame/mod.rs (1)
1-5
: Well-structured module organization. The module structure provides a clean separation of concerns for frame management, with public modules (`cache`, `manager`, `running_frame`) exposing the necessary API while keeping implementation details internal (`frame_cmd`, `logging`). This organization aligns well with the PR objectives of implementing a snapshot mechanism for frames.
crates/opencue-proto/Cargo.toml (1)
21-22
: Good addition of serialization dependencies. Adding Serde support is appropriate for implementing the frame snapshot functionality mentioned in the PR objectives. These dependencies will enable serialization and deserialization of protocol buffer types, which is essential for storing frame state to disk and recovering it when rqd restarts.
config/rqd.fake_linux.yaml (1)
13-14
:✅ Verification successful
New snapshot configuration path looks good
The new snapshots path configuration directly supports the PR objective of implementing a snapshot mechanism for frames. Using `$HOME/.rqd/snapshots` follows standard conventions for user-specific application data. Make sure the code handles directory creation if it doesn't exist.
🏁 Script executed:
```bash
#!/bin/bash
# Check if the code creates the snapshot directory if it doesn't exist
rg "create_dir_all|mkdir" --type rust
```

Length of output: 226
Snapshot Directory Creation Verified – Approved
The new snapshots configuration using `$HOME/.rqd/snapshots` correctly follows user-data conventions and supports the PR objective. Our review confirms that the code handles directory creation, as evidenced by the use of `fs::create_dir_all` in both:
- crates/rqd/src/frame/logging.rs
- crates/rqd/src/servant/rqd_servant.rs
No further action is required.
crates/rqd/src/frame/logging.rs (1)
67-67
: Code simplification improvement. Good refactoring of the loop condition. This change simplifies the code by eliminating a temporary variable and directly formatting the path string in the condition check, while maintaining the same functionality.
crates/rqd/src/main.rs (5)
7-7
: Good module organization. The import change properly reflects the new architecture that uses a dedicated FrameManager to handle frame operations, which aligns with the PR's objectives for frame snapshot and recovery.
68-72
: Well-structured FrameManager implementation. The FrameManager implementation is well-organized, encapsulating the configuration, frame cache, and machine monitor components. This structure aligns with the PR objective of improving frame management and recovery.
84-84
: API update maintains consistency. The update to use `frame_manager` instead of `running_frame_cache` in the servant call maintains architectural consistency by centralizing frame management through the FrameManager.
43-43
:✅ Verification successful
Verify RunningFrameCache initialization
The initialization of `running_frame_cache` no longer wraps it in an Arc directly. Make sure that the `RunningFrameCache::init()` method returns an Arc internally, or that this change is consistent with how it's used throughout the codebase.
🏁 Script executed:
```bash
#!/bin/bash
# Check the definition of RunningFrameCache::init
rg "impl RunningFrameCache" -A 20
```

Length of output: 1544
Action: Confirm RunningFrameCache Initialization
Upon verifying the implementation of `RunningFrameCache::init()`, it has been confirmed that it returns an `Arc<Self>` (located in crates/rqd/src/frame/cache.rs). This update is consistent with its usage in main.rs, and no further changes are required.
80-81
:✅ Verification successful
Important snapshot recovery feature
This is a key implementation of the PR objective for frame recovery. The `recover_snapshots` call reads snapshots from the configured path and attempts to reconnect to frames after rqd restarts. Verify error handling for snapshot recovery failures.
🏁 Script executed:
```bash
#!/bin/bash
# Check how errors are handled in the recover_snapshots implementation
rg "fn recover_snapshots" -A 20
```

Length of output: 1581
Snapshot Recovery Error Handling Verified
The implementation in crates/rqd/src/frame/manager.rs properly handles errors during snapshot recovery. The `recover_snapshots` function maps errors from reading the snapshot directory to a descriptive error message, logs a warning, and returns an error using `miette!`. This confirms that failures in reading snapshots are correctly managed, aligning with the PR's objectives for robust frame recovery.
crates/rqd/src/system/machine.rs (1)
17-20
: LGTM! Import reorganization looks good. The restructuring of imports reflects a shift from direct dependency on `running_frame::RunningFrameCache` to a more modular approach with `frame::cache::RunningFrameCache`. This aligns with the snapshot frame implementation mentioned in the PR objectives. Also applies to: 23-23
crates/rqd/Cargo.toml (3)
18-18
: New dependency added for serialization. Adding `bincode` makes sense for serializing frame snapshots as mentioned in the PR objectives.
36-36
: LGTM! Added serde feature for UUID. Enabling the `serde` feature for UUID is necessary to properly serialize/deserialize frames with UUIDs in the snapshot system.
44-44
: LGTM! Process and signal features added. The addition of the `process` and `signal` features to the `nix` crate aligns perfectly with the PR's goal of monitoring process IDs and capturing exit signals during frame recovery.
2-2
: Migrated from RunningFrameCache to FrameManager. This is a significant architectural improvement that aligns with the PR objectives. The shift from a simple cache to a dedicated manager enables the new snapshot mechanism for frames and supports the recovery process when rqd restarts.
Also applies to: 20-24, 28-29
crates/dummy-cuebot/src/report_servant.rs (2)
1-13
: LGTM! Import reorganization and additions for server functionality. The imports have been restructured to better support the server functionality introduced in this PR.
58-71
: Well-implemented server functionality for testing. The addition of `DummyCuebotServer` with the `start_server` method provides a clean interface for testing the frame snapshot and recovery functionality described in the PR objectives. The server implementation correctly:
- Binds to all interfaces (0.0.0.0)
- Configures the RqdReportInterfaceServer
- Handles errors appropriately with diagnostic information
crates/rqd/src/frame/frame_cmd.rs (4)
6-7
: New field added to support exit code capturing. This field is part of the snapshot mechanism implementation mentioned in the PR objectives, allowing the frame to save its exit status to a file for later recovery.
15-18
: LGTM - FrameCmdBuilder initialization updated. Properly initializes the new exit_file field to None by default.
21-41
: Enhanced build method with exit code capture functionality. The build method now properly handles the exit file path by creating a command that saves the exit code to a file. This is a key part of the snapshot mechanism that enables frame recovery after rqd restarts. The implementation captures the exit code of the command and writes it to the specified file before exiting with the same code, which will allow the recovery process to read the exit status accurately.
99-102
: Added builder method for exit file path. Follows the builder pattern consistently with the rest of the class.
crates/opencue-proto/build.rs (1)
48-66
: Added Serde serialization support for protocol buffer types. These changes enable serialization/deserialization for various types that will be used in the snapshot mechanism. This is necessary for saving frame state to disk and recovering it later.
The types being serialized are all related to process statistics and frame running state, which aligns with the PR objective of implementing a snapshot mechanism for frames.
crates/rqd/src/config/config.rs (3)
86-86
: Added Serialize derive for RunnerConfig. This enables the RunnerConfig struct to be serialized, which is needed for the snapshot feature.
99-99
: Added snapshots_path configuration option. This new field supports the snapshot mechanism by allowing configuration of where frame snapshots will be stored.
119-119
:❓ Verification inconclusive
Changed run_as_user default to false.
This change may affect how frames are executed. Ensure this change is intentional and consistent with the overall design.
🏁 Script executed:
```bash
#!/bin/bash
# Check if there are any direct uses of this configuration value in the codebase
rg "run_as_user" --type rust
```

Length of output: 428
Attention: Verify Impact of `run_as_user` Default Change
The new default (`false`) in crates/rqd/src/config/config.rs has been confirmed, and its usage appears consistent across the codebase:
- In crates/rqd/src/frame/running_frame.rs, the flag is checked and explicitly set to false.
- In crates/dummy-cuebot/src/main.rs, it’s also referenced for configuration.
Please verify that this behavior aligns with the overall design and that no component relies on a different default. Double-check that any logic previously expecting a different default is updated accordingly, and ensure the relevant tests cover these execution paths.
crates/rqd/src/servant/running_frame_servant.rs (3)
9-9
: Updated import to use FrameManager instead of RunningFrameCache. This change is part of the architectural shift from a cache-based approach to a more centralized frame management system.
13-13
: Replaced RunningFrameCache with FrameManager. The servant now uses the new FrameManager, aligning with the PR objective of implementing a snapshot mechanism for frames.
17-18
: Updated initialization to use FrameManager. The initialization is properly updated to use the new FrameManager structure.
crates/dummy-cuebot/src/rqd_client.rs (1)
65-73
: Looks good. Locking the client, building the request, and error handling with `into_diagnostic` and `wrap_err` are well-structured. This design cleanly surfaces any connectivity issues to the caller.
crates/rqd/src/frame/cache.rs (1)
16-21
: Cache initialization looks good. Using `DashMap::with_capacity(30)` provides a sensible initial capacity. The code is clear, and the concurrency approach with DashMap is appropriate.
crates/rqd/src/servant/rqd_servant.rs (8)
1-3
: Imports look consistent and necessary. These additions for `fs`, `path`, and `Arc` align with the usage in this file.
6-6
: New import for `FrameManager`. This import reflects the refactor to manage frames through a dedicated manager.
10-10
: Additional gRPC-related imports. The code pulls in request/response types and the `rqd_interface_server::RqdInterface` trait. This is consistent with implementing the rqd gRPC interface. Also applies to: 22-22
24-24
: `async_trait` import from tonic. This is standard for asynchronous service trait implementations.
32-32
: Added `frame_manager` field to `RqdServant`. Storing `Arc<FrameManager>` is a clean way to coordinate frame lifecycle.
39-39
: Constructor parameter for `frame_manager`. Accepting `Arc<FrameManager>` centralizes frame management at init time, improving modularity.
51-58
: `setup` function ensures snapshot path. Creating the directory if missing is straightforward. Error handling is already bubbled up. Code looks fine.
95-95
: Spawning a frame via `frame_manager`. Delegating frame spawning to the manager helps maintain a single source of truth for frame lifecycle. Good approach.
crates/dummy-cuebot/src/main.rs (9)
1-1
: Additional imports and module references. The new imports (`HashMap`, `FromStr`, `miette`, `report_servant`, `rqd_client`) are aligned with the updated functionality for the server-client CLI. Also applies to: 3-3, 4-4, 8-8
23-30
: `ReportServerCmd` now enforces a default port. Using an explicit default value adds clarity and avoids handling `Option<u16>`. No issues.
35-41
: `hostname` field with a default of "localhost". Requiring a non-optional hostname is a solid approach.
43-49
: Rqd client `port` with a default of 4344. Similar pattern to `ReportServerCmd`. Straightforward.
61-67
: New `LaunchFrameCmd` struct. Captures command, environment, and user-run flag. This expands functionality elegantly.
69-70
: `EnvVars` wrapper struct. Encapsulating environment variables in a dedicated type can help keep logic organized.
72-89
: `FromStr` for `EnvVars`. Parsing comma-separated `key=value` pairs is handled neatly. The error messages are also appropriately descriptive.
95-96
: Starting the DummyCuebot server. The call to `DummyCuebotServer::start_server` is straightforward. No issues spotted.
98-100
: Constructing client and launching frame. Building `DummyRqdClient` and passing environment variables from `EnvVars` meets the new structure's requirements and is well-structured. Also applies to: 101-117
crates/rqd/src/frame/running_frame.rs (24)
6-6
: Added `Read` to imports. Used for reading data from files or streams. Fits the new exit-file reading logic.
16-16
: `ExitStatusExt` and `mpsc` introduced. `ExitStatusExt` to retrieve signals on Unix, and `mpsc` for logging thread communication. Both are valid.
19-25
: Tracing, command builder, and serialization imports. These additions (`tracing`, `FrameCmdBuilder`, `Serialize`, `Deserialize`, `sysinfo`) are central to the new snapshot and logging functionalities.
30-30
: Importing local logging components. `FrameLogger` and `FrameLoggerBuilder` unify how logs are streamed.
50-53
: Exit file path & custom (de)serialization hooks. Storing exit status externally and skipping the thread handle in snapshots are sensible for the new recovery logic.
56-56
: `RunningState` gains `exit_signal` and custom skip logic. Including the exit signal helps differentiate normal vs. signal-based termination. The skip logic for the thread handle is well-executed. Also applies to: 60-66
72-72
: Default for `exit_signal`. Ensures the field is initialized even when no signal occurs. Straightforward.
77-97
: `FrameStats` now cloned and serialized. This helps capture state in snapshots. The usage of `sysinfo::ChildrenProcStats` fits well with the broader resource monitoring approach.
140-166
: Initialization with `resource_id` and exit file path. Appending `resource_id` to logs ensures uniqueness. Verify that `resource_id` is always set.
188-203
: Doc comments for `update_launch_thread_handle`. Clear explanation, ensuring the concurrency aspect is understood.
204-219
: `update_exit_code_and_signal` method. Setting both exit code and signal closes the gap for signal-based terminations.
280-325
: `run` method handles recovery vs. normal flow. Snapshot clearing is prudent. Consider concurrency if `run` is called multiple times.
328-421
: `run_inner` method with updated exit code logic. Snapshotting after spawn ensures we can recover if RQD crashes mid-execution. Overall flow is coherent.
461-505
: `recover_inner` for resuming frames. Reading the exit file and waiting for the process are good strategies.
516-517
: `pid()` accessor. Provides optional process ID. Straightforward concurrency usage of `Mutex`.
539-576
: `read_exit_file` to parse the final exit code. Error handling is robust; reusing the 128+signal pattern is consistent with Unix conventions.
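The 128+signal convention mentioned here can be illustrated with a small decoder. This is a hedged sketch: `decode_exit_status` is a hypothetical helper and may not match `read_exit_file`'s exact parsing.

```rust
// Decode a raw saved exit status using the Unix shell convention: values
// above 128 usually mean "terminated by signal (value - 128)".
// Returns (exit_code, signal); exactly one side is Some.
fn decode_exit_status(raw: i32) -> (Option<i32>, Option<i32>) {
    if raw > 128 {
        (None, Some(raw - 128))
    } else {
        (Some(raw), None)
    }
}

fn main() {
    assert_eq!(decode_exit_status(0), (Some(0), None));    // clean exit
    assert_eq!(decode_exit_status(1), (Some(1), None));    // ordinary failure
    assert_eq!(decode_exit_status(137), (None, Some(9)));  // 128 + SIGKILL
    assert_eq!(decode_exit_status(143), (None, Some(15))); // 128 + SIGTERM
}
```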
593-623
: `wait` function polls for process termination. Using `kill(pid, None)` plus a 500ms sleep is typical on Unix. Fine for basic usage.
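A std-only, Linux-specific alternative to `kill(pid, None)` polling is checking for `/proc/<pid>`. This is a hypothetical sketch, not rqd's actual `wait`; on non-Linux systems the `/proc` check simply reports the process as gone.

```rust
use std::path::Path;
use std::thread::sleep;
use std::time::Duration;

// Poll until a process (not necessarily our child) disappears.
// On Linux, /proc/<pid> vanishes once the process is gone and reaped.
fn wait_for_pid(pid: u32, poll: Duration) {
    while Path::new(&format!("/proc/{pid}")).exists() {
        sleep(poll);
    }
}

fn main() {
    let mut child = std::process::Command::new("sleep")
        .arg("0.2")
        .spawn()
        .unwrap();
    let pid = child.id();
    // Reap the child on another thread so /proc/<pid> actually disappears
    // (an unreaped zombie would keep its /proc entry alive).
    let reaper = std::thread::spawn(move || {
        child.wait().unwrap();
    });
    wait_for_pid(pid, Duration::from_millis(50));
    reaper.join().unwrap();
    println!("process {pid} finished");
}
```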
707-719
: `snapshot` uses bincode for persistence. Simple approach. If partial or concurrent writes are possible, consider a more atomic solution.
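The atomic-solution suggestion can be sketched as write-to-temp-then-rename, using std only. The helper name and the snapshot bytes are illustrative, not rqd's actual snapshot format.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Serialize into a temporary file, flush it, then rename() it over the
// final path. On POSIX filesystems rename is atomic, so a reader never
// observes a half-written snapshot.
fn write_snapshot_atomically(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(bytes)?;
    file.sync_all()?; // ensure data hits disk before the rename
    fs::rename(&tmp, path)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("frame_snapshot_demo.bin");
    write_snapshot_atomically(&path, b"serialized frame state")?;
    assert_eq!(fs::read(&path)?, b"serialized frame state");
    println!("snapshot written to {}", path.display());
    Ok(())
}
```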
721-724
: `clear_snapshot` removes snapshot file. Method is direct. The caller handles ignoring errors if the snapshot doesn't exist.
748-780
: `from_snapshot` rehydrates `RunningFrame`. Properly ensures the process is still alive and updates with the new config.
782-790
: `is_process_running` using `sysinfo`. Partial refresh of a single PID is a good optimization.
849-859
: Custom serialization for `Arc<Mutex<RunningState>>`. Skipping the thread handle is a clever workaround to avoid non-serializable data. Well handled. Also applies to: 861-877, 879-889
900-901
: Additional internal imports in tests. Refers to `RunnerConfig` and `FrameLogger`. Harmless enhancements. Also applies to: 902-902
965-967
: Augmented test assertions. These changes check both exit code and signal, ensuring coverage of normal and failing scenarios. Also applies to: 971-974, 985-988, 1006-1007, 1010-1011, 1012-1015, 1030-1031, 1033-1037, 1050-1051, 1052-1056, 1073-1074, 1080-1084, 1100-1101, 1108-1112, 1126-1127, 1128-1133
Actionable comments posted: 0
🧹 Nitpick comments (3)
crates/dummy-cuebot/src/main.rs (3)
69-89
: Environment variables parsing has limitations. The current implementation of `EnvVars` doesn't support environment variables with `=` in their values, which is a valid use case (e.g., `PATH=/usr/bin:/usr/local/bin`). Consider using a more robust parsing approach that can handle values containing equals signs:
```diff
-        for pair in s.split(',') {
-            let parts: Vec<&str> = pair.split('=').collect();
-            if parts.len() != 2 {
-                return Err(format!("Invalid key-value pair: {}", pair));
-            }
-
-            map.insert(parts[0].trim().to_string(), parts[1].trim().to_string());
+        for pair in s.split(',') {
+            // Find the first equals sign
+            if let Some(idx) = pair.find('=') {
+                let key = pair[..idx].trim().to_string();
+                let value = pair[idx+1..].trim().to_string();
+                map.insert(key, value);
+            } else {
+                return Err(format!("Invalid key-value pair (missing '='): {}", pair));
+            }
```
95-96
: Unnecessary clone on primitive type. The `clone()` on a primitive `u16` is unnecessary, as `u16` implements Copy.

```diff
- DummyCuebotServer::start_server(report_server_cmd.port.clone()).await
+ DummyCuebotServer::start_server(report_server_cmd.port).await
```
98-118
: Optimize client initialization and refactor for readability. Several improvements could be made to this section:
- Avoid unnecessary cloning
- Extract the environment variable handling for better readability
- Refactor conditional UID extraction for clarity
```diff
-            let client =
-                DummyRqdClient::build(rqd_client_cmd.hostname.clone(), rqd_client_cmd.port)
-                    .await?;
-            match &rqd_client_cmd.api_method {
-                ApiMethod::LaunchFrame(launch_frame_cmd) => {
-                    let uid = launch_frame_cmd
-                        .run_as_user
-                        .then(|| users::get_current_uid());
-                    client
-                        .launch_frame(
-                            launch_frame_cmd.cmd.clone(),
-                            launch_frame_cmd
-                                .env
-                                .clone()
-                                .unwrap_or(EnvVars(HashMap::new()))
-                                .0,
-                            uid,
-                        )
-                        .await
-                }
+            let client = DummyRqdClient::build(&rqd_client_cmd.hostname, rqd_client_cmd.port)
+                .await?;
+            match &rqd_client_cmd.api_method {
+                ApiMethod::LaunchFrame(launch_frame_cmd) => {
+                    // Extract UID if running as user
+                    let uid = if launch_frame_cmd.run_as_user {
+                        Some(users::get_current_uid())
+                    } else {
+                        None
+                    };
+
+                    // Prepare environment variables
+                    let env_vars = launch_frame_cmd
+                        .env
+                        .as_ref()
+                        .map_or_else(HashMap::new, |env| env.0.clone());
+
+                    client
+                        .launch_frame(&launch_frame_cmd.cmd, env_vars, uid)
+                        .await
+                }
```

Note: The refactoring assumes the `DummyRqdClient::build` and `launch_frame` methods can be updated to accept references where appropriate.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- config/rqd.fake_linux.yaml (2 hunks)
- crates/dummy-cuebot/src/main.rs (3 hunks)
- crates/rqd/src/config/config.rs (4 hunks)
- crates/rqd/src/frame/manager.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- config/rqd.fake_linux.yaml
- crates/rqd/src/config/config.rs
- crates/rqd/src/frame/manager.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (4)
crates/dummy-cuebot/src/main.rs (4)
1-1
: Updated imports to support new functionality. The addition of `HashMap` and `FromStr` is appropriate for the new environment variables functionality.
23-31
: Good simplification of ReportServerCmd port handling. Changing from `Option<u16>` to a required `u16` with a default value removes unnecessary option handling and makes the code more straightforward.
34-53
: Improved type safety and simplicity in RqdClientCmd. Converting `port` from `Option<String>` to `u16` with a default value improves type safety and removes unnecessary option handling. The hostname field change similarly simplifies the code.
61-67
: New LaunchFrameCmd structure provides clear configuration options. The new structure cleanly encapsulates all the parameters needed for launching a frame, with appropriate documentation on the command-line options.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
crates/rqd/src/frame/running_frame.rs (1)
140-186
: 🛠️ Refactor suggestion: Consider ensuring snapshot path creation. Lines 168-169 comment "Ensure snapshot path exists," but no implementation is provided. If the directory is missing, snapshot creation will fail. Example fix:

```rust
use std::fs;
// ...
// Ensure snapshot path exists
fs::create_dir_all(&config.snapshots_path).into_diagnostic()?;
```
🧹 Nitpick comments (7)
crates/rqd/src/frame/manager.rs (4)
117-164
: Consider asynchronous I/O for reading snapshot files. Performing blocking I/O calls to read large snapshot directories in a potentially async environment can cause performance bottlenecks. Using `tokio::fs` or an equivalent approach might improve scalability.
166-185
: Mixing thread spawn with async environment. This manual thread creation and panic capturing is acceptable when heavy blocking tasks are required. If tighter integration with async is desired, consider switching to `tokio::task::spawn_blocking` for consistency and centralized runtime management.
190-190
: Fix the minor typo in the error message. "Not lauching" should be "Not launching".

```diff
- "Not lauching, frame is already running on this host {}",
+ "Not launching, frame is already running on this host {}",
```
225-243
: Optional logging of hardware state.Consider logging the current hardware state for easier debugging if frame spawn is blocked because of the host status.
crates/rqd/src/frame/running_frame.rs (3)
34-54
: Public fields may reduce encapsulation. Exposing so many fields with `pub` can reduce encapsulation. If external usage does not strictly require direct field access, consider using getters/setters or restricting visibility.
425-427
: run_on_docker is unimplemented. This placeholder remains a `todo!()`. Let me know if you'd like assistance implementing it, or if this is intended for a future PR.
429-432
: Windows run_inner is unimplemented. Same as Docker: currently stubbed out with `todo!()`. Happy to help provide an implementation or open a new issue if needed.
📒 Files selected for processing (5)
crates/rqd/src/config/config.rs
(7 hunks)crates/rqd/src/config/error.rs
(1 hunks)crates/rqd/src/frame/manager.rs
(1 hunks)crates/rqd/src/frame/running_frame.rs
(21 hunks)crates/rqd/src/servant/rqd_servant.rs
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/rqd/src/config/config.rs
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (33)
crates/rqd/src/config/error.rs (1)
13-15
: Well-defined error variant for path issues. Adding the `InvalidPath(String)` variant effectively distinguishes misconfiguration scenarios and ensures better error messaging and handling when paths are invalid.
crates/rqd/src/servant/rqd_servant.rs (4)
1-2
: No immediate concerns with new imports. These imports (`std::fs` and `std::path::Path`) might be used for I/O operations. They appear safe to include.
32-32
: Use of `Arc<FrameManager>` looks correct. Replacing the old cache reference with a `FrameManager` reference centralizes frame handling. This aligns with the design direction of encapsulating frame logic in one place.
39-39
: Clear function signature reflecting the new field. Exchanging the `running_frame_cache` parameter with `frame_manager` in `init` is consistent with the new architecture. Ensure that all usage sites were updated accordingly.
83-83
: Async spawn call for running frames is well placed. Invoking `self.frame_manager.spawn(run_frame).await?` centralizes frame launch logic. Ensure any resulting error is properly surfaced in logs or user-facing messages.
crates/rqd/src/frame/manager.rs (1)
246-270
: Error definitions look comprehensive. Your `FrameManagerError` variants neatly map to gRPC statuses. This design clarifies failure cases and ensures consistent handling across the codebase.
crates/rqd/src/frame/running_frame.rs (27)
6-6
: New imports look appropriate
All newly introduced imports are consistent with the subsequent code usage. No immediate concerns regarding security, correctness, or performance.Also applies to: 16-16, 19-19, 21-21, 23-23, 24-24, 26-26, 27-27, 30-30
56-74
: RunningState structure changes. The addition of an `exit_signal` field and the usage of `#[serde(skip_serializing)]` for the thread handle look correct. No glaring concurrency or correctness issues identified.
77-97
: FrameStats updates
The structure alterations appear to be straightforward expansions of resource-tracking fields. No issues found.
190-205
: Storing launch thread handle
Using a dedicated method to store the join handle is a clear approach for lifecycle management. No further concerns.
206-221
: Update exit code and signal
This design neatly centralizes exit state updates. Looks good as implemented.
271-327
: run method introduces recovery logic
The logic distinguishing regular run vs. recover mode looks coherent. Error handling is properly logged.
330-423
: run_inner method
Spawning the command, capturing logs, and snapshotting post-spawn all align well. Error handling is improved with informative diagnostics.
434-461
: spawn_logger concurrency
Creating a background thread to pipe stdout/stderr is well-structured. Properly receiving a signal to terminate ensures it won’t block indefinitely.
463-507
: recover_inner logic
Good approach for reattaching logs and waiting until the process completes. Graceful fallback to signal-based assumption of exit if reading the exit file fails.
509-519
: pid retrieval
Straightforward locking to retrieve internal PID. No concurrency pitfalls spotted here.
521-578
: read_exit_file handling
Opening and parsing the file with robust error messages is good. Using exit_code < 128 vs. ≥ 128 logic to detect signals is correct.
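For reference, the `< 128` vs `≥ 128` convention discussed here can be captured in a small helper. This is an illustrative sketch, not rqd's actual API; the function name is hypothetical.

```rust
/// Interpret a raw status read back from the exit file.
/// POSIX shells report `128 + N` when a command dies from signal N,
/// so anything at or above 128 is treated as signal-driven here.
fn interpret_exit_status(raw: i32) -> (i32, Option<i32>) {
    if raw >= 128 {
        (raw, Some(raw - 128))
    } else {
        (raw, None)
    }
}

fn main() {
    assert_eq!(interpret_exit_status(0), (0, None));
    // 143 = 128 + 15, i.e. terminated by SIGTERM
    assert_eq!(interpret_exit_status(143), (143, Some(15)));
}
```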
580-625
: wait method. Polling with `kill(..., None)` is a reasonable fallback, and the 500 ms sleep avoids busy-loop extremes. The implementation appears sound.
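For illustration, the polling pattern described above can be sketched in std-only Rust by checking `/proc/<pid>` existence as a Linux-only stand-in for `kill(pid, None)`; the function name and the timeout parameter are assumptions, not rqd's API.

```rust
use std::{path::Path, thread, time::{Duration, Instant}};

/// Poll until the process disappears or the timeout elapses.
/// Returns true if the process exited within the timeout.
/// Checking /proc/<pid> is a Linux-only stand-in for kill(pid, None).
fn wait_for_exit(pid: u32, timeout: Duration) -> bool {
    let proc_path = format!("/proc/{pid}");
    let start = Instant::now();
    while Path::new(&proc_path).exists() {
        if start.elapsed() >= timeout {
            return false; // still running after the deadline
        }
        thread::sleep(Duration::from_millis(500));
    }
    true
}

fn main() {
    // A PID above the kernel's pid_max cannot exist, so this returns quickly.
    assert!(wait_for_exit(4_294_967_295, Duration::from_secs(1)));
}
```

A timeout parameter like this also addresses the later comment about the loop potentially blocking indefinitely.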
688-720
: snapshot & snapshot_path
The serialization approach looks good. As noted earlier ininit
, ensure the directory is created before writing snapshots to disk.
722-725
: clear_snapshot
Simply removing the file after successful run or recovery is concise and safe.
783-790
: is_process_running
Using the sysinfo crate to verify an active process is a convenient approach.
849-853
: none_value helper
Simple function returningNone
. No issues.
854-878
: serialize_running_state
Skipping the thread handle during serialization is correct to avoid undefined behavior. Nicely handled fallback on poison.
880-889
: deserialize_running_state
Wrapping the deserializedRunningState
inArc<Mutex<...>>
is nicely done.
901-903
: Test imports
Introducing config and logging modules for testing looks consistent with test usage.
967-979
: test_run_logs_stdout_stderr
Checks for exit code and verifies stdout/stderr appear in the logger. Straightforward test coverage.
984-992
: test_run_failed
Verifies non-zero exit code for a failing command. Nicely done.
996-1019
: test_run_multiline_stdout
Ensures multiple lines are captured in order. Also confirms exit file content.
1024-1041
: test_run_env_variables
Appropriately checks that environment variables are respected.
1054-1061
: test_run_command_not_found
Verifies non-zero exit code for a missing command. Good negative test scenario.
1077-1088
: test_run_sleep_command
Ensures a minimal sleep command runs the expected duration. Straightforward.
1093-1116
: test_run_interleaved_stdout_stderr
Correctly checks interleaved logs for correctness.
1118-1137
: test_run_with_special_characters
Good coverage for environment variables containing special symbols.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
crates/rqd/src/system/linux.rs (1)
541-599
: 🛠️ Refactor suggestionRobust filtering but watch out for partial reservations.
Thereserve_cores
logic filters available cores effectively, but note that partial reservations might stay allocated if an error interrupts after partially reserving some cores. A transaction-like approach or final rollback mechanism could avoid orphaned reservations.
🧹 Nitpick comments (14)
crates/rqd/src/frame/manager.rs (4)
124-171
: Validate parallel recoveries or concurrency.
Therecover_snapshots
method effectively iterates over snapshots and spawns frames. In scenarios with many concurrent snapshots, consider potential concurrency or partial recovery. For instance, failing on one snapshot currently doesn’t halt the entire process, which is good, but you might want to ensure the method is safe under high concurrency if multiple calls torecover_snapshots
are possible.
173-191
: Consider tokio spawning for asynchronous scheduling.
spawn_running_frame
uses a blocking thread. This is perfectly fine for certain workloads, but if you plan to integrate with a larger async ecosystem, consider usingtokio::task::spawn_blocking
or an async runtime approach to keep the system responsive.
232-250
: Optional: store previous hardware states for troubleshooting.
The machine validation is concise. If debugging frequent or intermittent hardware state transitions, consider logging or storing previous states to help track changes over time.
253-265
: Expand error coverage if needed.
TheFrameManagerError
covers core scenarios well. If you anticipate partial resource allocation or cleaning issues, consider adding specialized variants to detail those cases.crates/rqd/src/system/linux.rs (1)
459-475
: Validate concurrency if multiple threads callreserve_core
.
This function correctly reducesavailable_cores
and updates reserved data structures. Ensure external concurrency is well-managed if multiple calls might interfere.crates/rqd/src/frame/running_frame.rs (9)
36-53
: Consider encapsulating new public fields.
Exposing all fields aspub
could increase coupling and reduce the maintainability of your code, especially for fields likeexit_file_path
which might need validation or sanitization. If external read/write access is necessary, consider using getter/setter methods or implementingpub(crate)
where possible.
146-162
: Potential risk in path generation.
Appendingresource_id
to paths is functional, but watch for collisions or unexpected special characters from user input. Consider sanitizing or validating the resource ID if it can come from untrusted sources.
271-327
: Check error handling flow inrun()
.
Whenrecover_mode
istrue
, any spawn-related errors bubble up, but the code logs the error and returns. Confirm this aligns with higher-level expectations (e.g., does the system handle partial recoveries gracefully?).
330-423
: Comprehensive implementation forrun_inner
.
- Good approach to log final exit code.
- Consider ensuring that
snapshot()
success/failure is also logged more distinctly for easier debugging.- Potential improvement: timeouts on
child.wait()
to avoid indefinite blocking if the process hangs.
425-427
:run_on_docker
remains unimplemented.
If Docker support is planned, consider returning an explicit error or removing the stub if it's not on the roadmap.
429-432
: Windows branch is unimplemented.
Same recommendation as Docker: either implement or remove the placeholder.
434-462
:spawn_logger
concurrency check
Ensure the log thread terminates properly even if the channel signal is delayed or never sent (e.g., in panic scenarios). Right now, it loops indefinitely, which is correct but can leak resources if the signal is never received.
580-625
: Polling-basedwait
might block indefinitely.
If a process fails to exit,thread::sleep(..)
loop runs forever. Consider a timeout or user-configurable max wait period.
688-726
: Snapshot logic improvement.
snapshot_path()
: rely on the same directory creation approach to prevent I/O errors.snapshot()
: consider versioning or controlling snapshot size.clear_snapshot()
: good to remove ephemeral data, but confirm concurrency safety if multiple frames share the same path naming scheme.
📒 Files selected for processing (4)
crates/rqd/src/frame/manager.rs
(1 hunks)crates/rqd/src/frame/running_frame.rs
(21 hunks)crates/rqd/src/system/linux.rs
(13 hunks)crates/rqd/src/system/machine.rs
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (36)
crates/rqd/src/frame/manager.rs (3)
14-18
: Struct looks clean and succinct.
TheFrameManager
struct neatly declares the required fields for configuration, frame caching, and host interaction. This design is clear and maintainable.
193-230
: Input checks are well-handled.
This method provides clear validations for frames, including non-root enforcement and fractional core rejections. The structure is straightforward and helps avoid invalid states early on.
267-276
: Error-to-status translation looks correct.
Mapping custom errors to gRPC status codes is comprehensive and aligns with typical usage oftonic::Status
.crates/rqd/src/system/linux.rs (6)
2-59
: Minor import and struct updates.
Changes in import statements and supporting code look fine, reflecting new types and references without apparent issues.Also applies to: 867-873
883-894
: Test approach is straightforward.
test_reserve_cores_success
verifies correct allocation. The usage is concise and ensures basic coverage.
897-907
: Good negative test coverage.
test_reserve_cores_insufficient_resources
properly confirms that the function rejects requests surpassing available resources.
910-919
: Full allocation confirmed.
test_reserve_cores_all_available
ensures the function handles exact capacity requests. Coverage is appropriate.
922-942
: Affinity distribution validated.
test_reserve_cores_affinity
checks that selected cores are grouped. This is helpful to confirm the intended scheduling logic.
945-964
: Sequential reservations tested thoroughly.
test_reserve_cores_sequential_reservations
ensures multiple calls toreserve_cores
behave as expected. This is a solid scenario test.crates/rqd/src/system/machine.rs (5)
4-4
: Import additions look good.
The added imports for tracing, UUID, and time utilities align with the new resource reservation logic and logging needs.Also applies to: 16-17, 21-21, 25-25
181-201
: Machine trait implementation: hardware reporting is concise.
Thehardware_state
method returning an optional state fromRenderHost
is a neat approach. No major issues are apparent.
220-226
: Frame ID usage ensures resource tracking.
reserve_cpus
now acceptsframe_id
, helping tie reservations to specific frames. This is a solid improvement for debugging and accountability.
227-241
: Release loop handles partial errors gracefully.
Releasing each core individually logs errors without blocking subsequent releases. This approach helps avoid leftover leaks whenever partial releases fail.
354-382
:CoreReservation
encapsulation is beneficial.
Storingreserver_id
and maintaining astart_time
can facilitate advanced scheduling or cleanup tasks. The methodsinsert
andremove
are straightforward.crates/rqd/src/frame/running_frame.rs (22)
6-6
: No concerns with newly added imports.
They correctly bring in needed libraries (e.g.,sysinfo
,serde
,miette
) without extraneous dependencies.Also applies to: 16-16, 19-19, 21-21, 23-24, 26-27, 30-30
60-63
: Confirm concurrency handling forlaunch_thread_handle
.
Marking thelaunch_thread_handle
as skipped in serialization is appropriate. However, ensure that concurrency scenarios are handled (e.g., if the handle is never joined, leftover threads might continue running).
77-97
: Looks good for FrameStats serialization.
MakingFrameStats
serializable is useful for snapshotting. Verify that large output from GPU or memory stats does not bloat snapshot files.
140-140
: Validateresource_id
usage.
The newresource_id
field is read from the request but not validated. Ensure the user input is trusted or safely sanitized before using it in file paths.
190-198
: Good documentation for thread handle.
Clear docstring describing usage of the launch thread handle. No functional issues spotted.
206-214
: Update exit code & signal with caution.
Ensure this is called only when the process has truly exited, and confirm concurrency is avoided if multiple threads might call it.
220-220
: No issue with setting the exit signal.
Storing the signal can be valuable for debugging abrupt terminations.
262-262
: No concerns with static fallback PATH.
This is a reasonable fallback for non-Windows systems.
463-507
: Recovery logic is robust, but watch for partial states.
- If the process unexpectedly died before RQD restarted, we handle exit code from the file or default to
(1, Some(143))
. Consider more granular checks if the process died with a different signal.- Make sure logs reflect partial or incomplete states for better debugging.
509-519
: PID accessor is straightforward.
No major issues. Just ensure it's not used for active process checks without verifying the process is alive (i.e.is_process_running
).
521-579
: Robustread_exit_file
.Observations
- Parsing exit codes >=128 as signal-based is standard.
- Watch for negative or extremely large values in the file that might indicate corruption.
727-788
: [Duplicate of previous PID reuse concern].
PID reuse remains a potential bug if the OS cycles through PIDs quickly. The code checks only whether the PID is alive, not whether it’s the same process. Consider storing start time or another unique token in the snapshot to confirm the process identity.
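One way to implement the suggested start-time check on Linux (a sketch reading field 22 of `/proc/<pid>/stat` per `proc(5)`; the helper name is hypothetical, not rqd's code):

```rust
use std::fs;

/// Read the kernel start time of a process (field 22 of /proc/<pid>/stat).
/// A snapshot can store this value alongside the PID; on recovery,
/// a mismatch means the PID was recycled by a different process.
fn process_start_time(pid: u32) -> Option<u64> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat")).ok()?;
    // The comm field may contain spaces/parens; skip past its closing ')'.
    let rest = stat.rsplit_once(')')?.1;
    // Fields after comm begin at field 3 (state); starttime is field 22.
    rest.split_whitespace().nth(19)?.parse().ok()
}

fn main() {
    // Our own process must have a readable start time.
    assert!(process_start_time(std::process::id()).is_some());
}
```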
855-899
: Custom serialization forArc<Mutex<RunningState>>
.
Skipping the thread handle ensures safe (de)serialization. Confirm that ignoring the thread handle on restore is always acceptable (i.e., no critical logic is lost).
907-910
: New test imports appear valid.
No issues; they match the newly added tests.
977-984
: Enhanced check for exit status intest_run_logs_stdout_stderr
.
Validates both normal exit and correct logging. Good coverage.
998-998
:test_run_failed
checks exit code (1).
Verifies a failing command returns1
. This is an important negative test.
1016-1025
:test_run_multiline_stdout
:
Nice check of multiple lines plus verifying the exit code file.
1042-1047
:test_run_env_variables
:
Asserting environment variables are echoed. Confirms the environment setup logic.
1060-1066
:test_run_command_not_found
:
Confirms a non-zero exit code for invalid commands. Good negative-path test.
1083-1093
:test_run_sleep_command
:
Verifies timing, ensures the command is indeed run. Nice coverage.
1110-1122
:test_run_interleaved_stdout_stderr
:
Checks interleaved output is captured in logs. Great stress test for concurrency in logging.
1136-1143
:test_run_with_special_characters
:
Useful test to confirm environment variable handling. No concerns.
Actionable comments posted: 1
🧹 Nitpick comments (4)
crates/rqd/src/frame/running_frame.rs (4)
269-325
:run()
method logic is sound.
The option to recover or launch a fresh frame, and then clear snapshots if the frame was successfully spawned, is well-structured. A minor enhancement could be to snapshot again post-completion for debugging, but this is not critical.
328-421
: Consider adding a configurable timeout inrun_inner()
.
While the current loop design is sufficient for many scenarios, introducing a way to automatically terminate or time out a job can prevent indefinite resource usage. Otherwise, this logic—updating the PID, spawning the child, saving snapshots, logging, and retrieving the exit status—looks correct.
423-425
:run_on_docker
is unimplemented.
If you’d like help completing this Docker integration or want to track it, feel free to open an issue.
427-431
: Windows runner is unimplemented.
Similar to Docker, if you plan on supporting Windows, consider opening a follow-up issue or PR to flesh out this method.
📒 Files selected for processing (1)
crates/rqd/src/frame/running_frame.rs
(21 hunks)
🔇 Additional comments (25)
crates/rqd/src/frame/running_frame.rs (25)
6-6
: All newly added import statements look good.
These additions (e.g.,ExitStatusExt
,mpsc
,tracing
macros, Serde traits, etc.) are necessary for the enhanced process-management and serialization logic.Also applies to: 16-16, 19-19, 21-21, 23-23, 24-24, 26-26, 27-27, 30-30
56-75
: Serialization ofRunningState
looks correct.
The addition ofexit_signal: Option<i32>
and skipping the thread handle during serialization is a neat approach. This ensures that any runtime-specific data (e.g.JoinHandle
) is not inadvertently persisted.
77-97
: FrameStats serialization is fine.
Exposing GPU/CPU stats fields for post-run analysis or recovery makes sense. The changes look consistent and safe.
140-183
: Ensure output directories exist for log and exit files. These lines introduce `exit_file_path` (and also refine `raw_stdout_path`/`raw_stderr_path`) but do not guarantee that the `request.log_dir` directory is present. This can lead to runtime errors if the directory is missing. Consider calling `std::fs::create_dir_all(...)` before file creation.
188-203
: Addition ofupdate_launch_thread_handle
is well-documented.
Storing theJoinHandle
for monitoring or later joins is a clean approach. No concerns here.
204-219
: Exit code & signal updates are correctly implemented.
The docstring is also clear on usage. This straightforward approach properly encapsulates termination details.
432-459
: Logger thread spawning approach looks good.
Using a channel to signal completion and safely remove raw log files prevents leftover artifacts.
461-505
: Recovery logic is straightforward and consistent.
Falling back to(1, Some(143))
(SIGTERM) if reading the exit file fails is a decent default; it signals the frame was likely killed.
507-517
:pid()
accessor implementation is fine.
The lock-based retrieval ensures thread-safe access to the stored PID.
519-576
: Validate the assumption that exit codes ≥ 128 indicate signals. POSIX-based systems typically encode a fatal signal in the exit code as `128 + signal`. However, some processes may return codes > 128 for other reasons. If that possibility is relevant, consider clarifying or adjusting the logic.
593-623
: Polling withkill(nix_pid, None)
is acceptable.
This loop with a 500ms sleep is a standard approach for checking process liveness. No issues found.
686-723
: Directory creation for snapshots remains unaddressed.
The snapshot path is created viaFile::create
but never ensures thatself.config.snapshots_path
exists. As previously mentioned, usecreate_dir_all(&self.config.snapshots_path)
or an equivalent solution.
725-785
: PID uniqueness caveat is documented.
You have a TODO for verifying PID reuse infrom_snapshot
. This acknowledges the past review comment. Implementation or process start-time checks can mitigate incorrect reattachments.
787-794
:is_process_running
logic is straightforward.
Relying onsysinfo
to query the presence of the PID is a clean approach.
853-895
: Custom (de)serialization forArc<Mutex<RunningState>>
is well-handled.
Skipping the thread handle during serialization prevents runtime handle leaks. Good use of#[serde(skip_...)]
.
905-908
: New test imports are correct.
They support the refined test suite for reading exit files, environment variables, etc.
919-921
: Config setup in tests is fine.
This ensures thatRunnerConfig
can be customized for each test scenario.
975-982
: Test verifies reading exit file for successful runs.
These assertions confirm correctness of the new exit-file logic.
989-998
:test_run_failed
covers non-zero exit codes.
Testing a failure scenario ensures robust exit-code handling.
1003-1023
:test_run_multiline_stdout
is thorough.
Verifying multiple lines and matching the exit file’s status covers typical console output behaviors.
1038-1045
:test_run_env_variables
checks environment correctness.
Ensuring environment variable injection is validated helps confidence in production usage.
1058-1063
:test_run_command_not_found
ensures non-zero exit code.
Good negative test scenario to confirm error handling.
1069-1092
:test_run_sleep_command
verifies timing logic.
Assessing command duration is a nice addition. No concerns.
1108-1120
:test_run_interleaved_stdout_stderr
ensures correct ordering.
Confirming both streams show up as expected is a great coverage scenario.
1134-1141
:test_run_with_special_characters
expands coverage.
Verifying unusual characters in environment variables helps prevent unexpected parsing failures.
Actionable comments posted: 0
♻️ Duplicate comments (1)
crates/rqd/src/frame/manager.rs (1)
21-123
: 🛠️ Refactor suggestionRefined spawn logic.
- Good check for nimby locks and hardware state.
- Clean user creation flow with meaningful error messages.
- Resource rollback is partially handled for GPU-allocate failures; however, CPU rollback after line 62 warns about potential resource leaks if more errors occur post-CPU reservation.
```diff
 // Pseudocode to unify all resource rollbacks if something fails:
 let cpu_list = // ...
 let gpu_list = match ... {
     Err(err) => {
+        if let Some(procs) = &cpu_list {
+            self.machine.release_cpus(procs).await;
+        }
         return Err(err);
     }
     Ok(gpu_list) => gpu_list,
 };
 ...
 // For any other subsequent error, also revert CPU/GPU reservations.
```
🧹 Nitpick comments (5)
crates/rqd/src/main.rs (2)
81-84
: Consider stricter error handling for snapshot recovery.
Currently only logs a warning if snapshot recovery fails. If frame snapshots are critical for reliability, re-consider whether to propagate an error or handle partial failures in more detail.
92-97
: Helpful documentation on required capabilities.
This comment block clarifies needed capabilities for launching processes in separate groups. If there's a narrower set of capabilities for your environment, consider updating them to follow the principle of least privilege.crates/rqd/src/frame/manager.rs (1)
191-210
: Threaded frame spawning.
- The separate thread is ideal for blocking work.
- Consider whether to use
tokio::task::spawn_blocking
to integrate with the runtime, enabling more flexible scheduling.crates/rqd/src/system/linux.rs (1)
459-475
: Improved core reservation logic.
- The function checks
available_cores
before insertion.- Decrements
cpu_stat.available_cores
.- Recommend adding logs on successful reservations for better traceability.
```diff
 fn reserve_core(...) {
     if self.cpu_stat.available_cores <= 0 { ... }
+    tracing::debug!("Reserving core {core_id} on physical socket {phys_id}");
     ...
 }
```
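For context, the accounting that `reserve_core` performs can be reduced to a small std-only model; the `CpuStat` here is a simplified stand-in, not the crate's actual struct.

```rust
use std::collections::HashSet;

/// Simplified stand-in for per-socket core accounting.
struct CpuStat {
    available_cores: u32,
    reserved: HashSet<u32>,
}

impl CpuStat {
    /// Reserve one core, failing when none are available
    /// or the core is already taken.
    fn reserve_core(&mut self, core_id: u32) -> Result<(), String> {
        if self.available_cores == 0 {
            return Err("no cores available".into());
        }
        if !self.reserved.insert(core_id) {
            return Err(format!("core {core_id} already reserved"));
        }
        self.available_cores -= 1;
        Ok(())
    }
}

fn main() {
    let mut stat = CpuStat { available_cores: 2, reserved: HashSet::new() };
    assert!(stat.reserve_core(0).is_ok());
    assert!(stat.reserve_core(0).is_err()); // double reservation rejected
    assert!(stat.reserve_core(1).is_ok());
    assert!(stat.reserve_core(2).is_err()); // capacity exhausted
}
```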
crates/rqd/src/system/machine.rs (1)
379-407
:CoreReservation
struct.
- Storing a
start_time
can help with reporting or time-based cleanup.- Using
HashSet<u32>
ensures no duplication.- Consider adding an optional “count” or “capacity” if future memory usage is a concern for large clusters.
📒 Files selected for processing (4)
crates/rqd/src/frame/manager.rs
(1 hunks)crates/rqd/src/main.rs
(3 hunks)crates/rqd/src/system/linux.rs
(10 hunks)crates/rqd/src/system/machine.rs
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: build
🔇 Additional comments (25)
crates/rqd/src/main.rs (6)
7-7
: Looks good importing FrameManager and RunningFrameCache.
No issues found.
11-12
: Consistent logging approach.
ImportingMachineMonitor
andwarn
aligns well with the tracing usage throughout the file.
16-16
: Organizing frames into a dedicated module.
Declaringmod frame;
keeps frame-related logic clearly separated.
44-44
: Initialization clarity.
RunningFrameCache::init()
does not appear to return an error, so no further error handling is needed. Good to keep it simple.
67-74
: Initialization of FrameManager.
This is a concise approach for creating and storing the manager in anArc
. Ensure that shared references (e.g.,mm_clone
) won't lead to circular references if frames can indirectly hold references back tomachine_monitor
.Please confirm no cyclical strong references exist by checking your references in the
FrameManager
andMachineMonitor
. If uncertain, you can run a codebase search forArc::clone
usage outside the lines shown. Let me know if you’d like me to generate a script to verify this.
87-87
: Clean handover to gRPC servant.
Passingframe_manager
intoservant::serve
unifies your new frame-management approach with the existing servant structure.crates/rqd/src/frame/manager.rs (5)
15-19
: Clear struct design forFrameManager
.
The fieldsconfig
,frame_cache
, andmachine
are well-defined. Ensure that external concurrency or state updates on these shared references won’t introduce race conditions in the manager methods.If you want to confirm that concurrency usage is safe, consider running an automated concurrency analysis. Let me know if you’d like a script to do so.
125-190
: Robust snapshot recovery.
- The code properly filters and recovers valid
.bin
snapshots.- If an error arises (e.g., CPU reservation), it logs and accumulates them in
errors
.- The function continues to recover the remaining frames, which is typically desirable for partial success.
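The `.bin` filtering step mentioned above could look roughly like this; a std-only sketch where `list_snapshots` and the directory layout are illustrative, not the crate's actual API.

```rust
use std::{fs, io, path::{Path, PathBuf}};

/// Collect `*.bin` snapshot files from the snapshot directory,
/// ignoring anything else that may live alongside them.
fn list_snapshots(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut snapshots = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) == Some("bin") {
            snapshots.push(path);
        }
    }
    Ok(snapshots)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("rqd_snapshots_demo");
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("frame-a.bin"), b"")?;
    fs::write(dir.join("notes.txt"), b"")?;
    // Only the .bin file qualifies as a snapshot.
    assert_eq!(list_snapshots(&dir)?.len(), 1);
    Ok(())
}
```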
211-248
: Validation checks.
- Thorough checks for duplication, root UID, and fractional cores.
- The code uses early returns to bail out quickly, which increases readability.
250-268
: Machine state validations.
The hardware state and nimby checks are consistent with the rest of the logic. Nimby-locked frames are prevented unlessignore_nimby
is explicitly set.
271-295
: Well-defined error enum.
Mapping each variant to a relevanttonic::Status
ensures consistent gRPC error responses.crates/rqd/src/system/linux.rs (5)
11-11
: Importing miette and re-exporting errors.
Brings better diagnostic info to system-level operations.
18-20
: Extended usage fromsuper::machine
.
Now referencingCoreReservation
and other specialized structs supports your expanded CPU reservation logic.
477-513
: Efficient core availability calculation.
- Excellent usage of
filter_map
andsorted_by
to list the largest available sockets first.- Consider adding logs that mention how many cores remain unreserved per socket for debugging.
579-613
: Bulk core reservation.
- The check at line 580 quickly returns if there aren’t enough resources, preventing partial reservations.
- The approach to flatten available cores and pick the first
count
is straightforward, but it might not always produce an optimal distribution. That is typically acceptable unless there's a performance reason to prefer a different strategy.
615-640
: Selective reservation by ID.
- Reusing
calculate_available_cores()
is consistent.- The loop attempts to reserve each requested core, ignoring nonexistent or already-reserved cores.
- The final
Ok(result)
is straightforward.crates/rqd/src/system/machine.rs (9)
4-4
: Introducedtime::Instant
.
This will help measure how long cores remain allocated.
16-17
: Extended tracing imports andUuid
.
Enables granular logging and resource-specific identification.
21-21
: Usingframe::cache::RunningFrameCache
.
Allows theMachineMonitor
direct access to the list of running frames, bridging LRU or other caching strategies.
181-189
: MachineMonitor startup logic.
These lines revolve around initialhost_state
retrieval and broadcast. The design effectively sets up the initial conditions for new frames and delegates repeated checks to the main loop.
223-228
: Machine traitreserve_cores
adaptation.
Matching new arguments(num_cores, resource_id)
, bridging monitor to system controller.
230-239
:reserve_cores_by_id
bridging.
Mirrors the same approach for selective CPU allocation. Solid for reattaching to partially tracked frames.
241-254
: Iterative release of cores.
- Each core is released individually, ignoring
NotFoundError
with a warning.- This approach ensures partial release success even if some cores were not recognized.
343-355
: Trait-defined core reservation.
UsingUuid
as the resource identifier is a clean way to track the entity that owns each reservation. This extension updates the contract for all implementers.
375-375
: Switched toCoreReservation
inCpuStat
.
Organizing reserved cores with unique owners is beneficial for better debugging and future expansions (like timestamps).
All frames create a snapshot right after being spawned, with the intention of allowing a future recovery in case rqd restarts. At start time, rqd reads all snapshots in the configured directory (`config.runner.snapshots_path`) and attempts to reconnect to all snapshotted frames.

When rqd restarts, all currently running frames become reparented to init/systemd, so the recovery process is no longer able to bind to them and follow their execution directly. To achieve similar behavior in recovery mode, running_frame polls its PID, waiting for it to finish, and recovers the exit status and signal from a file. To make that possible, all frame commands are now appended with logic that saves their exit status to a file.
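As an illustration of the appended exit-status logic described above (the helper name and exact shell fragment are assumptions; rqd's actual command construction may differ):

```rust
/// Append shell logic that persists the command's exit status to a file,
/// so a restarted rqd can recover the outcome even after losing the
/// parent/child relationship to the frame process.
fn wrap_with_exit_file(cmd: &str, exit_file: &str) -> String {
    format!("{cmd}; code=$?; echo $code > {exit_file}; exit $code")
}

fn main() {
    let wrapped = wrap_with_exit_file("render --frame 1", "/tmp/frame.exit_status");
    assert!(wrapped.starts_with("render --frame 1;"));
    assert!(wrapped.contains("echo $code > /tmp/frame.exit_status"));
}
```

On recovery, reading the file back yields the raw status, which the `≥ 128` convention then splits into exit code and signal.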
Summary by CodeRabbit

New Features
- `FrameManager` for managing frame lifecycles and improved error handling during frame operations.

Refactor

Bug Fixes