Skip to content

Conversation

gurasinghMS
Copy link
Contributor

@gurasinghMS gurasinghMS commented Oct 16, 2025

This PR adds a new queue fault for the admin queue called verify which is modeled as a Oneshot sender. When the fault controller observes a matching command, it completes the oneshot channel. This indicates to the test that a matching command was observed. The test controller at this point disposes off the sender. Signal is only sent for the first matching command.

This also adds a new vmm test that leverages the Verify functionality to check for AER and GET_LOG_PAGE admin commands after restore is complete. The test waits 10s (max) for each command, failing the test if no command is seen in that time frame.

@gurasinghMS gurasinghMS force-pushed the servicing-with-namespace-change branch from 5058934 to 08e73ea Compare October 16, 2025 16:48
Copy link
Contributor

@mattkur mattkur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still grokking this. Will take another look tomorrow morning. Thanks for this work.

@gurasinghMS gurasinghMS changed the title wip: test keepalive servicing with a namespace change after save nvme_driver: test keepalive servicing with a namespace change after save Oct 20, 2025
@gurasinghMS gurasinghMS marked this pull request as ready for review October 20, 2025 22:15
@gurasinghMS gurasinghMS requested review from a team as code owners October 20, 2025 22:15
@Copilot Copilot AI review requested due to automatic review settings October 20, 2025 22:15
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a test that verifies NVMe keepalive functionality during OpenHCL servicing when a namespace change occurs. The test ensures that the controller properly handles namespace changes during servicing by verifying that appropriate asynchronous event requests (AER) and GET_LOG_PAGE commands are observed after restoration.

Key changes:

  • Adds a new test servicing_keepalive_with_namespace_update that triggers a namespace change during servicing and verifies the controller's response
  • Introduces a Verify variant to QueueFaultBehavior that signals when a matching command is observed
  • Removes Clone trait from QueueFaultBehavior and AdminQueueFaultConfig to support the new verification mechanism

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs Adds test for namespace change during servicing with keepalive enabled, extracts vtl2_nsid constant
vm/devices/storage/nvme_test/src/workers/admin.rs Updates fault handling to use mutable references and adds Verify behavior support
vm/devices/storage/nvme_resources/src/fault.rs Adds Verify variant to QueueFaultBehavior and removes Clone trait

@gurasinghMS gurasinghMS changed the title nvme_driver: test keepalive servicing with a namespace change after save vmm_test: test keepalive servicing with a namespace change after save Oct 20, 2025
vm.restore_openhcl().await?;

let _ = CancelContext::new()
.with_timeout(Duration::from_secs(10))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is reasonable, but you may find that this is flakey if machines are under load. Let's see ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep an eye out for flakiness from this! New test viewer should help :)

@gurasinghMS gurasinghMS merged commit 962a5bd into microsoft:main Oct 21, 2025
54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants