
Use page tracking for snapshot and restore #683


Draft
wants to merge 19 commits into base: main

Conversation

Contributor

@simongdavies commented Jul 1, 2025

This pull request introduces snapshotting and restoring of sandbox state using dirty page tracking rather than copying the entire memory state each time. In many scenarios where sandboxes have larger than default amounts of memory, this results in better performance and reduced memory usage.

However, due to inefficiencies in the way dirty page tracking works on mshv, and the fact that tracking is not yet implemented for Windows, this is not universally true.

Included below are screenshots showing the difference in time taken for some benchmarks with and without dirty page tracking enabled, along with some explanations of the differences seen.

The changes comprise:

Host Shared Memory Dirty Page tracking

  • Added a custom signal handler for SIGSEGV to support dirty page tracking for host memory mapped into a VM. Updated documentation to reflect this change and added debugging instructions for handling SIGSEGV in GDB and LLDB.
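As an illustration of the bookkeeping such a handler needs, here is a minimal sketch (not Hyperlight's actual code) of an atomic dirty-page bitmap. The idea is that pages are write-protected, the SIGSEGV handler marks the faulting page dirty with a single `fetch_or` (which is async-signal-safe), and then re-enables write access; the names `DirtyBitmap` and `mark_dirty` are illustrative assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One bit per page; a word holds the dirty state of 64 pages.
pub struct DirtyBitmap {
    bits: Vec<AtomicU64>,
}

impl DirtyBitmap {
    pub fn new(num_pages: usize) -> Self {
        let words = (num_pages + 63) / 64;
        Self {
            bits: (0..words).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    /// Atomically mark `page` dirty. A single fetch_or is safe to call
    /// from a signal handler, unlike allocating or locking.
    pub fn mark_dirty(&self, page: usize) {
        self.bits[page / 64].fetch_or(1 << (page % 64), Ordering::Relaxed);
    }

    pub fn is_dirty(&self, page: usize) -> bool {
        self.bits[page / 64].load(Ordering::Relaxed) & (1 << (page % 64)) != 0
    }
}

fn main() {
    let bm = DirtyBitmap::new(128);
    bm.mark_dirty(0);
    bm.mark_dirty(70);
    assert!(bm.is_dirty(70) && !bm.is_dirty(1));
    println!("dirty pages tracked");
}
```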

VM Dirty page tracking

  • Enabled dirty page tracking for mshv and KVM drivers to track changes made to memory in the guest

Snapshot Management

  • Added a new snapshot manager module to create, manage and restore memory snapshots.
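The core idea of a page-based snapshot can be sketched as follows; this is a simplified stand-in, not the PR's actual snapshot manager, and `PAGE_SIZE`, `take_snapshot`, and `restore_snapshot` are hypothetical names. Only the dirty pages are copied out, and restoring writes just those pages back.

```rust
const PAGE_SIZE: usize = 4096;

/// A snapshot holds only the dirty pages, not the whole memory image.
struct Snapshot {
    pages: Vec<(usize, Vec<u8>)>, // (page index, page contents)
}

fn take_snapshot(mem: &[u8], dirty_pages: &[usize]) -> Snapshot {
    Snapshot {
        pages: dirty_pages
            .iter()
            .map(|&p| (p, mem[p * PAGE_SIZE..(p + 1) * PAGE_SIZE].to_vec()))
            .collect(),
    }
}

fn restore_snapshot(mem: &mut [u8], snap: &Snapshot) {
    for (p, data) in &snap.pages {
        mem[p * PAGE_SIZE..(p + 1) * PAGE_SIZE].copy_from_slice(data);
    }
}

fn main() {
    let mut mem = vec![0u8; 4 * PAGE_SIZE];
    mem[PAGE_SIZE] = 7; // a write lands in page 1
    let snap = take_snapshot(&mem, &[1]); // only page 1 is dirty
    mem[PAGE_SIZE] = 99; // further modification after the snapshot
    restore_snapshot(&mut mem, &snap);
    assert_eq!(mem[PAGE_SIZE], 7);
    println!("restored");
}
```

The saving over a full copy grows with total memory size: the cost scales with the number of dirty pages rather than the size of the sandbox.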

Benchmarking Enhancements

  • Refactored the guest_call_benchmark_large_param function to support benchmarks for multiple parameter sizes and added a new sandbox_heap_size_benchmark function to measure sandbox creation performance with varying heap sizes.
  • Introduced guest_call_heap_size_benchmark to evaluate guest function call performance with different heap sizes.
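The shape of a heap-size benchmark can be illustrated with a minimal timing loop; this is a sketch only (the real benchmarks presumably use a proper benchmark harness), and `create_sandbox` here is a hypothetical stand-in for sandbox creation.

```rust
use std::time::Instant;

/// Stand-in for creating a sandbox with a given heap size; the real
/// operation maps guest memory and loads the guest binary.
fn create_sandbox(heap_size: usize) -> Vec<u8> {
    vec![0u8; heap_size]
}

fn main() {
    // Measure creation time across a range of heap sizes, as the
    // sandbox_heap_size_benchmark does.
    for heap_size in [1 << 20, 16 << 20, 64 << 20] {
        let start = Instant::now();
        let sb = create_sandbox(heap_size);
        let elapsed = start.elapsed();
        println!("heap {:>9} bytes: {:?} (len {})", heap_size, elapsed, sb.len());
    }
}
```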

Performance Changes

KVM Before

kvm-before

KVM After

kvm-after

KVM shows large performance improvements when creating large sandboxes and when calling functions in sandboxes with large memory configurations; although it was not measured, the amount of memory consumed should also have been reduced considerably. In the scenario where large parameters are passed to the sandbox there is a performance regression. This has not been investigated yet; it may be that the page-based mechanism for saving/restoring data is more expensive than copying and restoring all the data. Other work to make large parameter passing more efficient will likely have a large positive impact here as well.

There is some regression in performance for small/default sandbox sizes, which is most likely caused by the overhead of tracking and building/restoring page-based snapshots for small memory configurations versus simply copying and restoring the entire memory.

mshv2 Before

mshv2-before

mshv2 After

mshv2-after

mshv3 Before

mshv3-before

mshv3 After

mshv-after

Both mshv2 and mshv3 show similar patterns to KVM: larger sandbox sizes show large improvements, while default/small sandboxes and large parameter sizes show regressions, possibly for the same reasons as on KVM. Again, no investigation has been done here yet.

The biggest difference between KVM and mshv has two causes. First, when enabling dirty page tracking on mshv, the first call to get dirty pages returns a bitmap showing all pages dirty. Since this would cause us to snapshot all memory as a baseline after enabling dirty page tracking, this PR gets the dirty pages immediately and then discards the result. This means we make two calls to get dirty pages when we really only need one. Second, the call to get dirty pages appears to be O(n), where n is the number of pages in the memory configuration (regardless of whether the pages are dirty or not), so the impact is quite large with bigger memory configurations: for a 950MB VM I observed ~1.4ms per call. Even so, with larger sandboxes this approach is still much quicker than copying all the memory. Fixing these issues in mshv will probably bring the performance much closer to KVM.
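The fetch-and-discard baseline pattern described above can be demonstrated with a mock; `MockTracker` and its methods are hypothetical and only imitate the mshv behaviour of reporting every page dirty on the first query after tracking is enabled.

```rust
/// Mock of a tracker whose first get-and-clear reports all pages dirty,
/// as observed on mshv after enabling dirty page tracking.
struct MockTracker {
    num_pages: usize,
    first_call: bool,
    dirty: Vec<usize>,
}

impl MockTracker {
    fn new(num_pages: usize) -> Self {
        Self { num_pages, first_call: true, dirty: Vec::new() }
    }

    fn write(&mut self, page: usize) {
        self.dirty.push(page);
    }

    /// mshv-like behaviour: everything is reported dirty on the first call.
    fn get_and_clear_dirty_pages(&mut self) -> Vec<usize> {
        if self.first_call {
            self.first_call = false;
            return (0..self.num_pages).collect();
        }
        std::mem::take(&mut self.dirty)
    }
}

fn main() {
    let mut t = MockTracker::new(1024);
    // Workaround: fetch once and discard the all-dirty baseline.
    let _ = t.get_and_clear_dirty_pages();
    t.write(3);
    // Subsequent calls now reflect only real guest writes.
    assert_eq!(t.get_and_clear_dirty_pages(), vec![3]);
    println!("baseline discarded");
}
```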

Windows 2025 Before

windows-before

Windows 2025 After

windows-after

Windows performance is largely either unchanged or has regressed. This is because the Windows implementation has not been done yet: at the moment, each time dirty pages are requested, Windows reports that all pages are dirty, and snapshots/restores are done on that basis. There is some overhead to this approach, especially when restoring.

@simongdavies simongdavies added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Jul 1, 2025
@simongdavies simongdavies linked an issue Jul 1, 2025 that may be closed by this pull request
@simongdavies simongdavies force-pushed the update-snapshot-and-restore branch 5 times, most recently from 7e85120 to 014eab3 Compare July 3, 2025 15:48
@simongdavies simongdavies changed the title WIP: Use page tracking for snapshot and restore Use page tracking for snapshot and restore Jul 3, 2025
@simongdavies simongdavies force-pushed the update-snapshot-and-restore branch from 014eab3 to 56fd983 Compare July 3, 2025 16:43
@ludfjig (Contributor) left a comment


Looks mostly good to me. The signal handling of SIGSEGV makes me a little nervous, and there seems to be an assumption that Vec<MemoryRegion> is always consecutive in memory, which I think we might break in the future. I also think a refactor of our memory management code would be nice 😅.

I also remember I had a bunch of tests for evolve/devolve in my previous dirty-pages PR that could maybe be useful for testing this logic. Those tests were the ones that allowed me to catch the bug in mshv's implementation :P

Also, these regressions seem pretty big (compared to the main branch).


for page_idx in bit_index_iterator(&bitmap) {
page_indices.push(current_page + page_idx);
}
current_page += num_pages;
I think we are making an assumption here that all memory regions are consecutive with no gaps. I think Vec<MemoryRegion> has this invariant, but I'm not sure it's documented anywhere.
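For reference, a minimal `bit_index_iterator` consistent with the snippet above might look like this: it yields the index of every set bit in a slice of `u64` words (one bit per page). This is a sketch only; the PR's actual implementation may differ.

```rust
/// Yield the indices of all set bits in `bitmap`, in increasing order.
/// Word i covers pages [i * 64, i * 64 + 63].
fn bit_index_iterator(bitmap: &[u64]) -> impl Iterator<Item = usize> + '_ {
    bitmap.iter().enumerate().flat_map(|(word, &bits)| {
        (0..64usize)
            .filter(move |bit| bits & (1u64 << bit) != 0)
            .map(move |bit| word * 64 + bit)
    })
}

fn main() {
    // Bit 1 of word 0 and bit 0 of word 1 are set -> pages 1 and 64.
    let bitmap = [0b10u64, 0b1u64];
    let pages: Vec<usize> = bit_index_iterator(&bitmap).collect();
    assert_eq!(pages, vec![1, 64]);
    println!("{:?}", pages);
}
```

Note that an iterator of this shape produces indices in increasing order by construction, which is relevant to the review comments below about redundant sorts.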

shared_mem.with_exclusivity(|e| e.copy_to_slice(&mut buffer, 0))??;
} else {
// Sort pages for deterministic ordering and to enable consecutive page optimization
dirty_pages.sort_unstable();
I believe this sort is redundant
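The "consecutive page optimization" the snippet's comment refers to can be sketched as grouping sorted page indices into `(start, len)` runs, so each run is copied with one slice operation instead of one copy per page. This is an illustrative sketch; `consecutive_runs` is a hypothetical name, not the PR's code.

```rust
/// Group sorted page indices into (start, len) runs of consecutive pages.
fn consecutive_runs(sorted_pages: &[usize]) -> Vec<(usize, usize)> {
    let mut runs: Vec<(usize, usize)> = Vec::new();
    for &p in sorted_pages {
        match runs.last_mut() {
            // Extend the current run if this page follows it directly.
            Some((start, len)) if *start + *len == p => *len += 1,
            // Otherwise start a new run.
            _ => runs.push((p, 1)),
        }
    }
    runs
}

fn main() {
    let runs = consecutive_runs(&[0, 1, 2, 5, 6, 9]);
    assert_eq!(runs, vec![(0, 3), (5, 2), (9, 1)]);
    println!("{:?}", runs);
}
```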


// Collect dirty pages and sort them for consecutive page optimization
let mut dirty_pages: Vec<usize> = bit_index_iterator(dirty_bitmap).collect();
dirty_pages.sort_unstable();
I believe this sort is redundant

pub(super) fn create_new_snapshot<S: SharedMemory>(
&mut self,
shared_mem: &mut S,
dirty_page_map: Option<&Vec<u64>>,
I think(?) this option can also be removed

@@ -606,6 +607,12 @@ impl Hypervisor for HypervWindowsDriver {
Ok(())
}

fn get_and_clear_dirty_pages(&mut self) -> Result<Vec<u64>> {
// For now we just mark all pages dirty which is the equivalent of taking a full snapshot
let total_size = self.mem_regions.iter().map(|r| r.guest_region.len()).sum();
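A sketch of what the placeholder above implies: build a bitmap with every page marked dirty, which is equivalent to taking a full snapshot. The 4KiB `PAGE_SIZE` and the exact bitmap layout (64 pages per `u64` word) are assumptions for illustration.

```rust
const PAGE_SIZE: usize = 4096;

/// Return a dirty-page bitmap covering `total_size` bytes with every
/// page marked dirty, i.e. the "full snapshot" placeholder behaviour.
fn all_dirty_bitmap(total_size: usize) -> Vec<u64> {
    let num_pages = (total_size + PAGE_SIZE - 1) / PAGE_SIZE;
    let num_words = (num_pages + 63) / 64;
    let mut bitmap = vec![u64::MAX; num_words];
    // Clear any bits past the last page in the final word.
    let rem = num_pages % 64;
    if rem != 0 {
        bitmap[num_words - 1] = (1u64 << rem) - 1;
    }
    bitmap
}

fn main() {
    let bm = all_dirty_bitmap(70 * PAGE_SIZE); // 70 pages -> 2 words
    assert_eq!(bm.len(), 2);
    assert_eq!(bm[0], u64::MAX);
    assert_eq!(bm[1], (1u64 << 6) - 1); // only 6 bits set in the last word
    println!("ok");
}
```

This explains the overhead noted in the PR description: every snapshot and restore on Windows touches all pages until real tracking is implemented.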

@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch 3 times, most recently from 3838d0a to 411a668 Compare July 8, 2025 18:36
Signed-off-by: Simon Davies <[email protected]>
- Implemented `LinuxDirtyPageTracker` for tracking dirty pages in Linux.
  - Utilizes SIGSEGV to detect writes and marks pages as dirty.
  - Supports concurrent access and ensures memory protection.
  - Includes comprehensive tests for various scenarios including overlap detection and concurrent writes.

- Added `WindowsDirtyPageTracker` as a placeholder for Windows dirty page tracking.
  - Currently marks all pages as dirty until further implementation is completed.
  - Includes basic structure and initialization logic.

Signed-off-by: Simon Davies <[email protected]>
…ge tracking after memory is allocated and stopping it once the uninitialized sandbox is evolved

Signed-off-by: Simon Davies <[email protected]>
… offset methods to be public

Signed-off-by: Simon Davies <[email protected]>
…managing memory snapshots and update related methods for improved state management

Signed-off-by: Simon Davies <[email protected]>
… shared_memory_snapshot_manager for improved snapshot management

Signed-off-by: Simon Davies <[email protected]>
@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch from 411a668 to 1231cc6 Compare July 8, 2025 20:59
@ludfjig ludfjig force-pushed the update-snapshot-and-restore branch from 1231cc6 to 9d4d90a Compare July 8, 2025 21:08
Labels
kind/enhancement For PRs adding features, improving functionality, docs, tests, etc.
Development

Successfully merging this pull request may close these issues.

Improve performance when creating "larger" Sandboxes
2 participants