Use mincore(2) to create diff snapshots without dirty page tracking #5274

Open
wants to merge 15 commits into
base: main
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -16,6 +16,10 @@ and this project adheres to
- [#5175](https://github.com/firecracker-microvm/firecracker/pull/5175): Allow
including a custom cpu template directly in the json configuration file passed
to `--config-file` under the `cpu_config` key.
- [#5274](https://github.com/firecracker-microvm/firecracker/pull/5274): Allow
taking diff snapshots even if dirty page tracking is disabled, by using
`mincore(2)` to overapproximate the set of dirty pages. Only works if swap is
disabled.

### Changed

@@ -25,6 +29,10 @@ and this project adheres to

### Deprecated

- [#5274](https://github.com/firecracker-microvm/firecracker/pull/5274):
Deprecated the `enable_diff_snapshots` parameter of the `/snapshot/load` API.
Use `track_dirty_pages` instead.

### Removed

### Fixed
2 changes: 2 additions & 0 deletions DEPRECATED.md
@@ -21,3 +21,5 @@ a future major Firecracker release, in accordance with our
The functionality is substituted with ACPI.
- \[[#2628](https://github.com/firecracker-microvm/firecracker/pull/2628)\] The
`--basic` parameter of `seccompiler-bin`.
- \[[#5274](https://github.com/firecracker-microvm/firecracker/pull/5274)\] The
`enable_diff_snapshots` body field in `PUT` requests on `/snapshot/load`.
124 changes: 62 additions & 62 deletions docs/device-api.md

Large diffs are not rendered by default.

13 changes: 8 additions & 5 deletions docs/hugepages.md
@@ -24,9 +24,7 @@ pool, please refer to the [Linux Documentation][hugetlbfs_docs].
Restoring a Firecracker snapshot of a microVM backed by huge pages will also use
huge pages to back the restored guest. There is no option to flip between
regular, 4K, pages and huge pages at restore time. Furthermore, snapshots of
microVMs backed with huge pages can only be restored via UFFD. Lastly, note that
even for guests backed by huge pages, differential snapshots will always track
write accesses to guest memory at 4K granularity.
microVMs backed with huge pages can only be restored via UFFD.

When restoring snapshots via UFFD, Firecracker will send the configured page
size (in KiB) for each memory region as part of the initial handshake, as
@@ -40,12 +38,17 @@ Firecracker features:

- Memory Ballooning via the [Balloon Device](./ballooning.md)

Furthermore, enabling dirty page tracking for hugepage memory negates the
performance benefits of using huge pages. This is because KVM will
unconditionally establish guest page tables at 4K granularity if dirty page
tracking is enabled, even if the host uses huge mappings.
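
To give a rough sense of scale, the following Rust sketch counts how many mappings are needed to cover a guest at each granularity. It assumes 2 MiB huge pages and a 1 GiB guest; the helper name and constants are illustrative and not part of Firecracker.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Number of mappings needed to cover `mem_bytes` of guest memory when each
/// mapping spans `page_bytes` (rounded up).
/// Hypothetical helper, for illustration only.
fn mappings_needed(mem_bytes: u64, page_bytes: u64) -> u64 {
    (mem_bytes + page_bytes - 1) / page_bytes
}
```

For a 1 GiB guest, `mappings_needed(GIB, 2 * MIB)` is 512, while `mappings_needed(GIB, 4 * KIB)` is 262144, i.e. 512 times as many mappings once dirty page tracking forces 4K granularity.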

## FAQ

### Why does Firecracker not offer a transparent huge pages (THP) setting?

Firecracker's guest memory is memfd based. Linux (as of 6.1) does not offer a
way to dynamically enable THP for such memory regions. Additionally, UFFD does
Firecracker's guest memory can be memfd based. Linux (as of 6.1) does not offer
a way to dynamically enable THP for such memory regions. Additionally, UFFD does
not integrate with THP (no transparent huge pages will be allocated during
userfaulting). Please refer to the [Linux Documentation][thp_docs] for more
information.
116 changes: 55 additions & 61 deletions docs/snapshotting/snapshot-support.md
@@ -122,7 +122,7 @@ the feature can be combined with guest_memfd support in Firecracker.

### Limitations

- High snapshot latency on 5.4+ host kernels due to cgroups V1. We strongly
- High snapshot restoration latency when cgroups V1 are in use. We strongly
recommend to deploy snapshots on cgroups V2 enabled hosts for the implied
kernel versions -
[related issue](https://github.com/firecracker-microvm/firecracker/issues/2129).
@@ -145,10 +145,11 @@ the feature can be combined with guest_memfd support in Firecracker.
resumed from snapshot load memory on-demand from the snapshot and
copy-on-write to anonymous memory.
- Resuming from a snapshot is optimized for speed, while taking a snapshot
involves some extra CPU cycles for synchronously writing dirty memory pages to
the memory snapshot file. Taking a snapshot of a fresh microVM, on which dirty
pages tracking is not enabled, results in the full contents of guest memory
being written to the snapshot.
involves some extra CPU cycles for synchronously writing memory pages to the
memory snapshot file. Taking a full snapshot of a microVM on which dirty page
tracking is not enabled results in the full contents of guest memory being
written to the snapshot and, in particular, in all guest memory being faulted
in.
- The _memory file_ and _microVM state file_ are generated by Firecracker on
snapshot creation. The disk contents are _not_ explicitly flushed to their
backing files.
@@ -207,23 +208,17 @@ the microVM in the `Paused` state. **Effects**:
Now that the microVM is paused, you can create a snapshot, which can be either a
`full` one or a `diff` one. Full snapshots always create a complete, resume-able
snapshot of the current microVM state and memory. Diff snapshots save the
current microVM state and the memory dirtied since the last snapshot (full or
diff). Diff snapshots are not resume-able, but can be merged into a full
snapshot. In this context, we will refer to the base as the first memory file
created by a `/snapshot/create` API call and the layer as a memory file created
by a subsequent `/snapshot/create` API call. The order in which the snapshots
were created matters and they should be merged in the same order in which they
were created. To merge a `diff` snapshot memory file on top of a base, users
should copy its content over the base. This can be done using the `rebase-snap`
(deprecated) or `snapshot-editor` tools provided with the firecracker release:

`rebase-snap` (deprecated) example:

```bash
rebase-snap --base-file path/to/base --diff-file path/to/layer
```

`snapshot-editor` example:
current microVM state and the memory accessed since the last snapshot (full or
diff). The result of a diff snapshot will be a sparse file, with only accessed
pages written (and other ranges becoming holes). Diff snapshots are not
resume-able, but can be merged into a full snapshot. In this context, we will
refer to the base as the first memory file created by a `/snapshot/create` API
call and the layer as a memory file created by a subsequent `/snapshot/create`
API call. The order in which the snapshots were created matters and they should
be merged in the same order in which they were created. To merge a `diff`
snapshot memory file on top of a base, users should copy its content over the
base. This can be done using the `snapshot-editor` tool provided with the
Firecracker release:

```bash
snapshot-editor edit-memory rebase \
@@ -281,9 +276,9 @@ the snapshot. If they exist, the files will be truncated and overwritten.
contents are only guaranteed to be committed/flushed to the host FS, but not
necessarily to the underlying persistent storage (could still live in host
FS cache).
- If diff snapshots were enabled, the snapshot creation resets then the
dirtied page bitmap and marks all pages clean (from a diff snapshot point of
view).
- If dirty page tracking is enabled, snapshot creation then resets the
dirtied page bitmap and marks all pages clean (from a dirty page tracking
point of view).

- _on failure_: no side-effects.

@@ -313,10 +308,23 @@ curl --unix-socket /tmp/firecracker.socket -i \

**Prerequisites**: The microVM is `Paused`.

*Note*: On a fresh microVM, `track_dirty_pages` field should be set to `true`,
when configuring the `/machine-config` resource, while on a snapshot loaded
microVM, `enable_diff_snapshots` from `PUT /snapshot/load`request body, should
be set.
*Note*: Diff snapshots come in two flavors. If `track_dirty_pages` was set to
`true` when configuring the `/machine-config` resource or when restoring from a
snapshot via `/snapshot/load`, Firecracker will use KVM's dirty page logging
functionality to ensure the diff snapshot contains exactly the pages that were
written to since boot / snapshot restoration. If `track_dirty_pages` is not
enabled, Firecracker will instead over-approximate the set of pages to include
in the snapshot by considering all pages that were _accessed_ during the
VM's lifetime. This potentially results in bigger memory files (although they
are still sparse), but avoids the runtime overhead of dirty page logging.

*Note*: Dirty page tracking negates most of the benefits of
[huge pages](../hugepages.md#known-limitations).

Without dirty page tracking enabled, Firecracker uses the
[`mincore(2)`][man mincore] syscall to determine which pages to include in the
snapshot. As such, this mode of snapshot taking will only work _if swap is
disabled_, as mincore does not consider pages written to swap to be "in core".
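
As a rough illustration of the `mincore(2)`-based mode, the Rust sketch below turns a residency vector into the byte ranges that would be written to a sparse memory file. It assumes the vector has already been filled in by `mincore(2)`, which reports one byte per page with the least significant bit set for resident pages. The function names are hypothetical, and this is not Firecracker's actual implementation.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Coalesces a `mincore(2)`-style residency vector (one byte per page, page is
/// resident if the least significant bit is set) into `(byte_offset, byte_len)`
/// runs. Hypothetical helper, for illustration only.
fn resident_ranges(residency: &[u8], page_size: usize) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut run_start: Option<usize> = None;
    for (page, byte) in residency.iter().enumerate() {
        let resident = (*byte & 1) == 1;
        match (resident, run_start) {
            // A new run of resident pages begins here.
            (true, None) => run_start = Some(page),
            // The current run ends just before this non-resident page.
            (false, Some(start)) => {
                ranges.push((start * page_size, (page - start) * page_size));
                run_start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the vector.
    if let Some(start) = run_start {
        ranges.push((start * page_size, (residency.len() - start) * page_size));
    }
    ranges
}

/// Writes only the given ranges of `memory` to `out`, each at its original
/// offset, so that skipped ranges are never written.
fn write_resident<W: Write + Seek>(
    out: &mut W,
    memory: &[u8],
    ranges: &[(usize, usize)],
) -> io::Result<()> {
    for &(offset, len) in ranges {
        out.seek(SeekFrom::Start(offset as u64))?;
        out.write_all(&memory[offset..offset + len])?;
    }
    Ok(())
}
```

When `out` is a regular file, the skipped ranges can remain holes on filesystems that support sparse files. A caller would typically also extend the file to the full guest memory size (for example with `File::set_len`) so that a trailing non-resident range is preserved as a hole.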

**Effects**:

@@ -350,10 +358,12 @@ Enabling this support enables KVM dirty page tracking, so it comes at a cost
(which consists of CPU cycles spent by KVM accounting for dirtied pages); it
should only be used when needed.

Creating a snapshot will **not** influence state, will **not** stop or end the
microVM, it can be used as before, so the microVM can be resumed if you still
want to use it. At this point, in case you plan to continue using the current
microVM, you should make sure to also copy the disk backing files.
Creating a snapshot has some minor effects on the currently running microVM:

- The vsock device is [reset](#vsock-device-reset), causing the driver to
terminate connections on resumption.
- On x86_64, a notification for KVM-clock is injected to notify the guest about
being paused.

### Resuming the microVM

@@ -378,8 +388,8 @@ ignored (microVM remains in the running state). **Effects**:
### Loading snapshots

If you want to load a snapshot, you can do that only **before** the microVM is
configured (the only resources that can be configured prior are the Logger and
the Metrics systems) by sending the following API command:
configured (the only resources that can be configured prior are the logger and
the metrics systems) by sending the following API command:

```bash
curl --unix-socket /tmp/firecracker.socket -i \
@@ -392,7 +402,7 @@ curl --unix-socket /tmp/firecracker.socket -i \
"backend_path": "./mem_file",
"backend_type": "File"
},
"enable_diff_snapshots": true,
"track_dirty_pages": true,
"resume_vm": false
}'
```
@@ -428,7 +438,7 @@ curl --unix-socket /tmp/firecracker.socket -i \
-d '{
"snapshot_path": "./snapshot_file",
"mem_file_path": "./mem_file",
"enable_diff_snapshots": true,
"track_dirty_pages": true,
"resume_vm": false
}'
```
@@ -459,35 +469,17 @@ to the new Firecracker process as they were to the original one.
the guest memory and leads to undefined behavior.
- The file indicated by `snapshot_path`, that is used to load from, is
released and no longer used by this process.
- If `enable_diff_snapshots` is set, then diff snapshots can be taken
afterwards.
- If `track_dirty_pages` is set, subsequent diff snapshots will be based on
KVM dirty page tracking.
- If `resume_vm` is set, the vm is automatically resumed if load is
successful.
- _on failure_: A specific error is reported and then the current Firecracker
process is ended (as it might be in an invalid state).

*Notes*: Please, keep in mind that only by setting to true
`enable_diff_snapshots`, when loading a snapshot, or `track_dirty_pages`, when
configuring the machine on a fresh microVM, you can then create a `diff`
snapshot. Also, `track_dirty_pages` is not saved when creating a snapshot, so
you need to explicitly set `enable_diff_snapshots` when sending
`LoadSnapshot`command if you want to be able to do diff snapshots from a loaded
microVM. Another thing that you should be aware of is the following: if a fresh
microVM can create diff snapshots, then if you create a **full** snapshot, the
memory file contains the whole guest memory, while if you create a **diff** one,
that file is sparse and only contains the guest dirtied pages. With these in
mind, some possible snapshotting scenarios are the following:

- `Boot from a fresh microVM` -> `Pause` -> `Create snapshot` -> `Resume` ->
`Pause` -> `Create snapshot` -> ... ;
- `Boot from a fresh microVM` -> `Pause` -> `Create snapshot` -> `Resume` ->
`Pause` -> `Resume` -> ... -> `Pause` -> `Create snapshot` -> ... ;
- `Load snapshot` -> `Resume` -> `Pause` -> `Create snapshot` -> `Resume` ->
`Pause` -> `Create snapshot` -> ... ;
- `Load snapshot` -> `Resume` -> `Pause` -> `Create snapshot` -> `Resume` ->
`Pause` -> `Resume` -> ... -> `Pause` -> `Create snapshot` -> ... ; where
`Create snapshot` can refer to either a full or a diff snapshot for all the
aforementioned flows.
*Notes*: The `track_dirty_pages` configuration is not saved when creating a
snapshot, so you need to explicitly set `track_dirty_pages` again when sending
the `LoadSnapshot` command if you want diff snapshots of the loaded microVM to
be based on dirty page tracking.
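
The interaction between the deprecated and the new field can be summarized with a small Rust sketch. The struct, enum and function names below are hypothetical simplifications; only the rule that either field enables dirty page tracking, and that `mincore(2)` is used otherwise, is taken from this change.

```rust
/// Hypothetical, simplified view of the dirty-tracking fields in a
/// `PUT /snapshot/load` request body.
#[derive(Default)]
struct LoadFlags {
    /// Deprecated field, still honored for backwards compatibility.
    enable_diff_snapshots: bool,
    /// Replacement field.
    track_dirty_pages: bool,
}

/// How a later diff snapshot determines which pages to write.
#[derive(Debug, PartialEq)]
enum DiffSource {
    /// Exact set of written pages, from KVM dirty page logging.
    KvmDirtyLog,
    /// Over-approximation from `mincore(2)`; requires swap to be disabled.
    Mincore,
}

/// Dirty page tracking is enabled if either field is set.
fn effective_tracking(flags: &LoadFlags) -> bool {
    flags.enable_diff_snapshots || flags.track_dirty_pages
}

/// Picks the page source for diff snapshots taken after the load.
fn diff_source(track_dirty_pages: bool) -> DiffSource {
    if track_dirty_pages {
        DiffSource::KvmDirtyLog
    } else {
        DiffSource::Mincore
    }
}
```

With both fields left at their default of `false`, a subsequent diff snapshot falls back to the `mincore(2)` mode; setting either field switches to dirty page logging.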

It is also worth knowing, a microVM that is restored from snapshot will be
resumed with the guest OS wall-clock continuing from the moment of the snapshot
@@ -632,3 +624,5 @@ the compatibility table reported below:
For example, a snapshot taken on a m6i.metal host running a 5.10 host kernel can
be restored on a different m6i.metal host running a 6.1 host kernel (but not
vice versa), but could not be restored on a c5n.metal host.

[man mincore]: https://man7.org/linux/man-pages/man2/mincore.2.html
3 changes: 3 additions & 0 deletions resources/seccomp/aarch64-unknown-linux-musl.json
@@ -28,6 +28,9 @@
{
"syscall": "write"
},
{
"syscall": "mincore"
},
{
"syscall": "writev",
"comment": "Used by the VirtIO net device to write to tap"
3 changes: 3 additions & 0 deletions resources/seccomp/x86_64-unknown-linux-musl.json
@@ -28,6 +28,9 @@
{
"syscall": "write"
},
{
"syscall": "mincore"
},
{
"syscall": "writev",
"comment": "Used by the VirtIO net device to write to tap"
22 changes: 13 additions & 9 deletions src/firecracker/src/api_server/request/snapshot.rs
@@ -13,7 +13,8 @@ use super::super::parsed_request::{ParsedRequest, RequestError};
use super::super::request::{Body, Method, StatusCode};

/// Deprecation message for the `mem_file_path` and `enable_diff_snapshots` fields.
const LOAD_DEPRECATION_MESSAGE: &str = "PUT /snapshot/load: mem_file_path field is deprecated.";
const LOAD_DEPRECATION_MESSAGE: &str =
"PUT /snapshot/load: mem_file_path and enable_diff_snapshots fields are deprecated.";
/// None of the `mem_backend` or `mem_file_path` fields has been specified.
pub const MISSING_FIELD: &str =
"missing field: either `mem_backend` or `mem_file_path` is required";
@@ -80,7 +81,8 @@ fn parse_put_snapshot_load(body: &Body) -> Result<ParsedRequest, RequestError> {
// Check for the presence of the deprecated `mem_file_path` or
// `enable_diff_snapshots` fields and create a deprecation message if found.
let mut deprecation_message = None;
if snapshot_config.mem_file_path.is_some() {
#[allow(deprecated)]
if snapshot_config.mem_file_path.is_some() || snapshot_config.enable_diff_snapshots {
// `mem_file_path` and `enable_diff_snapshots` fields in request are deprecated.
METRICS.deprecated_api.deprecated_http_api_calls.inc();
deprecation_message = Some(LOAD_DEPRECATION_MESSAGE);
@@ -103,7 +105,9 @@ fn parse_put_snapshot_load(body: &Body) -> Result<ParsedRequest, RequestError> {
let snapshot_params = LoadSnapshotParams {
snapshot_path: snapshot_config.snapshot_path,
mem_backend,
enable_diff_snapshots: snapshot_config.enable_diff_snapshots,
#[allow(deprecated)]
track_dirty_pages: snapshot_config.enable_diff_snapshots
|| snapshot_config.track_dirty_pages,
resume_vm: snapshot_config.resume_vm,
network_overrides: snapshot_config.network_overrides,
};
@@ -180,7 +184,7 @@ mod tests {
backend_path: PathBuf::from("bar"),
backend_type: MemBackendType::File,
},
enable_diff_snapshots: false,
track_dirty_pages: false,
resume_vm: false,
network_overrides: vec![],
};
@@ -202,15 +206,15 @@ mod tests {
"backend_path": "bar",
"backend_type": "File"
},
"enable_diff_snapshots": true
"track_dirty_pages": true
}"#;
let expected_config = LoadSnapshotParams {
snapshot_path: PathBuf::from("foo"),
mem_backend: MemBackendConfig {
backend_path: PathBuf::from("bar"),
backend_type: MemBackendType::File,
},
enable_diff_snapshots: true,
track_dirty_pages: true,
resume_vm: false,
network_overrides: vec![],
};
@@ -240,7 +244,7 @@ mod tests {
backend_path: PathBuf::from("bar"),
backend_type: MemBackendType::Uffd,
},
enable_diff_snapshots: false,
track_dirty_pages: false,
resume_vm: true,
network_overrides: vec![],
};
@@ -276,7 +280,7 @@ mod tests {
backend_path: PathBuf::from("bar"),
backend_type: MemBackendType::Uffd,
},
enable_diff_snapshots: false,
track_dirty_pages: false,
resume_vm: true,
network_overrides: vec![NetworkOverride {
iface_id: String::from("eth0"),
@@ -306,7 +310,7 @@ mod tests {
backend_path: PathBuf::from("bar"),
backend_type: MemBackendType::File,
},
enable_diff_snapshots: false,
track_dirty_pages: false,
resume_vm: true,
network_overrides: vec![],
};
6 changes: 5 additions & 1 deletion src/firecracker/swagger/firecracker.yaml
@@ -1245,7 +1245,11 @@ definitions:
enable_diff_snapshots:
type: boolean
description:
Enable support for incremental (diff) snapshots by tracking dirty guest pages.
(Deprecated) Enable dirty page tracking to improve space efficiency of diff snapshots
track_dirty_pages:
type: boolean
description:
Enable dirty page tracking to improve space efficiency of diff snapshots
mem_file_path:
type: string
description:
2 changes: 1 addition & 1 deletion src/vmm/src/persist.rs
@@ -347,7 +347,7 @@ pub fn restore_from_snapshot(
return Err(SnapshotStateFromFileError::UnknownNetworkDevice.into());
}
}
let track_dirty_pages = params.enable_diff_snapshots;
let track_dirty_pages = params.track_dirty_pages;

let vcpu_count = microvm_state
.vcpu_states