Dynamic side metadata base address#1449

Open
qinsoon wants to merge 28 commits into mmtk:master from qinsoon:dynamic-side-metadata-address
Conversation

@qinsoon
Member

@qinsoon qinsoon commented Feb 19, 2026

This addresses part of the issues with #1351. This PR allows side metadata to be mmapped dynamically, and also allows side metadata to use a fixed address specified in the option side_metadata_base_address.

Use a runtime-mapped base for side metadata and make offsets relative.
Replace OnceLock with a fast static base and shared init, update address math,
and adjust tests/spec layout and docs accordingly.

Also add mmap-noreserve-anywhere support and guard MSRV pin in CI scripts.
Avoid 32-bit overflow in mmapper range limit, adjust side metadata
sanity expectations, and use small 32-bit test addresses for contiguous
conversion tests. Also constrain mmap annotation handling on 32-bit.
Verify the runtime side metadata base is initialized, aligned, and
that global metadata addresses fall within the reserved range. Check
64-bit local base offset consistency.
@qinsoon
Member Author

qinsoon commented Feb 19, 2026

I used cargo bench to quickly test the performance.

side_metadata_address_to_meta_address() is 10x slower. With a constant side metadata base address, this function can be completely optimized away by constant folding, but with a dynamically mapped base address, it has to be computed at run time.

SideMetadata::load() is roughly 3x slower.
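To illustrate why constant folding matters here, the two schemes can be sketched as follows. The base address and the 1-metadata-bit-per-64-data-bytes ratio are made-up values for illustration, not the actual mmtk-core layout:

```rust
// Sketch contrasting the two address-translation schemes.
// CONST_BASE and the shift amount are hypothetical values.
const CONST_BASE: usize = 0x0000_0600_0000_0000;
const LOG_DATA_BYTES_PER_META_BYTE: usize = 6;

// With a constant base, this folds into `(data >> 6) + immediate`.
fn meta_addr_const(data: usize) -> usize {
    CONST_BASE + (data >> LOG_DATA_BYTES_PER_META_BYTE)
}

// With a dynamically mapped base, the base must be loaded at run time
// before the same arithmetic can happen.
fn meta_addr_dynamic(base: usize, data: usize) -> usize {
    base + (data >> LOG_DATA_BYTES_PER_META_BYTE)
}

fn main() {
    let runtime_base = 0x0000_0600_0000_0000_usize; // pretend this was mmapped
    let data = 0x0000_0200_0000_1000_usize;
    assert_eq!(meta_addr_const(data), meta_addr_dynamic(runtime_base, data));
}
```

Both functions compute the same address; the difference is only whether the base is an immediate in the generated code or a load from memory.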

=== This PR ===
side_metadata_address_translation       time:   [4.7127 µs 4.7142 µs 4.7163 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
side_metadata_load      time:   [8.8293 µs 8.8310 µs 8.8328 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild

=== master ===
  side_metadata_address_translation
                        time:   [505.57 ns 505.73 ns 505.88 ns]
                        change: [−89.293% −89.285% −89.276%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  6 (6.00%) low mild
  2 (2.00%) high mild

side_metadata_load      time:   [2.6913 µs 2.7012 µs 2.7121 µs]
                        change: [−69.579% −69.518% −69.453%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  19 (19.00%) high severe

I will run dacapo benchmarks.

@qinsoon
Member Author

qinsoon commented Feb 22, 2026

The following is the performance for Immix using 3x G1 min heap:
https://squirrel.anu.edu.au/plotty/yilin/mmtk/#0|shrew-2026-02-20-Fri-004537&benchmark^build^invocation^iteration&time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark&build;jdk-21-constant-side-metadata|41&Histogram%20(with%20CI)^build^benchmark&

Immix only uses side metadata during GC. The mean STW time shows a 3.2% slowdown, and the worst case is around an 11% STW time slowdown (biojava).

I probably should measure a generational plan which uses log bits side metadata during mutator time.

@qinsoon
Member Author

qinsoon commented Feb 23, 2026

This is the result for GenImmix (also using 3x G1 min heap):
https://squirrel.anu.edu.au/plotty/yilin/mmtk/#0|shrew-2026-02-20-Fri-004537&benchmark^build^invocation^iteration&time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark&build;jdk-21-constant-side-metadata|41&Histogram%20(with%20CI)^build^benchmark&

Generally it showed no slowdown for mutator time. The reason is that there is little change to the JIT'd code. The only difference is that at JIT time, we need to load the side metadata address from a variable instead of a constant -- but this only affects JIT time, not run time. See https://github.com/mmtk/mmtk-openjdk/pull/343/changes#diff-27141e7f6636a2ef36ac24f81d7025e231f1b091660a28026edba98de5c1156a.

It also showed no slowdown for STW time. I think the reason is that most GCs are nursery GCs (CopySpace), and CopySpace only uses side log bits (forwarding bits/pointers are header metadata). So compared to full heap Immix (which uses a lot of side metadata), generational GCs are less affected by this PR.

@qinsoon
Member Author

qinsoon commented Feb 23, 2026

I will further look into the micro benchmarks and the Immix performance.

}

/// Performs the translation of data address (`data_addr`) to metadata address for the specified metadata (`metadata_spec`).
#[inline(always)]
Member Author

This #[inline(always)] directive is necessary. With this inlined, the microbenchmark shows no slowdown on loading side metadata.

side_metadata_load      time:   [2.6933 µs 2.6940 µs 2.6948 µs]
                        change: [−68.453% −68.423% −68.397%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild

I am running Immix again to see if the inlining fixes the slowdown.

Note that I use PGO when building OpenJDK. If the manual inline directive fixes the slowdown for Immix, that means somehow PGO does not inline the key function for side metadata address calculation.

Collaborator

... With this inlined, the microbenchmark shows no slowdown...

That's right. The compiler can usually figure out what to inline in real-world VM bindings, but for microbenchmarks (cargo bench), manual inlining matters. I have observed this before, and added some annotations in src/util/test_private/mod.rs. I think you can try adding wrappers in the test_private mod that have #[inline(always)] and keep side_metadata/helpers.rs free of inlining annotations. If manual inlining is still necessary for the OpenJDK binding, we can keep #[inline(always)] in helpers.rs.
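The suggested wrapper pattern could look roughly like this (a sketch; the helper names and the shift amount are illustrative, only the test_private module name comes from the suggestion above):

```rust
// Sketch of the suggested pattern: keep the real helper free of inline
// annotations, and expose an #[inline(always)] wrapper for benchmarks.
// Function names and the arithmetic are illustrative, not mmtk-core APIs.
mod helpers {
    // No inlining annotation; real-world callers let the compiler
    // (or PGO) decide whether to inline.
    pub fn address_to_meta_address(base: usize, data: usize) -> usize {
        base + (data >> 6)
    }
}

mod test_private {
    /// Benchmark-only wrapper that forces inlining so `cargo bench`
    /// measures the address arithmetic rather than a function call.
    #[inline(always)]
    pub fn address_to_meta_address(base: usize, data: usize) -> usize {
        super::helpers::address_to_meta_address(base, data)
    }
}

fn main() {
    // Benchmarks would call the wrapper; production code calls helpers.
    assert_eq!(test_private::address_to_meta_address(0x6000, 0x40), 0x6001);
}
```

This keeps the inlining pressure out of helpers.rs while still letting microbenchmarks measure the inlined fast path.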

@qinsoon
Member Author

qinsoon commented Feb 24, 2026

binding-refs
OPENJDK21_BINDING_REPO=qinsoon/mmtk-openjdk
OPENJDK21_BINDING_REF=dynamic-side-metadata-address
OPENJDK11_BINDING_REPO=qinsoon/mmtk-openjdk
OPENJDK11_BINDING_REF=dynamic-side-metadata-address-11
JULIA_BINDING_REPO=qinsoon/mmtk-julia
JULIA_BINDING_REF=dynamic-side-metadata-address
JULIA_VM_REPO=qinsoon/julia
JULIA_VM_REF=no-const-side-metadata-address
RUBY_BINDING_REPO=qinsoon/mmtk-ruby
RUBY_BINDING_REF=dynamic-side-metadata-address
JIKESRVM_BINDING_REPO=qinsoon/mmtk-jikesrvm
JIKESRVM_BINDING_REF=dynamic-side-metadata-address

@qinsoon qinsoon force-pushed the dynamic-side-metadata-address branch from cdd59e9 to b4c4539 Compare February 24, 2026 04:29
// (starting from zero) and add the runtime base address when computing actual addresses.
pub(crate) const GLOBAL_SIDE_METADATA_BASE_OFFSET: usize = 0;

static mut SIDE_METADATA_BASE_ADDRESS: Address = Address::ZERO;
Member Author


I replaced this static mut with a OnceLock, and got the following results on biojava which showed no slowdown for this PR.
https://squirrel.anu.edu.au/plotty/yilin/mmtk/#0|shrew-2026-02-26-Thu-071955&benchmark^build^invocation^iteration^stickyix&time^time.other^time.stw&|10&iteration^1^4&stickyix^1^null|20&1^invocation|30&1&benchmark&build;jdk-21-constant-side-metadata

If we use static mut for the base address, the compiled code has to reload the variable every time we access side metadata. I inspected the assembly code of ImmixSpace::trace_object, and in around 100 instructions, there are 3 extra loads for this PR to get the side metadata address.

Intuitively, OnceLock should be slower, as get() always checks whether the value is initialized before loading it. However, with oncelock.get().unwrap_unchecked(), the compiler can apparently elide the check and the branch, and since the value cannot change after initialization, it does not need to be repeatedly reloaded.

Using OnceLock seems to fix the slowdown on biojava, and I am running the experiments again for all the benchmarks.
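The pattern described above can be sketched as follows (illustrative names; a static OnceLock set once at startup and read on the hot path through unwrap_unchecked):

```rust
use std::sync::OnceLock;

// Sketch of the pattern: a OnceLock holding the side-metadata base,
// initialized once and read via unwrap_unchecked() on the hot path.
// The names here are illustrative, not the actual mmtk-core code.
static SIDE_METADATA_BASE: OnceLock<usize> = OnceLock::new();

fn init_base(base: usize) {
    // Ignore the error if the base was already set; the first write wins.
    let _ = SIDE_METADATA_BASE.set(base);
}

#[inline(always)]
fn base() -> usize {
    // SAFETY: callers must guarantee init_base() has run first. With the
    // initialization check elided, the compiler can treat the value as
    // immutable and hoist the load out of loops.
    unsafe { *SIDE_METADATA_BASE.get().unwrap_unchecked() }
}

fn main() {
    init_base(0x0000_0600_0000_0000);
    assert_eq!(base(), 0x0000_0600_0000_0000);
}
```

The key difference from `static mut` is that the value is provably write-once, which is what lets the optimizer stop reloading it.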

Member Author


I will still need to investigate the slowdown.

@qinsoon qinsoon added the PR-extended-testing Run extended tests for the pull request label Mar 6, 2026
Comment on lines +217 to +220
// Initialize side metadata sanity first
plan.verify_side_metadata_sanity();
// Then initialize SFT because it may use side metadata
plan.initialize_sft();
Member Author


These two lines use side metadata, so they have to happen after side metadata is initialized. They used to happen in Plan::new(). I extracted them because, at some point, initialize_side_metadata() was called after Plan::new(), so these two lines had to be moved after both Plan::new() and initialize_side_metadata().

Now that initialize_side_metadata() is called before Plan::new(), it is no longer strictly necessary to keep these two lines here, but I think it is still clearer this way.

@qinsoon qinsoon marked this pull request as ready for review March 9, 2026 02:26
Collaborator

@wks wks left a comment


I think when quarantining memory, the mmap strategy is always the same. As I suggested in the comments, we may remove the strategy arguments and use a fixed strategy for quarantining.

let addr = unix_common::mmap_anywhere(
size,
align,
strategy.prot(MmapProtection::NoAccess).reserve(false),
Collaborator


I think .prot(MmapProtection::NoAccess).reserve(false) should be orthogonal to dzmmap_anywhere. It is ChunkStateMmapper::quarantine_address_range_anywhere that requires the mapping to be "no access" and "no reserve", because it is "quarantine".

);
let base = if specified_base.is_zero() {
MMAPPER
.quarantine_address_range_anywhere(pages, MmapStrategy::SIDE_METADATA, &anno)
Collaborator


We probably don't need to pass MmapStrategy::SIDE_METADATA to quarantine_address_range_anywhere because when quarantining, it must have MmapProtection::NoAccess and reserve: false.

fn quarantine_address_range_anywhere(
&self,
pages: usize,
strategy: MmapStrategy,
Collaborator


We don't need the strategy to be passed in. Like the following excerpt from quarantine_address_range, we can construct the strategy locally:

                        let mmap_strategy = MmapStrategy::default()
                            .huge_page(huge_page_option)
                            .prot(MmapProtection::NoAccess)
                            .reserve(false)
                            .replace(false);
                        OS::dzmmap(group_start, group_bytes, mmap_strategy, anno)?;

We can probably add a constant in impl MmapStrategy:

                        pub const QUARANTINE: MmapStrategy = MmapStrategy::default()
                            .prot(MmapProtection::NoAccess)
                            .reserve(false)
                            .replace(false);

so that we can directly use MmapStrategy::QUARANTINE.huge_page(huge_page_option).

}

let start = Address::from_mut_ptr(ptr);
Ok(start.align_up(align))
Collaborator


If we align up the start, the returned value is no longer the starting address of the mmap, and subsequent munmap may have problems. For example, ChunkStateMmapper::quarantine_address_range_anywhere will try to call munmap with the returned address from mmap, and leave memory still mapped at the beginning and the end.

If the mmap is successful, we should either

  • immediately munmap the extraneous memory range in the beginning and in the end to make the resulting mmap exactly start...(start+aligned_size), or
  • return the actual starting address and the alloc_size, and let the caller do the munmap.
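The first option amounts to over-mapping by align bytes and then trimming both ends; the bookkeeping can be sketched with plain arithmetic (hypothetical helper, no actual mmap/munmap calls):

```rust
// Sketch of trimming an over-aligned mmap. In the real code, `head` and
// `tail` would each be passed to munmap; here we only do the arithmetic.
fn align_up(addr: usize, align: usize) -> usize {
    (addr + align - 1) & !(align - 1)
}

/// Given the raw mmap result `start` for an allocation of `size + align`
/// bytes, return the aligned start plus the leading/trailing slack to unmap.
fn trim(start: usize, size: usize, align: usize) -> (usize, usize, usize) {
    let aligned = align_up(start, align);
    let head = aligned - start; // bytes to munmap before `aligned`
    let tail = (start + size + align) - (aligned + size); // bytes after the end
    (aligned, head, tail)
}

fn main() {
    let (aligned, head, tail) = trim(0x1000_0800, 0x4000, 0x1000);
    assert_eq!(aligned, 0x1000_1000);
    assert_eq!(head, 0x800);
    assert_eq!(tail, 0x800);
    // head + size + tail always covers the whole over-mapped region.
    assert_eq!(head + 0x4000 + tail, 0x4000 + 0x1000);
}
```

Either way, the invariant is that every byte returned by mmap is eventually either part of the aligned range or explicitly unmapped, so a later munmap of `aligned..aligned+size` leaves nothing behind.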

@wks
Collaborator

wks commented Mar 9, 2026

A high-level comment: OnceLock has a method OnceLock::get_unchecked() that can bypass the intermediate Option<&T> from OnceLock::get(). I haven't tried whether it is faster than l.get().unwrap_unchecked(), but it is worth trying.
