Adjust instance RAM maximum value for 2 TiB gimlets #7918
It turned out we can't just bump the memory limit up to something much higher than 256 GiB. VMs with 512 GiB of memory or more cause the sled to panic:
I managed to get some 384 GiB VMs to run, but they aren't all stable; e.g., this one dropped to the GRUB menu after I ran sysbench for a while. This is what I see in the serial console at the moment:
The complete dump files for some of the panics can be found at
This has been filed as illumos#17403, with a pending CR to fix it here.
The illumos#17403 fix has been merged upstream. Once it's pulled into helios, we should be in a position to retry running large guests.
As of oxidecomputer/illumos-gate@47b4405, that fix has been merged into stlouis.
The illumos fix allowed instances with >512 GiB of memory to come up. I was able to start 756/896/960 GiB instances consistently. But with 1 TiB instances, there appears to be some sort of timeout or slowness in the control plane workflow. The log below shows such an occurrence after propolis HTTP came up (notably, at 06:34:10.494Z, there was a
Here are the corresponding log lines from nexus:
Looking at the 960 GiB sled-agent log lines, the propolis state change should be rather quick (<0.5 s). The 1 TiB propolis zone simply failed to come up all the way (i.e., instance_watcher wasn't giving up prematurely). Unfortunately, I can't find the corresponding propolis log to see what was actually going on; the zone was uninstalled without invoking zone-bundle collection.
Ok, I got the propolis log this time for another 1 TiB instance, by tailing it directly from outside the zone:
The error it hit was
Filed oxidecomputer/propolis#903 for follow-up.
I found some constants that need to be adjusted upward in order to allow this to work.
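To illustrate the kind of limit involved (a purely hypothetical sketch, not the actual propolis/illumos code; every name and value below is invented for illustration), raising guest memory limits typically means bumping constants that bound the guest-physical address map:

```rust
// Hypothetical illustration only: names and values are invented, not the
// constants referenced above. The idea is that guest RAM has to fit into a
// guest-physical address map whose extent is capped by compile-time limits.
const LOWMEM_LIMIT: u64 = 3 << 30; // guest RAM below the 32-bit MMIO hole
const HIGHMEM_BASE: u64 = 4 << 30; // RAM beyond LOWMEM_LIMIT resumes at 4 GiB
const MAX_GUEST_PHYS_ADDR: u64 = 1 << 40; // 1 TiB cap; "adjusting up" means raising this

/// Returns true if a guest with `ram` bytes fits under the address-space cap.
fn guest_ram_fits(ram: u64) -> bool {
    // Anything beyond LOWMEM_LIMIT is remapped to start at HIGHMEM_BASE.
    let highmem = ram.saturating_sub(LOWMEM_LIMIT);
    HIGHMEM_BASE + highmem <= MAX_GUEST_PHYS_ADDR
}

fn main() {
    assert!(guest_ram_fits(512 << 30)); // 512 GiB fits under a 1 TiB cap
    assert!(!guest_ram_fits(1536 << 30)); // 1.5 TiB would require raising the cap
}
```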
@pfmooney / @iximeow - I am able to run a 1 TiB memory VM (with 64 vCPUs) without problems on dublin using this commit. I ran some sysbench tests (CPU/memory benchmarks) as well as a MySQL benchmark for 12 hours. Things are looking stable to me, so I think the omicron side of the changes can be merged to main.
On Dublin, I made a VM with 1100585369600 bytes (1025 GiB) of memory and it booted up successfully. However, resizing the VM to 1105954078720 bytes (1030 GiB) or above resulted in the following error, as seen on the serial console:
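For reference, the byte values quoted above are exact GiB multiples (a minimal check, assuming GiB means 2^30 bytes):

```rust
// Quick sanity check of the sizes mentioned above (GiB = 2^30 bytes).
fn main() {
    const GIB: u64 = 1 << 30;
    assert_eq!(1025 * GIB, 1_100_585_369_600);
    assert_eq!(1030 * GIB, 1_105_954_078_720);
}
```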
Well, I didn't totally mean to close this, since 2 TiB gimlets could/should allow VM sizes upwards of 1.5 TiB of memory. But as we discovered above and in the notes on #8160, oxidecomputer/propolis#907 needs to get resolved and that updated Propolis needs to get pulled in before we do that; otherwise OVMF won't know how to do PCI and instances won't have disks. Still, 1024 GiB is much more than 256 GiB, so we're 4x better than before!
After #7837, 2 TiB gimlets will have roughly 1667 GiB allocated as VMM reservoir, but the maximum instance RAM is hard-coded at 256 GiB:
omicron/nexus/src/app/mod.rs, line 120 (at d4a64cd)
Customers deploying these gimlets are requesting larger instances than this allows. To support them, Nexus has to be aware that these gimlets have a larger reservoir to allocate from, and offer larger instance sizes accordingly.
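A minimal sketch of the kind of change being requested, assuming the cap lives in a constant along the lines of MAX_MEMORY_BYTES_PER_INSTANCE at the line linked above (the exact name, location, and replacement value would come from that permalink and from whatever policy is chosen for 2 TiB gimlets):

```rust
// Sketch only: the real constant is at the nexus/src/app/mod.rs line linked
// above; the name and new value here are assumptions for illustration.
pub const MAX_MEMORY_BYTES_PER_INSTANCE: u64 = 256 * (1 << 30); // current 256 GiB cap

// Supporting the larger VMM reservoir on 2 TiB gimlets would mean raising the
// cap, e.g. to 1 TiB (the largest size shown to work in the comments above):
// pub const MAX_MEMORY_BYTES_PER_INSTANCE: u64 = 1024 * (1 << 30);
```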