CRASH aarch64 under QEMU at proc_init mrs x0, ID_AA64MMFR2_EL1 #7315

Open
derekbruening opened this issue Mar 2, 2025 · 5 comments

@derekbruening
Contributor

On Ubuntu22 (where we have to migrate: #7270), every aarch64 test run under QEMU hits a SIGILL from the final MRS instruction in proc_init: mrs x0, ID_AA64MMFR2_EL1.

This shows up as:

qemu: uncaught target signal 4 (Illegal instruction) - core dumped

DR only intercepts SIGSEGV and SIGBUS at init time. I'd like to intercept SIGILL as well and wrap these MRS instructions in a try-except, but that's tricky because we use SIGILL for nudges and for suspending threads later on. I may try to add init-time-only try-except handling.

This is blocking us from moving to Ubuntu22 on Github Actions as mandated: #7270.

@derekbruening derekbruening self-assigned this Mar 2, 2025
derekbruening added a commit that referenced this issue Mar 2, 2025
Under QEMU, proc_init()'s MRS of ID_AA64ISAR2_EL1 causes a fatal SIGILL.
We avoid it when under QEMU.

We'd prefer to use TRY_EXCEPT_ALLOW_NO_DCONTEXT, but proc_init() is
called prior to init-time signal handling being set up: and we'd need
to add SIGILL to the ones caught at init time, which complicates later
uses of SIGILL for NUDGESIG_SIGNUM and suspend_signum (and on x86
XSTATE_QUERY_SIG): so we'd want SIGILL to only work for try-except at
init time. That is all a little too involved to implement right now.

Tested on a local Ubuntu22 x86 machine with aarch64 cross-compilation
where every test failed before this fix and nearly all pass now (the
others fail for other reasons masked by this SIGILL before).

Fixes #7315
@pm215

pm215 commented Mar 3, 2025

Emulation of ID_AA64MMFR2_EL1 in qemu-aarch64 should be supported from QEMU version 8.0.0 and later (QEMU commit bc6bd20ee353834). Unfortunately the QEMU in Ubuntu 22.04 is only QEMU 6.2, so it doesn't have that.

PS: in your pullreq the log message quotes the wrong register name: "Skipping MRS of ID_AA64ISAR2_EL1 under QEMU".

@abhinav92003
Contributor

Emulation of ID_AA64MMFR2_EL1 in qemu-aarch64 should be supported from QEMU version 8.0.0 and later (QEMU commit bc6bd20ee353834). Unfortunately the QEMU in Ubuntu 22.04 is only QEMU 6.2, so it doesn't have that.

What's the QEMU version in Ubuntu20?

Since this issue was created for Ubuntu22, I assume the mrs worked fine under QEMU on Ubuntu20 (the current OS in our test CI workflows). Or was the code path skipped for some reason?

@pm215

pm215 commented Mar 3, 2025

Very old versions of QEMU don't implement the "emulate the kernel's trap-and-emulate of ID register accesses" at all. It looks like dynamorio has support for "check whether ID_AA64ISAR0_EL1 SIGILLs on access" in mrs_id_reg_supported(), so a really old QEMU will be OK because it'll take the same "handle a SIGILL there" path as it would under an old host kernel that doesn't expose HWCAP_CPUID.

(Could we use the same mechanism for ID_AA64MMFR2_EL1 that we do for handling "ID_AA64ISAR0_EL1 might SIGILL" in mrs_id_reg_supported(), by the way? That would have the advantage that you could automatically do the right thing on QEMU 8.0 and later where ID_AA64MMFR2_EL1 is emulated and you might want to look at the field values in it.

Edit: oh, FIXME i#5474 says "we don't actually catch the SIGILL", so in fact on old host kernels this would just crash. If dynamorio is in a position to look at the ELF hwcaps you could alternatively check for HWCAP_CPUID before reading the regs instead.)

Ubuntu 20.04 had QEMU 4.2, though, which is after QEMU introduced support for ID register trap emulation (which came with QEMU commit 37020ff15398 in version 4.0.0), so if we didn't see this problem with Ubuntu 20.04's QEMU then something else must be going on.
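The HWCAP_CPUID check suggested above could look roughly like the following. This is a minimal sketch, not DynamoRIO code: the helper names hwcap_has_cpuid() and id_regs_readable_from_userspace() are hypothetical, and the fallback definition of HWCAP_CPUID (bit 11 of AT_HWCAP on arm64 Linux) is only there so the sketch compiles on non-AArch64 hosts. It assumes a Linux libc that provides getauxval().

```c
#include <stdbool.h>
#include <sys/auxv.h> /* getauxval(), AT_HWCAP (Linux-specific) */

/* On arm64 Linux the headers define HWCAP_CPUID; this fallback is an
 * assumption that lets the sketch build on other architectures. */
#ifndef HWCAP_CPUID
#    define HWCAP_CPUID (1UL << 11)
#endif

/* Hypothetical helper: tests the CPUID bit in a raw AT_HWCAP value. */
static bool
hwcap_has_cpuid(unsigned long hwcap)
{
    return (hwcap & HWCAP_CPUID) != 0;
}

/* Hypothetical helper: returns true if the kernel advertises that EL0
 * reads of the ID registers are trapped and emulated. Always false when
 * not built for AArch64 Linux. */
static bool
id_regs_readable_from_userspace(void)
{
#if defined(__aarch64__) && defined(__linux__)
    return hwcap_has_cpuid(getauxval(AT_HWCAP));
#else
    return false;
#endif
}
```

A caller would skip every ID register MRS when id_regs_readable_from_userspace() returns false. Note that this only covers the old-host-kernel case: the thread does not establish that it would catch a QEMU that emulates some ID registers but not ID_AA64MMFR2_EL1, so a per-register guard may still be needed.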

@abhinav92003
Contributor

Ubuntu 20.04 had QEMU 4.2, though, which is after QEMU introduced support for ID register trap emulation (which came with QEMU commit 37020ff15398 in version 4.0.0), so if we didn't see this problem with Ubuntu 20.04's QEMU then something else must be going on.

Had mrs_id_reg_supported() returned false and skipped read_feature_regs at the "if (!mrs_id_reg_supported()) {" check, there should have been an "MRS instruction unsupported" message in the Ubuntu20 QEMU logs (e.g., https://github.com/DynamoRIO/dynamorio/actions/runs/13550324021/job/37872137227), but it's not there (or maybe we're skipping printing some logs in that test output).

Also, we know that mrs_id_reg_supported did not crash in the above example. So presumably the mrs for ID_AA64MMFR2_EL1 worked in Ubuntu20 QEMU. But we don't know why yet.

@derekbruening
Contributor Author

I'm thinking let's leave this open for a better solution than just disabling this MRS for xarch_root. I think adding a try-except for SIGILL in proc_init() is the best solution, since the MRS would then still run on any QEMU that supports it.

derekbruening added a commit that referenced this issue Mar 3, 2025
Under QEMU, proc_init()'s MRS of ID_AA64MMFR2_EL1 causes a fatal SIGILL.
We avoid it when under QEMU.
(Note that this did not fail on Ubuntu20's QEMU, only on Ubuntu22, but
we avoid under any QEMU for now to unblock progress.)

We'd prefer to use TRY_EXCEPT_ALLOW_NO_DCONTEXT, but proc_init() is
called prior to init-time signal handling being set up: and we'd need
to add SIGILL to the ones caught at init time, which complicates later
uses of SIGILL for NUDGESIG_SIGNUM and suspend_signum (and on x86
XSTATE_QUERY_SIG): so we'd want SIGILL to only work for try-except at
init time. That is all a little too involved to implement right now so
we're putting in this disabling to unblock Ubuntu22 progress for #7270.
Long-term we probably want to put in the try-except, so we'll leave
#7315 open.

Tested on a local Ubuntu22 x86 machine with aarch64 cross-compilation
where every test failed before this fix and nearly all pass now (the
others fail for other reasons masked by this SIGILL before).

Issue: #7270, #7315
@derekbruening derekbruening removed their assignment Mar 5, 2025
derekbruening added a commit that referenced this issue Mar 18, 2025
Disables the AArch64 proc_init() mrs of ID_AA64MMFR2_EL1 under
STANDALONE_UNIT_TEST to avoid a fatal SIGILL in unit_tests under QEMU.

Tested on an Ubuntu 22 machine in an AArch64 cross-compile build where
unit_tests dies with SIGILL without this change and passes with it.

Issue: #7315