Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for NMIs to ot_earlgrey #111

Draft
wants to merge 11 commits into
base: ot-earlgrey-9.1.0
Choose a base branch
from

Conversation

AlexJones0
Copy link

This PR adds support for Non-Maskable Interrutps (NMIs) to ot_earlgrey. You can see more detail in each individual commit message. Note that the first three commits are instead from #109 and #110, and are just included as the base ot_aon_timer and ot_alert functionality is needed to test that the NMIs are working properly. This PR is accordingly marked as a draft, but can be reviewed by simply ignoring the three commits before b3e8936. These will be removed as that functionality is added.

To get the NMI tests passing properly, a few commits introduce other changes outside of just NMI support. Most notably, I: switch to use QEMU IRQs instead of Ibex IRQs for alerts (to allow correct propagation of multiple alert signals without clearing the signal in-between), and only clear pending IRQs at the PLIC when they are claimed.

The NMI implementation changes to target/riscv are intended to be minimialistic and implementation-agnostic such that they could be utilised by another RISC-V machine to implement their own NMIs. Refer to section "3.5. Non-Maskable Interrupts" of the RISC-V privileged spec and Ibex's Exceptions and Interrupts documentation, alongside the commit messages, for an explanation of the required changes.

I have verified these changes against ~50 known existing passing tests. These changes have also been tested to make previously failing tests pass: aon_timer_smoketest, and rv_core_ibex_nmi_irq_test (running with an icount shift of icount=7). All tests were built from master on OpenTitan.

For earlgrey-1.0.0, OpenTitan's aon_timer has been updated so that it's
wakeup timer is now a 64-bit counter rather than a 32-bit counter
(whereas the watchdog timer remains 32-bit). This requires
the implementation of register mapping changes with each wakeup count /
threshold register now having a HI/LO counterpart.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Connect the alerts for each of the OpenTitan Earlgrey devices to the
alert handler. This is based on the mappings found in the autogenerated
`hw/top_earlgrey/sw/autogen/top_earlgrey.h` file for the Earlgrey top
found in the OpenTitan repository.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Connects GPIO signals for the four different escalation severities of
OpenTitan Earlgrey's Alert handler.

For now, escalation signal 3 (phase 3) connects to the pwrmgr, where it
causes a shutdown as is the default in Darjeeling. In reality, the alert
handler should cause a reset in phase 3 and populate the rstmgr's
`reset_info` accordingly, but this requires additional work in other
blocks, and as such we simply leave this connected to the pwrmgr
shutdown for now to match Darjeeling's functionality.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Ibex IRQs are defined using the `ibex_irq_set` function such that they
are optimised to only propagate the IRQ signal if there has been a
change in the level (i.e. they are edge-triggered). Despite this, it is
possible for alerts to be sent from harwdare without previously
resetting the alert, for example upon a hardware fault. This behaviour
can be observed in several OpenTitan tests which force alerts several
times without resetting the state of the block in-between.

One possible solution would just be to send a signal with level=1
followed by a signal with level=0, but the more efficient and
maintainable solution is just to use the original `qemu_irq` for this
purpose, as the IbexIRQ wrapping functionality does not make sense in
this case.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Adds support for Non-Maskable Interrupts (NMIs) to the RISC-V QEMU
CPU implementation. NMIs are defined by the RISC-V specification to be
independent to regular interrupts, used only for hardware error
conditions and causing an immediate jump to an implementation-defined
NMI vector running in M-mode, regardless of the values of e.g. MIE/MIP.

The spec states that "the values written to mcause on an NMI are
implementation-defined"; in Ibex we always use an mcause of 31 (and a
corresponding mtvec offset of 0x7C to determine the jump address for the
NMI handler). To aid a more generic implementation, this commit
introduces a generic `nmi_cause` CPU environment value that can be used
alongside the NMI functionality to allow implementations to change the
mcause as they see fit.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Adds support for NMIs to Earlgrey's Ibex wrapper, allowing incoming NMI
IRQ signals to be received by the device, which will then be propagated
through an external IRQ signal to the Ibex itself (the hart) with the
newly introduced IRQ_NMI value. This will allow support for hardware
raising NMIs via GPIO signals between the QEMU devices, such that NMI
support can be added to Earlgrey.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
This initialises Ibex within Earlgrey with the newly defined `nmi_cause`
value as 31, as NMI mcause values are implementation-defined, and Ibex
makes the decision to implement all NMIs with a cause of 31, leading to
an offset of 0x7c from the mtvec base being loaded into mepc upon an NMI
occuring.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
The aon timer should send an NMI signal to Ibex when its watchdog timer
barks. Now that the NMI functionality is in place, we can appropriately
hook up this signal, which should appear before the equivalent IRQ.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
In escalation phase 0, the alert handler sends an NMI to the CPU, which
will interrupt execution so long as the NMI is enabled in Ibex itself.
As such, we connect this functionality in the alert handler.

See: https://opentitan.org/book/sw/device/silicon_creator/rom/doc/shutdown.html#alert-escalation-phase-actions

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
See the comment for more details; to allow for proper NMI functionality
based on regular IRQ and NMI handling routines, we must ensure that in
`sifive_plic_irq_request`, we only clear an IRQs pending status when
that IRQ has been claimed. Failure to do so can mean we clear an IRQ at
a peripheral as part of an NMI service routine and as such never call
into the IRQ handler when the NMI handler returns, as would be expected.
This commit updates the SiFive PLIC so that it works as expected.

Signed-off-by: Alex Jones <alex.jones@lowrisc.org>
@rivos-eblot
Copy link

I think we'll need to some time to review this MR. We also need to check if there is no prior NMI support upstream (I remember working on resumable-NMI for QEMU RISC-V with another project). A couple of notes:

  • you cannot neither add OpenTitan references to sifive_plic.c, as it is a generic file, nor OpenTitan specific behavior there.
  • I'm not sure to understand the requirement for reverting to qemu_irq for alerts. Shall we need it, there are more changes to implement in every modified file to avoid propating no-change to the PLIC every time the alert value is re-evaluated to the very same value. This is an issue performance wise, and this is a nightmare when it comes to debugging the PLIC :). Can you elaborate a bit more on this requirement? Thanks.
  • we need to check how this impact other features such as the dret instruction

@AlexJones0
Copy link
Author

AlexJones0 commented Jan 22, 2025

We also need to check if there is no prior NMI support upstream

When I started working on this I checked upstream and couldn't find any prior support. Checking again now it does appear that (at-least partial) support for this may have been added just 4 days ago. So it might be more appropriate to move onto the upstream implementation, though I don't know what extra work will be required there with regards to Ibex's implementation specifics, nor how much work it will be to pull in recent upstream changes.

you cannot neither add OpenTitan references to sifive_plic.c, as it is a generic file, nor OpenTitan specific behavior there.

I guess in that case the alternative to make this fix is to have all PLIC signals go through the PLIC_EXT first, which then propagates them. But this would require duplicating a lot of the PLIC functionality outside of the PLIC, to the point that we would be essentially implementing a new PLIC anyhow. I see that we have already added e.g. tracing to the SiFive PLIC. Importantly, this is a case where OpenTitan's functionality seems to differ from the generic RISC-V PLIC. I'm not sure what would be best to do here.

I'm not sure to understand the requirement for reverting to qemu_irq for alerts ... Can you elaborate a bit more on this requirement? Thanks.

When I was looking at the rv_core_ibex_nmi_irq_test, that test has multiple iterations in which it triggers an NMI by using the ALERT_TEST register of the pwrmgr to force a Fatal Fault alert. It repeatedly forces the alert, without resetting the state of the ALERT_TEST register in-between. In our current RTL, this is fine: each write to the alert_test register transmits a new alert to the alert handler, triggering the NMI. In our current QEMU model, the first transmission sends a 1, but subsequent writes still just set the IRQ to 1, meaning that subsequent alerts are not transmitted (as IbexIRQ is effectively edge-triggered).

This is why in this case I judged it appropriate for these specific alert signals to the alert handler to change them to use qemu_irq, as the additional IbexIRQ optimisation does not make sense here. An alternative might be setting the alert signal back to 0 after setting it to 1, i.e. qemu_set_irq(s->alert, (int)(bool)val32); qemu_set_irq(s->alert, 0u);, but that seems less elegant and more like a workaround compared to the qemu_irq approach. I see what you mean about how further changes to other alert cases might be needed on specific IP blocks, however.

we need to check how this impact other features such as the dret instruction

This is definitely worth considering; I've only considered mret and am not sure if this would have any impact on other instructions that I haven't thought about.

@rivos-eblot
Copy link

When I started working on this I checked upstream and couldn't find any prior support. Checking again now it does appear that (at-least partial) support for this may have been added just 4 [days ago]
(qemu@c1149f6).

:-)

I guess in that case the alternative to make this fix is to have all PLIC signals go through the PLIC_EXT first,

Do you mean ot_plic_ext.c? This was indeed a way to add OT-specific PLIC feature without altering the generic implementation. It only works as it uses another memory range, it does not alter the behavior of the PLIC.

I'm not sure what would be best to do here.

The sifive_plic.c supports an edge-triggered implementation, which can be selected through a property, but I'm not sure this is what we are looking for. Is the OT PLIC an edge-triggered PLIC or a level-triggered one?

I have not read about the NMI support in OT. As RISC-V NMI are implementation-specific, I think we'll have a hard time to fit it into QEMU without breaking the other machines.

It repeatedly forces the alert, without resetting the state of the ALERT_TEST register in-between. In our current RTL, this is fine: each write to the alert_test register transmits a new alert to the alert handler, triggering the NMI.

How does that work at RTL level? It would nice to have a look at some waves. Maybe writing to ALERT_TEST creates an edge (i.e. the next clock cycle reset the level of the alert line?). I've been asking myself this question for about 2 years and never took time to have a look at it. Rather, I tried to have a look at it, but gave up to find how to produce waves and observe them with gtkwave, then moved to another topic 😞)

@AlexJones0
Copy link
Author

AlexJones0 commented Jan 22, 2025

Do you mean ot_plic_ext.c? ... It only works as it uses another memory range, it does not alter the behavior of the PLIC.

Yes, but to make this work we would have to replicate a lot of the existing PLIC functionality in the plic_ext to have the needed values be accessible, essentially composing the PLIC and making the plic_ext wrap it to add OT-specific functionality.

The sifive_plic.c supports an edge-triggered implementation, which can be selected through a property, but I'm not sure this is what we are looking for. Is the OT PLIC an edge-triggered PLIC or a level-triggered one?

As far as I can tell, the OT PLIC is both edge-triggered and level-triggered, dependent upon the hardware interrupt source: "It receives interrupt events as either edge or level of the incoming interrupt signals". That is a separate issue though, I think - the primary concern here is that interrupts can be acknowledged at the peripheral, but still be pending and awaiting claim/completion in the PLIC. Based on my understanding, current behaviour means that for example:

  1. Watchdog NMI is enabled.
  2. Watchdog timer barks.
  3. An NMI and IRQ should be raised.
  4. The NMI is handled first, calling into the NMI handler.
  5. The NMI handler acknowledges the watchdog bark at the peripheral, to clear the NMI.
  6. The NMI handler returns. The bark IRQ was cleared at the peripheral, but is still pending at the PLIC, so now the IRQ is handled, calling into the IRQ handler.
  7. The IRQ handler claims the IRQ at the PLIC, clearing it there as well.
  8. The IRQ handler returns and regular execution continues.

In the current PLIC implementation, part 6 is not correct to the OT implementation - the NMI handler acknowledges the IRQ at the peripheral, which sends a signal to the PLIC, causing it to also clear the IRQ, meaning that steps 6-7 do not properly occur. It appears that OpenTitan has custom behaviour where an IRQ that is enabled, and pending but not claimed cannot be cleared by a change in the incoming IRQ signal from the peripheral.

To me the correct solution is implementing an OpenTitan PLIC that extends this generic RISC-V sifive_plic. In this case maybe the more appropriate way to do that is to implement a callback/hook to the sifive_plic, that would allow only OT to implement this added check, whereas all other machines can continue on as normal?

How does that work at RTL level? It would nice to have a look at some waves. ... but gave up to find how to produce waves and observe them with gtkwave ...

I'm not sure either; I've been likewise unsuccessful in my attempts at getting waves with OT.

However, digging a bit into the RTL, I think I've found something elucidating. If I'm interpreting this comment correctly, it seems that regular alerts are latched in a local register until they are reset. On the contrary, it seems like test alerts specifically are not latched until reset, which explains the behaviour I've been seeing and am trying to implement.

Based on this comment, perhaps what I had deemed to be the 'less appropriate' solution is in fact more appropriate here: specifically for writes to alert_test, we just immediately clear the signal (with qemu_set_irq(s->alert, (int)(bool)val32); qemu_set_irq(s->alert, 0u);), such that the alert signal is seen but can be sent multiple times via ALERT_TEST, or followed by a subsequent real alert which must be reset. Does this sound sensible to you?

@rivos-eblot
Copy link

Yes, but to make this work we would have to replicate a lot of the existing PLIC functionality in the plic_ext to have the needed values be accessible, essentially composing the PLIC and making the plic_ext wrap it to add OT-specific functionality.

Maybe we could implement what the RISC-V specification documents as the "interrupt gateway", i.e. something between the device that generates the IRQ and the PLIC controller itself. In any case, we cannot change the behavior of sifive_plic.c. We just need to find the best way to add the missing features with the less hassle :)

As far as I can tell, the OT PLIC is both edge-triggered and level-triggered, dependent upon the hardware interrupt source: "It receives interrupt events as either edge or level of the incoming interrupt signals". That is a separate issue though, I think.

If the interrupt was handled as edge-triggered, the PLIC would still see the IRQ has raised even once the peripheral had lowered its signalling line, would it not?

BTW, if I understand it right, the NMI is not a real RISC-V NMI - which are not resumable for RISC-V -, it is a more a custom implementation of an RNMI (resumable NMI), isn't it? So there are IRQs, ~RNMI and Alerts. Wow...

  1. The NMI handler returns. The bark IRQ was cleared at the peripheral, but is still pending at the PLIC, so now the IRQ is handled, calling into the IRQ handler.

How does that work at RTL level? It would nice to have a look at some waves. ... but gave up to find how to produce waves and observe them with gtkwave ...

I'm not sure either; I've been likewise unsuccessful in my attempts at getting waves with OT.

I think we need to ask for help on this one, so that QEMU sticks as much as possible to the actual behavior.

However, digging a bit into the RTL, I think I've found something elucidating. If I'm interpreting this comment correctly, it seems that regular alerts are latched in a local register until they are reset. On the contrary, it seems like test alerts specifically are not latched until reset, which explains the behaviour I've been seeing and am trying to implement.

Yes, that would make sense.

Based on this comment, perhaps what I had deemed to be the 'less appropriate' solution is in fact more appropriate here: specifically for writes to alert_test, we just immediately clear the signal (with qemu_set_irq(s->alert, (int)(bool)val32); qemu_set_irq(s->alert, 0u);), such that the alert signal is seen but can be sent multiple times via ALERT_TEST, or followed by a subsequent real alert which must be reset. Does this sound sensible to you?

Yes, I think this should be ok. I'm wondering if I have not already implemented this somewhere in Darjeeling BTW (I tend to forget most of the work I do....). Nit pick, not an unsigned 0, as IRQs are signed integers for some reason in QEMU.

@AlexJones0
Copy link
Author

AlexJones0 commented Jan 22, 2025

Maybe we could implement what the RISC-V specification documents as the "interrupt gateway", i.e. something between the device that generates the IRQ and the PLIC controller itself. In any case, we cannot change the behavior of sifive_plic.c.

This sounds like what I was suggesting we make the plic_ext into above. It seems reasonable given that we do not want to modify sifive_plic.c, but the current behavioural changes require access to the internals of the sifive_plic (specifically, plic->claimed and plic->pending).

I wonder if there's an easy way we could either (a) make a separate interrupt gateway device that has access to these internals, or (b) make a PLIC "wrapper" that composes the sifive_plic and can access its internals, but essentially just acts as you describe here.

The main issue I'm trying to avoid here is replicating the sifive_plic internals inside this interrupt gateway to allow access to these claimed and pending values, which seems less than ideal.

If the interrupt was handled as edge-triggered, the PLIC would still see the IRQ has raised even once the peripheral had lowered its signalling line, would it not?

Yes, I think that you are correct. Making the PLIC edge-triggered would seem like the most satisfactory solution then, but I'm not sure this is 100% correct, as in OpenTitan we have both edge-triggered ("event type") and level-triggered ("status" type) interrupts. From my understanding, status type interrupts like for example the UART tx_watermark/rx_watermark are level-triggered. Does the sifive_plic provide the ability for us to designate edge/level triggered per interrupt, or just globally, for the entire PLIC?

BTW, if I understand it right, the NMI is not a real RISC-V NMI - which are not resumable for RISC-V -, it is a more a custom implementation of an RNMI (resumable NMI), isn't it? So there are IRQs, ~RNMI and Alerts. Wow...

Yes, as far as I can tell these are not UNMI as defined in the RISC-V spec (though they are close to the implementation), nor are these RNMI from the Smrnmi extension (we don't have those registers in Ibex, and as far as I can tell, that is what is being used in the upstream support) - likely because that was only ratified in June 2024. So yes, I think Ibex does indeed have a custom implementation of RNMI. Specifically, based on the Ibex docs, we actually base our CSRs on a (quite old) proposal: riscv/riscv-isa-manual#261.

So reading that I am still not sure my implementation is 100% correct: I think I need to introduce additional Ibex-specific CSRs for backing up the MPP/MIE/MCAUSE etc. values, to account for the case where an NMI happens while handling another interrupt?

@rivos-eblot
Copy link

I wonder if there's an easy way we could either (a) make a separate interrupt gateway device that has access to these internals, or (b) make a PLIC "wrapper" that composes the sifive_plic and can access its internals, but essentially just acts as you describe here.

I would rather see something in front of the PLIC, but I've really overlooked this issue so I might be 100% wrong :)

Does the sifive_plic provide the ability for us to designate edge/level triggered per interrupt, or just globally, for the entire PLIC?

Unfortunately this is a global setting.
However I really think we should have a look at the waves first (or someone from the HW team that can describe how the RTL implements the IRQ management), before going into a dev phase.

Between DJ and EG, IRQs of some devices (namely the SPI device) have moved from edge-triggered to level-triggered IRQ, but these changes where only about how the device manages its IRQ signals, not how they are handled by the PLIC. We need to better understand this topic before adding/changing more code I believe.

Yes, as far as I can tell these are not UNMI as defined in the RISC-V spec (though they are close to the implementation), nor are these RNMI from the Smrnmi extension

You are right. This means it's gonna be even more difficult to integrate into QEMU, as we do not want to break the other RISC-V implementation nor add vendor-specific code to generic files...

@AlexJones0
Copy link
Author

You are right. This means it's gonna be even more difficult to integrate into QEMU, as we do not want to break the other RISC-V implementation nor add vendor-specific code to generic files...

It unfortunately does not appear to be compatible with Smrnmi either: that extension creates additional CSRs for handling NMIs, whilst not changing the regular CSRs, whereas Ibex operates by backing up the regular CSRs into additional CSRs, and then changing the regular CSRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants