Skip to content

RX window timing depends on LoRaMacProcess call timing #1349

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
matthijskooijman opened this issue Aug 31, 2022 · 11 comments
Closed

RX window timing depends on LoRaMacProcess call timing #1349

matthijskooijman opened this issue Aug 31, 2022 · 11 comments

Comments

@matthijskooijman
Copy link

Since commit 60fb1b3 the RX window timers are no longer started in the TxDone interrupt handler, but deferred to ProcessRadioTxDone() (called by LoRaMacProcess() from the main loop). This means there is an arbitrary delay (that depends on how often the application calls LoRaMacProcess(), but is necessarily non-zero) between the TxDone interrupt and the timers being started by ProcessRadioTxDone(). However, the timers are started as follows:

TimerSetValue( &MacCtx.RxWindowTimer1, MacCtx.RxWindow1Delay );
TimerStart( &MacCtx.RxWindowTimer1 );
TimerSetValue( &MacCtx.RxWindowTimer2, MacCtx.RxWindow2Delay );
TimerStart( &MacCtx.RxWindowTimer2 );

So relative to the moment of starting the timer, not relative to the TxDone timing, so the window is always more or less later than intended.

I have seen this happen in practice with the STM32CubeWL stack, which is a port of LoRaMac-node and also contains this same code. Debug output shows that LoRaMacProcess() is called about 20ms after TxDone, and the RX window is late by the same amount of time. I have not verified exact timing with a scope (nor tested on this upstream code - I'm not sure if I have any hardware available for that), but the problem seems clear enough from looking at the code to report it already now.

This is probably easy to fix: Just set the timer relative to TxDoneParams.CurTime (which is set by the TxDone ISR already) instead of relative to now. Since the timer API only seems to support setting timers relative to now (there is no way to pass an absolute time), I guess this means something like:

 uint32_t offset = TimerGetCurrentTime() - TxDoneParams.CurTime;
 TimerSetValue( &MacCtx.RxWindowTimer1, MacCtx.RxWindow1Delay - offset ); 
 TimerStart( &MacCtx.RxWindowTimer1 ); 
 TimerSetValue( &MacCtx.RxWindowTimer2, MacCtx.RxWindow2Delay - offset); 
 TimerStart( &MacCtx.RxWindowTimer2 ); 

I think the above code should also be ran atomically (All ISRs disabled) to ensure no significant time passes between the above call to TimerGetCurrentTime() and the moment the timer code makes the timestamp absolute:

elapsedTime = RtcGetTimerElapsedTime( );
obj->Timestamp += elapsedTime;

This atomicity could be prevented if the timer code gained an interface to start a timer for an absolute instead of relative time.

matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 13, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 14, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 14, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 14, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 14, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
@descartes
Copy link

Would you like me to send you something based on STM32WL?

@matthijskooijman
Copy link
Author

@descartes, I'm not sure I follow your offer? I already have code from the STM32CubeWL package working and applied the fix suggested above, which indeed seems to work to fix the problem (I haven't solved atomicity though). I've also confirmed the problem actually exists, by inserting a 100ms delay before each call to LoRaMacProcess()/

@descartes
Copy link

I was referring to any hardware you need - boards with STM32WL yes, new oscilloscope not so much.

@matthijskooijman
Copy link
Author

Ah, I have 2 Nucleo STM32WL55JC boards, so I'm good :-)

What would be helpful, I think, is for someone to reproduce this issue (and try the fix) with the original LoRaMac-node code (instead of the STM32CubeWL code which is modified by ST). I'm lacking hardware, a test setup and time for that...

@mluis1
Copy link
Contributor

mluis1 commented Sep 21, 2022

@matthijskooijman Thanks for the finding and fix proposition.

Following your recommendation I have applied below patch

diff --git a/src/mac/LoRaMac.c b/src/mac/LoRaMac.c
index edb172cc..3a102afe 100644
--- a/src/mac/LoRaMac.c
+++ b/src/mac/LoRaMac.c
@@ -810,11 +810,15 @@ static void ProcessRadioTxDone( void )
     {
         Radio.Sleep( );
     }
+
     // Setup timers
-    TimerSetValue( &MacCtx.RxWindowTimer1, MacCtx.RxWindow1Delay );
+    CRITICAL_SECTION_BEGIN( );
+    uint32_t offset = TimerGetCurrentTime( ) - TxDoneParams.CurTime;
+    TimerSetValue( &MacCtx.RxWindowTimer1, MacCtx.RxWindow1Delay - offset );
     TimerStart( &MacCtx.RxWindowTimer1 );
-    TimerSetValue( &MacCtx.RxWindowTimer2, MacCtx.RxWindow2Delay );
+    TimerSetValue( &MacCtx.RxWindowTimer2, MacCtx.RxWindow2Delay - offset );
     TimerStart( &MacCtx.RxWindowTimer2 );
+    CRITICAL_SECTION_END( );
 
     if( MacCtx.NodeAckRequested == true )
     {

I have run a quick test using a NucleoL476+SX1276MB1LAS platform.
To perform the quick test I have used the LCTT tool (LoRa-Alliance certication tool) and run the test verifying RXTimingSetupReq MAC command.
The test ran successfully. Please refer to rx_timing_setup_req_test_log.zip for the test tool report.

We will push the fix as soon as we have finalized the review process and run further tests.

@mluis1 mluis1 added this to the Release Version 4.7.0 milestone Sep 21, 2022
@matthijskooijman
Copy link
Author

Cool, that matches the fix I applied to the STM32CubeWL code as well (except I omitted the critical section for now).

Did you also confirm the problem by adding some artificial delay? Easiest is probably to just add a delay inside LoRaMacProcess at the start, something like 100ms (must not be too small, since there is already a fixed 20ms margin in the RX timing, and at higher SF's the margin is even more, so I mainly tested at SF7).

About the critical section - I haven't measured, but my intuition is that this produces a fairly long criticial section that would add problematic latency for some kinds of interrupts. I would suggest, as an additional improvement, considering to revise the Timer API adding something like TimerStartAbs() or so where you pass an absolute timestamp (and ignore the value passed to TimerSetValue). If this would be available, using it could remove the need for the critical section completely AFAICS. I'm not sure how the timer util API is maintained and developed, maybe changing that is a more complicated process, but it looks like this would be a change in timer.c/h only, without needing changes to the underlying Rtc*() API.

@mluis1
Copy link
Contributor

mluis1 commented Sep 22, 2022

Please find in following zip the result for some Rx windows tests. A 100 ms delay has been added just before the CRITICAL_SECTION_BEGIN call. The performed tests verify all supported data rates.
rx_window_test_log_with_100_ms_delay.zip

The time spent inside the critical section will depend on the number of timers that have been added to the linked list. At a given time I think that the number of timers in the list will be small. However, this will depend deeply on the application being built.

For the moment it should be quite safe to keep the critical section.

Adding a new API allowing to start a timer with absolute time is indeed a good idea for this kind of operation.
As we are getting closer to execute v4.7.0 release we will postpone such API addition to a future release.
In order to not forget to add this API we will create an issue under our internal Git server as a reminder.

@matthijskooijman
Copy link
Author

For the moment it should be quite safe to keep the critical section.

Adding a new API allowing to start a timer with absolute time is indeed a good idea for this kind of operation.
As we are getting closer to execute v4.7.0 release we will postpone such API addition to a future release.

Agreed, and sounds good. Thanks!

matthijskooijman added a commit to stm32duino/STM32LoRaWAN that referenced this issue Sep 23, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
fpistm pushed a commit to stm32duino/STM32LoRaWAN that referenced this issue Oct 6, 2022
For the RX window timing, a timer is started after TxDone. However,
since this is done in the mainloop rather than the TxDone ISR, the exact
timing of the RX window depends on how soon after the interrupt
LoRaMacProcess() is called (but it will always be at least a little too
late).

The TxDone interrupt already captures a timestamp, so this commit
uses that to correctly position the RX window. There is still a small
race condition where interrupts could still cause a delay, but the
influence of that should be small and rare enough in practice, so this
is left unfixed here (a proper fix would require changes to the timer
subsystem, which are a bit too invasive to make here).

This fixes Lora-net/LoRaMac-node#1349
@slavendam
Copy link

@matthijskooijman
Did you have issues similar to this: stm32duino/STM32LoRaWAN#38
Can it be that some value overflows when IRQ handler/timer is set? Or maybe something connected with RTC timer? I did lot of debuging, but don't have an idea what can be the source of issue...

@matthijskooijman
Copy link
Author

@slavendam Hm, you make a good point. When ProcessRadioTxDone() (through LoRaMacProcess(), or modem.maintain() in the ST32Duino library) is called very late (i.e. >= 1s after the interrupt), offset could end up being more than MacCtx.RxWindow1Delay, resulting in an underflow in this line:

 TimerSetValue( &MacCtx.RxWindowTimer1, MacCtx.RxWindow1Delay - offset );

Which would lead to effectively indefinitely postponing the RX1 handler.

@slavendam Could it be that your sketch is indeed not calling modem.maintain() for a long time?

You could confirm that this is the case, by printing the value of MacCtx.RxWindow1Delay and offset here.

@mluis1 I wonder if it would be worthwhile to catch and handle this case (by cancelling the timers and wrapping up the transaction, I guess? Could be done after the critical section)? Even though it is not really useful to support applications that call LoRaMacProcess() so seldomly, if an application accidentally delays calling it on a rare occasion, it would be good if the stack did not freeze up...

@matthijskooijman
Copy link
Author

From the followup in the other issue, it turns out that the issue @slavendam is seeing is not caused by offset underflowing. I do still think the underflow issue exists, but it is probably obscure enough to just ignore it. I'll leave that here for consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants