-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't save session until after message fully handled? #18
Comments
This decrypt variant should fit your needs: It calls the callback first, and only after the callback has completed the updated session is saved. If the process is killed or power goes down the message should remain decryptable.
|
That seems perfect. It does not seem to offer that option currently: I can add a PR that adds an optional callback parameter to that function however unless I am missing it elsewhere. Would need a different callback signature and to perform the same actions with stripping. This all is happening inside of a lock which isn't the best but I think for apps that want that sort of reliability it would be a benefit. |
You are right, libsignal-service-java and dotnet don't use it. Since this should affect Signal-Android, maybe we should open an issue there and observe how OWS will fix it? This should™ be reliably reproducible by placing a breakpoint in Signal-Android's PushDecryptJob's handleMessage and killing the process. |
Sure, I don't have an android device to test on but could open the bug as I don't see how a crash right after that wouldn't cause an issue. In the mean time I have implemented this callback support. By default no change, however if you pass a callback function it goes that route instead. I could make the callback decrypt have a different name if desired. I do not still return the valid data if you call the callback to avoid double processing. Here is a comparison. It has whitespace and syntax issues for this repo so if you want the PR I will clean it up. I have tested and it nicely handles a crash during processing with just re-processing on next call: |
So this is odd. While my code above was handling the crash fine normally when you restarted the client you would get the message replayed to you by the server and everything would increment. I am not sure if this is due to the protocol upgrade but is an interesting change. I am not quite sure how to recover the lost message efficiently (it at least doesn't break the ratchet protocol). |
Confirming the issue further. If a message comes in from a new contact (isPreKeySignalMessage: True isSignalMessage: False) and there is a crash during SaveIdentity. The message is dropped (and receive pipe crashes due to exceptional intentionally). App is restarted The next message from the user will come in the same way. A crash again at the same point will just drop the message. It is odd as the server now only pushes the message once, however the client doesn't end up in a key state problem even though its lost the first several messages from the contact. With the old protocol it was very failure resilient using the callback no corruption (or invalid states) and messages that were being handled during the crash were replayed on reconnect. Now while it is still not going to get into an invalid state the messages can be dropped which certainly seems like a bad bug. |
Alright I have found a problem with callback usage. https://github.com/signal-csharp/libsignal-protocol-dotnet/blob/3064f9f2432c4e5a6e57f7f8e6311ad85947c215/libsignal-protocol-dotnet/SessionCipher.cs#L180 it doesn't go to store the session until after the callback (point of the callback). This works well most of the time, except if it is a prekeymessage. Then the session data for later function calls (ie: sessionCipher.getSessionVersion) is not yet added to the session and the application will throw an exception with message "No session" when the callback goes to make any calls that would use it. I am not sure how to proceed: |
Is that actually happening? If so, where are those calls? |
Sure. So right now libsignalservice does not expose a callback option. I added one here: https://github.com/signal-csharp/libsignal-service-dotnet/compare/master...mitchcapper:callback_decode_support?expand=1 for prototyping. All is well except for we have to call GetStrippedMessage which is really the end of the original function but it calls: Now we could easily work around this one limitation either by having the protocol library decrypt message pass us the new sessionRecord or figure out the version another way. Personally I like the idea of passing the new session record if we need it elsewhere but I am not sure if that is the optimal solution or there is a better way to do all of this. |
I added a handler to the protocol library to pass the new session record. If this sounds like a good solution I can put together a PR implementing all these features. Granted with the protocol changes the lack of replay means messages still could be silently dropped but I don't know how to fix that. |
What exactly do you mean with protocol changes and lack of replay? I just started up S-W, breaked in OnMessage (before the call to .Decrypt), killed the app, restarted it and the message arrived again just fine. Has the server's "repush" become unreliable? |
Ha!! I finally figured it out. I think it was a race condition before of how far it got before the app terminated, and explains why break and terminate wouldn't show the same issue. libsignal-service-dotnet/libsignal-service-dotnet/SignalServiceMessagePipe.cs Lines 69 to 76 in cdffc47
is the problem. The confirmation is in a finally block. If handling a message throws an exception it kicks off the ack to the server, depending on how fast shutdown is would determine if it made it or not. PR #33 fixes this. |
I am afraid that is working as intended - consumers get one chance to handle the message. If the OnMessage callback finishes with anything except a TaskCancelledException, we confirm the message since the frontend's handler has terminated. libsignal-service-java is doing the same. If the process is being killed, the finally block should naturally not be executed and the message will be replayed the next time you connect. But this is not related to the session state problems - if you kill the process after the decrypt call, you will be unable to decrypt the message even if it is replayed by the server. |
I am not sure why if the process terminates after the decrypt call you would be unable to decrypt the replayed message? The only reason you can't decrypt is if the local counter has already been updated (then you would get a duplicate old message). The problem here is this is a race condition. It is a race condition that can at least be worked around (for example if client has an exception it can manually set the cancel token to make the library truly abort out). If an exception occurs during message handling my feeling is it should be treated as transnational, everything is rolled back and processing can try again (that way if it is a intermittent issue the message is not lost). Also, I think even if OnMessage through TaskCancelled it would still send, the OnMessage needs to specifically cancel that token out. There are work arounds either way, but from a reliability point currently I would say it is murky as to the expected behavior (and what would actually happen). |
Right now, the counter is updated during the decrypt call, so if the process crashes after decrypt and before confirmation you will receive a replay which you can't decrypt.
I am not certain that I understand. The client's callback is called, and unless execution returns from the callback it is not confirmed. The client can chose to
If it does return, the client has had plenty of time to handle the message and we confirm it. If the confirmation doesn't make it to the server it does not matter - Decrypt will tell you that it is a duplicate message.
Oh you are absolutely correct, you'd have to cancel the pipe's token. |
Ah! Soooo the reason this is (potentially) important is that if an exception is thrown during the decrypt Callback (so before the the counter is updated) this could ack the message on the server despite the fact the client could try again. Here is a sample flow that would be bad: A user can work around this by try/catching in the OnMessage handler if an exception happens then manually setting the cancel token and then throwing it. |
If your storage is unavailable for a moment, you should either wait until it comes back (so the OnMessage callback won't finish and libsignal won't ack the message until OnMessage does finish) or cancel the token and throw anything, which will close the pipe and not ack the message. But why does an exception in Decrypt make your process exit? It really should not, every message should be acked unless you cancel the token or kill the environment. Are we talking about normal managed exceptions or a call to Environment.Exit? |
Sorry we never finished this discussion. Normal managed exceptions. The problem is the potential expectation that throwing an exception would terminate right away and further action would not occur, instead the try/catch with ack causes something unexpected. There are work around (setting the cancel token, or hard killing the process). I think avoiding the exception handling might be a good strategy but also happy to finish this callback method support (as it is in the native code anyway) and remove the async changes. |
I am not sure if this should go here or in libsignal-protocol-dotnet. I am not sure if this is even something that is desired. Right now https://github.com/signal-csharp/libsignal-protocol-dotnet/blob/master/libsignal-protocol-dotnet/SessionCipher.cs decrypt (one of the many) saves the session with the new counter during the decrypt stage.
If the client crashes while handling the message (say somewhere in IMessagePipeCallback::OnMessage) then the counter is wrong and on next restart it tries to re-handle the message and throws DuplicateMessageException.
It might be nice for us to not update the counter until the message is fully handled. While it is easy enough to ignore DuplicateMessageException it would be better if we didn't increment the counter until after the message is fully handled to allow recovery of crashed messages.
Interested in thoughts. May cause some async issues, but may make clients able to recover from failures more easily as a message has to be successfully handled first.
The text was updated successfully, but these errors were encountered: