Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't save session until after message fully handled? #18

Open
mitchcapper opened this issue Apr 23, 2018 · 18 comments · May be fixed by #34
Open

Don't save session until after message fully handled? #18

mitchcapper opened this issue Apr 23, 2018 · 18 comments · May be fixed by #34

Comments

@mitchcapper
Copy link
Contributor

I am not sure if this should go here or in libsignal-protocol-dotnet. I am not sure if this is even something that is desired. Right now https://github.com/signal-csharp/libsignal-protocol-dotnet/blob/master/libsignal-protocol-dotnet/SessionCipher.cs decrypt (one of the many) saves the session with the new counter during the decrypt stage.

If the client crashes while handling the message (say somewhere in IMessagePipeCallback::OnMessage) then the counter is wrong and on next restart it tries to re-handle the message and throws DuplicateMessageException.

It might be nice for us to not update the counter until the message is fully handled. While it is easy enough to ignore DuplicateMessageException it would be better if we didn't increment the counter until after the message is fully handled to allow recovery of crashed messages.

Interested in thoughts. May cause some async issues, but may make clients able to recover from failures more easily as a message has to be successfully handled first.

@Trolldemorted
Copy link
Contributor

Trolldemorted commented Apr 23, 2018

This decrypt variant should fit your needs: It calls the callback first, and only after the callback has completed the updated session is saved. If the process is killed or power goes down the message should remain decryptable.

I'm rather certain that libsignal-service-dotnet uses it, are you sure the DuplicateMessageExceptions are caused by decrypt? Ciphertext duplications can happen on the wire, Signal-Android sometimes sends the ciphertext several times if it does not receive a confirmation from the server.

@mitchcapper
Copy link
Contributor Author

mitchcapper commented Apr 24, 2018

That seems perfect. It does not seem to offer that option currently:
https://github.com/signal-csharp/libsignal-service-dotnet/blob/master/libsignal-service-dotnet/crypto/SignalServiceCipher.cs#L133-L137

I can add a PR that adds an optional callback parameter to that function however unless I am missing it elsewhere. Would need a different callback signature and to perform the same actions with stripping. This all is happening inside of a lock which isn't the best but I think for apps that want that sort of reliability it would be a benefit.

@Trolldemorted
Copy link
Contributor

You are right, libsignal-service-java and dotnet don't use it.

Since this should affect Signal-Android, maybe we should open an issue there and observe how OWS will fix it? This should™ be reliably reproducible by placing a breakpoint in Signal-Android's PushDecryptJob's handleMessage and killing the process.

@mitchcapper
Copy link
Contributor Author

Sure, I don't have an android device to test on but could open the bug as I don't see how a crash right after that wouldn't cause an issue. In the mean time I have implemented this callback support. By default no change, however if you pass a callback function it goes that route instead. I could make the callback decrypt have a different name if desired. I do not still return the valid data if you call the callback to avoid double processing.

Here is a comparison. It has whitespace and syntax issues for this repo so if you want the PR I will clean it up. I have tested and it nicely handles a crash during processing with just re-processing on next call:
https://github.com/signal-csharp/libsignal-service-dotnet/compare/master...mitchcapper:callback_decode_support?expand=1

@mitchcapper
Copy link
Contributor Author

So this is odd. While my code above was handling the crash fine normally when you restarted the client you would get the message replayed to you by the server and everything would increment.
Now I can confirm settings are not being saved out before finish (so callback working) however the message shows as delivered on the sending client but no replay will occur. The next message goes off without issue as well (before the store request for updating the profile not happening would cause issues).

I am not sure if this is due to the protocol upgrade but is an interesting change. I am not quite sure how to recover the lost message efficiently (it at least doesn't break the ratchet protocol).

@mitchcapper
Copy link
Contributor Author

Confirming the issue further. If a message comes in from a new contact (isPreKeySignalMessage: True isSignalMessage: False) and there is a crash during SaveIdentity.

The message is dropped (and receive pipe crashes due to exceptional intentionally). App is restarted The next message from the user will come in the same way. A crash again at the same point will just drop the message.

It is odd as the server now only pushes the message once, however the client doesn't end up in a key state problem even though its lost the first several messages from the contact.

With the old protocol it was very failure resilient using the callback no corruption (or invalid states) and messages that were being handled during the crash were replayed on reconnect. Now while it is still not going to get into an invalid state the messages can be dropped which certainly seems like a bad bug.

@mitchcapper
Copy link
Contributor Author

Alright I have found a problem with callback usage. https://github.com/signal-csharp/libsignal-protocol-dotnet/blob/3064f9f2432c4e5a6e57f7f8e6311ad85947c215/libsignal-protocol-dotnet/SessionCipher.cs#L180 it doesn't go to store the session until after the callback (point of the callback).

This works well most of the time, except if it is a prekeymessage. Then the session data for later function calls (ie: sessionCipher.getSessionVersion) is not yet added to the session and the application will throw an exception with message "No session" when the callback goes to make any calls that would use it.

I am not sure how to proceed:
1) Give up on proper callback support. This at a might lead to corruption of state (although later versions of signal protocol don't seem to actually replay not acked messages) certainly leads to lost messages.
2) The only way I see for this to work is probably to change libsignal protocol at a minimum passing the new session to the callback and potentially just adding it before the receipt is sent or try to rework the functions we need to use to fully decode the message during callback to not cause an issue with the session not added to the local cipher yet.
3) For pre-key messages don't allow callback usage force the old route. It cuts down on the vulnerable service but isn't perfect.

@Trolldemorted
Copy link
Contributor

Then the session data for later function calls (ie: sessionCipher.getSessionVersion) is not yet added to the session and the application will throw an exception with message "No session" when the callback goes to make any calls that would use it.

Is that actually happening? If so, where are those calls?

@mitchcapper
Copy link
Contributor Author

Sure. So right now libsignalservice does not expose a callback option. I added one here: https://github.com/signal-csharp/libsignal-service-dotnet/compare/master...mitchcapper:callback_decode_support?expand=1 for prototyping.

All is well except for we have to call GetStrippedMessage which is really the end of the original function but it calls:
new PushTransportDetails(sessionCipher.getSessionVersion());

Now we could easily work around this one limitation either by having the protocol library decrypt message pass us the new sessionRecord or figure out the version another way. Personally I like the idea of passing the new session record if we need it elsewhere but I am not sure if that is the optimal solution or there is a better way to do all of this.

@mitchcapper
Copy link
Contributor Author

I added a handler to the protocol library to pass the new session record. If this sounds like a good solution I can put together a PR implementing all these features. Granted with the protocol changes the lack of replay means messages still could be silently dropped but I don't know how to fix that.

@Trolldemorted
Copy link
Contributor

What exactly do you mean with protocol changes and lack of replay?

I just started up S-W, breaked in OnMessage (before the call to .Decrypt), killed the app, restarted it and the message arrived again just fine.

Has the server's "repush" become unreliable?

@mitchcapper
Copy link
Contributor Author

mitchcapper commented Aug 17, 2018

Ha!! I finally figured it out. I think it was a race condition before of how far it got before the app terminated, and explains why break and terminate wouldn't show the same issue.

finally
{
if (!Token.IsCancellationRequested)
{
Logger.LogDebug("Confirming message {0}", request.Id);
Websocket.SendResponse(response);
}
}

is the problem. The confirmation is in a finally block. If handling a message throws an exception it kicks off the ack to the server, depending on how fast shutdown is would determine if it made it or not.
PR #33 fixes this.

@Trolldemorted
Copy link
Contributor

I am afraid that is working as intended - consumers get one chance to handle the message. If the OnMessage callback finishes with anything except a TaskCancelledException, we confirm the message since the frontend's handler has terminated. libsignal-service-java is doing the same.

If the process is being killed, the finally block should naturally not be executed and the message will be replayed the next time you connect.

But this is not related to the session state problems - if you kill the process after the decrypt call, you will be unable to decrypt the message even if it is replayed by the server.

@mitchcapper
Copy link
Contributor Author

I am not sure why if the process terminates after the decrypt call you would be unable to decrypt the replayed message? The only reason you can't decrypt is if the local counter has already been updated (then you would get a duplicate old message).

The problem here is this is a race condition. It is a race condition that can at least be worked around (for example if client has an exception it can manually set the cancel token to make the library truly abort out).
It is a race condition as our code is not synchronous the send message call is async. This means that if the aborts out fast enough (due to the exception) the server doesn't get the confirmation, but if there was another try catch at play (that say logs the exception to a remote network log server before aborting) it could.

If an exception occurs during message handling my feeling is it should be treated as transnational, everything is rolled back and processing can try again (that way if it is a intermittent issue the message is not lost).

Also, I think even if OnMessage through TaskCancelled it would still send, the OnMessage needs to specifically cancel that token out.

There are work arounds either way, but from a reliability point currently I would say it is murky as to the expected behavior (and what would actually happen).

@Trolldemorted
Copy link
Contributor

I am not sure why if the process terminates after the decrypt call you would be unable to decrypt the replayed message? The only reason you can't decrypt is if the local counter has already been updated (then you would get a duplicate old message).

Right now, the counter is updated during the decrypt call, so if the process crashes after decrypt and before confirmation you will receive a replay which you can't decrypt.

The problem here is this is a race condition. It is a race condition that can at least be worked around (for example if client has an exception it can manually set the cancel token to make the library truly abort out).
It is a race condition as our code is not synchronous the send message call is async. This means that if the aborts out fast enough (due to the exception) the server doesn't get the confirmation, but if there was another try catch at play (that say logs the exception to a remote network log server before aborting) it could.

I am not certain that I understand. The client's callback is called, and unless execution returns from the callback it is not confirmed. The client can chose to

  • call both async and sync functions
  • save the data and process it later (so you won't need a replay)
  • terminate the process (if your storage or memory is running full, and want a replay)
  • not return until the issues are solved (while (!TrySave()) { }-style)

If it does return, the client has had plenty of time to handle the message and we confirm it. If the confirmation doesn't make it to the server it does not matter - Decrypt will tell you that it is a duplicate message.

Also, I think even if OnMessage through TaskCancelled it would still send, the OnMessage needs to specifically cancel that token out.

Oh you are absolutely correct, you'd have to cancel the pipe's token.

@mitchcapper
Copy link
Contributor Author

Ah! Soooo the reason this is (potentially) important is that if an exception is thrown during the decrypt Callback (so before the the counter is updated) this could ack the message on the server despite the fact the client could try again.

Here is a sample flow that would be bad:
Client calls Decrypt with a callback.
Decrypt starts process, calls callback with data (before counter is incremented)
Client cannot save off the data due to an error (say cannot reach database) and throws an exception
The finally handler kicks off trying to ack the message to the server.
Now:
If the process exits fast enough (from the exception) the server ack will not go through.
If the process does not exit fast enough (for example if an upper level exception handle logs the exception before exiting) the message is acked and lost forever.

A user can work around this by try/catching in the OnMessage handler if an exception happens then manually setting the cancel token and then throwing it.

@Trolldemorted
Copy link
Contributor

Here is a sample flow that would be bad:
Client calls Decrypt with a callback.
Decrypt starts process, calls callback with data (before counter is incremented)
Client cannot save off the data due to an error (say cannot reach database) and throws an exception
The finally handler kicks off trying to ack the message to the server.
Now:
If the process exits fast enough (from the exception) the server ack will not go through.
If the process does not exit fast enough (for example if an upper level exception handle logs the exception before exiting) the message is acked and lost forever.

If your storage is unavailable for a moment, you should either wait until it comes back (so the OnMessage callback won't finish and libsignal won't ack the message until OnMessage does finish) or cancel the token and throw anything, which will close the pipe and not ack the message.

But why does an exception in Decrypt make your process exit? It really should not, every message should be acked unless you cancel the token or kill the environment. Are we talking about normal managed exceptions or a call to Environment.Exit?

@mitchcapper
Copy link
Contributor Author

Sorry we never finished this discussion. Normal managed exceptions. The problem is the potential expectation that throwing an exception would terminate right away and further action would not occur, instead the try/catch with ack causes something unexpected. There are work around (setting the cancel token, or hard killing the process). I think avoiding the exception handling might be a good strategy but also happy to finish this callback method support (as it is in the native code anyway) and remove the async changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants