Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement backoff for reconnect #112

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,14 @@
## Next
### Enhancements
- Added `backoff` feature for reconnecting
- `send_frame/2` is no longer processed during `opening`, ie. it is queued until `connected` or failure to connect.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a breaking change, and potentially a large one in my opinion.

I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.

Adding another method or keyword option would be my preference but that means more stuff to maintain. It could be useful though. I remain somewhat on the edge.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm torn on this. I understand that it could make the library more convenient, but it could also cause a message to be sent prematurely before some kind of required connection setup was finished.

Can you elaborate on this? Maybe I'm missing something but I'm not sure when a message could be sent prematurely because of this?

Initially I had failing tests since they currently assume that when the process is started, the connection is connected and able to send messages. Before I realized that I could make this change with selective receive so that the send_frame message waits in the inbox until the connection is connected, I added a feature WebSockex.await_status(pid, :connected) so that you could do the following:

{:ok, pid} = WebSockex.start_link(url)
WebSockex.await_status(pid, :connected)
WebSockex.send_frame(pid, message)

However I removed this and instead relied on selective receive because it means less work for the user and one less change in this PR.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_connect is used to queue up initial messages. By casting to itself, a process can queue up messages to send in a specific order after startup.

By handling send_frame after the connection is established, we actually put ourselves at much higher risk for a race condition. A frame could be sent before the initial connection messages were sent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I have 2 thoughts on this.

  1. I'm just exploring how this could work better ignoring breaking changes. Maybe first read my comment on handle_send_result. If instead of send_frame we have handle_send_result and handle_call which would allow a user to implement their own send_frame, and we have init, then the user could create a message queue in their state during init for startup messages and then handle_call could add messages to the queue as well and the queue can be processed in the right order starting from handle_connect. I also think that adding handle_continue support would be useful here. In fact if we just add handle_continue then this can be used from handle_connect to process initial messages before send_frame messages by calling :continue in handle_connect.

  2. Without making breaking changes, we need a way to wait for the connection to be established before calling send_frame. This can either be done by adding await_status as above, or by adding handle_call which has access to the current status. The user can theoretically keep track of the current status with handle_connect and handle_disconnect in their own state, and implement handle_call on top of handle_cast, but it's quite a lot of extra stuff for each user to add which is already a solved problem in a plain GenServer. On the one hand I kinda like adding await_status natively because it requires no extra work from the user, but on the other hand I'm really seeing that everything could be handled much better with more flexibility if WebSockex is on par with GenServer, namely having init, handle_call, handle_continue.

I was trying to solve all the issues I ran into with adding backoff without adding a lot of other features, but it seems that we'd be better served by first adding the other features that we can then more easily build on here.


### Breaking Changes
- Connection is always established asynchronously to process init
- `async` option is removed since it's now redundant
- `handle_initial_conn_failure` option is removed since it's no longer applicable
Comment on lines +7 to +9
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually ruins one of the ways I'm using WebSockex myself. 😢

In my use case, there are a couple of other WebSockex processes that depend on the connection being open before they start then when things disconnect the supervisor uses a :rest_for_one policy to reconnect.

Originally, reconnecting was actually more of a feature request that I challenged myself to do. I rarely used it myself. 😣

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the reason why I removed sync connect is because as far as I know, it's bad practice to handle a potentionally long task in the init stage of a special process, and e.g. in a GenServer would normally be done by calling :continue in the init callback. The reason for this is that it can block the entire supervisor tree startup until init returns.
In the case of WebSockex, this can happen if:

  • the server you're connecting to is blocked by a firewall which drops all packets, so it hangs either until tcp timeout or an application timeout (I think WebSockex has either a 5s or 30s timeout)
  • you combine sync connect with infinite reconnect, which means init never returns.

You can also observe the result of this by making a simple WebSockex client with sync connect and infinite reconnect and just doing a WebSockex.start inside an iex session. The entire iex session locks up that you can't even do anything.

I prefer not to give users (of this library) footguns and unexpected behaviour which is why I decided to remove the ability to do this.

I can see that the biggest issue is actually the combination of sync connect and reconnect and that these 2 features are actually at odds of co-existing at the same time.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the reason why I removed sync connect is because as far as I know, it's bad practice to handle a potentionally long task in the init stage of a special process

This can definitely be true in the Application supervisor, but sometimes even that is necessary. I don't think you can consider it a bad practice, but it can be annoying to debug. On the other hand, it can also fit really well into the failure as feature mentality.

But there are people who want "footguns" simply because it fits into their application flow easier. That being said, handle_initial_conn_failure is a long and slightly convoluted option that is disabled by default for that reason. It was a requested feature, by multiple people iirc.

But my main argument will be that async unique startup is a feature. I don't expect :gen_tcp to startup then give me an error it can't connect after the incredibly long TCP timeout when I've already done other things with the assumption that I have a valid connection. I wouldn't expect WebSockex to startup without having a WebSocket connection established.

- `handle_disconnect/2` no longer accepts `:reconnect` as a result. This is replaced with `:backoff` with a required timeout.

## 0.4.3
### Enhancements
- Added `ssl_options` documentation
Expand Down
5 changes: 5 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,11 @@ defmodule WebSocketExample do
IO.puts "Sending #{type} frame with payload: #{msg}"
{:reply, frame, state}
end

def handle_disconnect(_, state) do
IO.puts "Connection failure. Backing off for 1 second before reconnecting"
{:backoff, 1_000, state}
end
end
```

Expand Down
57 changes: 57 additions & 0 deletions examples/backoff_client.exs
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
defmodule BackoffClient do
use WebSockex
require Logger

def start_link(opts \\ []) do
WebSockex.start_link("wss://echo.websocket.org/?encoding=text", __MODULE__, :fake_state, opts)
end

@spec echo(pid, String.t) :: :ok
def echo(client, message) do
Logger.info("Sending message: #{message}")
WebSockex.send_frame(client, {:text, message})
end

def handle_connect(_conn, state) do
Logger.info("Connected!")
{:ok, state}
end

def handle_frame({:text, "Can you please reply yourself?" = msg}, :fake_state) do
Logger.info("Received Message: #{msg}")
msg = "Sure can!"
ref = make_ref()
Logger.info("Sending message: #{msg} with ref #{inspect(ref)}")
{:reply, {:text, msg}, ref, :fake_state}
end
def handle_frame({:text, "Close the things!" = msg}, :fake_state) do
Logger.info("Received Message: #{msg}")
{:close, :fake_state}
end
def handle_frame({:text, msg}, :fake_state) do
Logger.info("Received Message: #{msg}")
{:ok, :fake_state}
end

def handle_disconnect(%{attempt_number: attempt, reason: reason}, state) do
Logger.warn("Websocket connection failed because: #{inspect(reason)}. Attempt: #{attempt}")
{:backoff, 1_000, state}
end

def handle_send_result(result, frame, send_key, state) do
Logger.debug("Send result for key: #{inspect(send_key)}, frame: #{inspect(frame)}, result: #{inspect(result)}")
{:ok, state}
end
end

{:ok, pid} = BackoffClient.start_link()

BackoffClient.echo(pid, "Yo Homies!")
BackoffClient.echo(pid, "This and That!")
BackoffClient.echo(pid, "Can you please reply yourself?")

Process.sleep 1000

BackoffClient.echo(pid, "Close the things!")

Process.sleep 1500
Loading