Dealing with panics #263


Open
Nemo157 opened this issue May 27, 2019 · 14 comments

Nemo157 commented May 27, 2019

Currently Tide doesn't provide any way to deal with panics in handlers; they just unwind up to the executor and (at least in the case of Tokio) kill the executor's thread. During the unwinding the TCP connection is dropped and the client gets something like "The connection was reset". The application as a whole keeps running (I'm not sure what happens to other requests currently being handled on the same worker thread).

My first thought on dealing with this is to have a middleware that uses catch_unwind to catch the panic and return a minimal 500 to the client. (I have prototyped this and it works very easily.)
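For reference, a minimal sketch of that idea, written against futures' FutureExt::catch_unwind rather than Tide's actual middleware trait (the Response type and the handler here are just stand-ins):

```rust
// Sketch only: wraps an arbitrary handler future so a panic becomes a plain
// 500 instead of unwinding into the executor and dropping the connection.
// `Response` is a stand-in type, not Tide's.
use futures::FutureExt;
use std::future::Future;
use std::panic::AssertUnwindSafe;

struct Response {
    status: u16,
    body: String,
}

async fn catch_panics<F>(handler: F) -> Response
where
    F: Future<Output = Response>,
{
    match AssertUnwindSafe(handler).catch_unwind().await {
        Ok(response) => response,
        Err(_panic) => Response {
            status: 500,
            body: "internal server error".into(),
        },
    }
}

fn main() {
    // A handler that panics, standing in for a real endpoint.
    let response = futures::executor::block_on(catch_panics(async {
        unimplemented!("endpoint not written yet")
    }));
    assert_eq!(response.status, 500);
    println!("{} {}", response.status, response.body);
}
```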

The other part of this is that when I encountered it I had locked a Mutex stored in the State; that Mutex subsequently returned PoisonError on every access, so my application was effectively dead. Maybe when a panic is encountered it should be proxied through to the future returned by serve and resumed on the main thread, to try and kill the entire application. Alternatively, this could be dealt with at the application level using "health checks" with support from a service coordinator (like Kubernetes). Since it depends on the use case, it may make sense to provide middleware supporting different options and allow the user to choose.

yoshuawuyts commented:

@Nemo157 I like the idea of doing a basic thing where we terminate all connections with a 500. I'm a big fan of the panic = abort model, where if we're panicking it means something has gone terminally wrong, and we need to reboot.

I'm a bit hesitant about any more elaborate schemes though, especially if there's a risk we might hang while we try to return the 500s (e.g. network timeouts). When a panic occurs we should terminate, and let the supervisor restart the application. If panics occur frequently, supervisors (e.g. Kubernetes) can issue an auto-rollback to give developers time to investigate the failure without it affecting production.
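For what it's worth, one way to approximate that model even when the binary is compiled with panic = unwind is a process-wide hook that logs and then aborts; a sketch of the idea, not anything Tide does today:

```rust
// Sketch: make every panic fatal so the supervisor restarts the process,
// regardless of the compiled panic strategy.
use std::panic;
use std::process;

fn main() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Let the default hook print the panic message and location first...
        default_hook(info);
        // ...then terminate immediately instead of unwinding.
        process::abort();
    }));

    // Any panic anywhere in the program now kills the whole process.
    panic!("something went terminally wrong");
}
```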

Re: poison errors -- parking-lot provides a way around poison errors specifically, but I'm not sure what to do about {dead,live}locks in general. I think your intuition is right, and we should have a way to provide status checks to any supervisor process so it can judge the application's status. But this indeed feels like a different topic than "dealing with panics". Perhaps we should open a separate issue for this?
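To make the poisoning behaviour concrete, a small self-contained illustration: std's Mutex poisons after a panic while the guard is held, but the data can still be recovered from the PoisonError (parking_lot's Mutex simply doesn't track poisoning at all):

```rust
// Sketch: what happens to a std Mutex after a panic while the lock is held,
// and how the value can still be recovered if the application chooses to.
use std::panic::{self, AssertUnwindSafe};
use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0u32);

    // Panic while holding the guard, which poisons the mutex.
    let _ = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = counter.lock().unwrap();
        panic!("handler panicked while holding the lock");
    }));

    // Subsequent locks now fail with PoisonError...
    assert!(counter.lock().is_err());

    // ...but the inner value is still reachable if you decide to recover.
    let recovered = counter
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    println!("recovered counter value: {}", *recovered);
}
```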


Nemo157 commented May 27, 2019

I brought up the poison error because it's a knock-on effect of panic unwinding. I don't know if there are other commonly used data structures you might store in State that become unusable during unwinding, but if there are, that seems like a good argument for the "proxy panics through to App::{serve,run}" approach, where a panic in any handler kills off the entire application (which can't currently be done just as a middleware). It sounds like you're in favour of this approach rather than the basic "a panic fails the current request, but the rest keep running" approach I prototyped.

(Opened #264 for more specifics on {dead,live}locked applications).


prasannavl commented May 27, 2019

@Nemo157 - Are we talking about panics below the Tide layer (as in hyper, the networking layers, etc.) or above it? (I think below Tide they are fairly self-contained and represent more serious failures which probably should, and would, crash.)

One of the things I've been experimenting with for quite some time is to have middleware (and basically everything else) return Result instead of Response. In this case, the very first middleware is the error handler -- the default handler can be opted out of -- and this default handler can simply handle the panics and return 500s. I personally think we should cover the entire surface area of Tide with Result; with the exception of language interop and unsafe code, it then becomes effectively infallible in the general case. Deadlocked applications will likely time out or return 500s, which is probably what we want in most scenarios -- not crashing the entire server on surprise edge cases.
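A condensed sketch of what that default error-handler layer could look like; the Response, Error, and endpoint types here are illustrative, not Tide's actual API:

```rust
// Sketch: the outermost layer converts any Err bubbling up from the layers
// below into a 500, and is meant to be swappable by the application.
struct Response {
    status: u16,
    body: String,
}

#[derive(Debug)]
struct Error(String);

// The default, replaceable error handler at the front of the chain.
fn error_handler(result: Result<Response, Error>) -> Response {
    match result {
        Ok(response) => response,
        Err(err) => Response {
            status: 500,
            body: format!("internal server error: {:?}", err),
        },
    }
}

// A fallible endpoint somewhere further down the chain.
fn lookup_user(id: u32) -> Result<Response, Error> {
    if id == 0 {
        return Err(Error("no such user".into()));
    }
    Ok(Response {
        status: 200,
        body: format!("user {}", id),
    })
}

fn main() {
    let response = error_handler(lookup_user(0));
    println!("{} {}", response.status, response.body); // 500 internal server error: ...
}
```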

With this approach, someone who wants different behaviour can easily handle panics manually in higher middleware layers, or even simply swap out the default error-handler middleware for their own.

Short version - I don't think we should be opinionated here at all, other than providing a default behaviour as part of the middleware stack that's easily replaceable. This is application-specific territory, and I don't think Tide should interfere.


Nemo157 commented May 27, 2019

Above Tide, in middleware or endpoints. In my case I was partway through implementing a feature and hit an unimplemented!() inside one of my endpoints.


prasannavl commented May 27, 2019

@Nemo157 - I think a strategy similar to ASP.NET Core or most JavaScript frameworks is what I'm proposing here. Have multiple preset middlewares at the beginning: one exception handler (which here would be the panic handler doing catch_unwind for the exceptional cases), and one general error handler after it that handles the Err part of Result<T, E>.

Also, if there's a way to get the panic profile of the compilation at runtime, we can simply use that: preset the handler when the panic mode is set to unwind, and leave the handler out and let the application crash when it's set to abort. If it's set to abort it's likely going to crash anyway, but my point is that doing this in layered middlewares seems like the best way to handle it.
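(As an aside: on newer compilers the compiled panic strategy can be queried with cfg!(panic = "unwind"), which wasn't available when this thread was written; a minimal illustration of that check:)

```rust
fn main() {
    // `cfg!(panic = "unwind")` reports the panic strategy the binary was
    // compiled with (stabilized well after this thread, in Rust 1.60).
    // A framework could use it to decide whether a catch_unwind-based
    // handler is even meaningful, or whether panics will abort regardless.
    if cfg!(panic = "unwind") {
        println!("panic = unwind: a catch_unwind middleware can recover");
    } else {
        println!("panic = abort: a panic will terminate the process");
    }
}
```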

Most applications and supervisors (Kubernetes, Swarm) do health checks anyway that detect 500s and restart the application. Applications could be running other logic that's not just a pure server, or even proper cleanups to shut down the application in case of multiple failures. So I don't think Tide should crash the entire application, unless that's the behaviour chosen by the application.

Personally, when I'm running a server loop, I expect the server to never crash other than on very serious memory allocation or corruption errors. In your example instance, granted, the State is corrupted - but it's still up to the application to determine how much of the application logic is affected by this. For instance, there could be a completely unrelated endpoint that does other things that don't depend on the state, or even one that does depend on it - say, an endpoint I use to remotely examine the corrupted state. So I think this always has to be the application's choice - if we let the server crash, it effectively is no longer a choice at all.


Nemo157 commented May 27, 2019

I can open a PR for my middleware that transforms panics into 500s (which can easily be changed to return the panic as an error if middleware starts returning Results), and take a look at whether it's feasible to implement a middleware that stops the application (I have an idea of how to do it as a side-channel so that it will work with tide::serve without having to change tide::App at all).
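Roughly the shape of that side-channel idea: the wrapper catches the panic and forwards the payload to whichever task is driving the server, which can then shut down or resume the unwind. Names and the channel choice here are illustrative, not the actual tide::serve integration:

```rust
// Sketch: a panic caught around a handler is forwarded over a channel to the
// task that owns the server loop, instead of being swallowed in place.
use futures::FutureExt;
use std::any::Any;
use std::panic::AssertUnwindSafe;
use tokio::sync::mpsc;

type PanicPayload = Box<dyn Any + Send + 'static>;

// A handler that blows up, standing in for a real endpoint.
async fn faulty_handler() -> String {
    unimplemented!("endpoint not written yet")
}

async fn run_handler(tx: mpsc::UnboundedSender<PanicPayload>) {
    if let Err(payload) = AssertUnwindSafe(faulty_handler()).catch_unwind().await {
        // Report the panic to the "serve" task instead of swallowing it.
        let _ = tx.send(payload);
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tokio::spawn(run_handler(tx));

    // The task that would normally be driving the app: as soon as any handler
    // reports a panic it can start a clean shutdown, or resume the unwind on
    // the "main" task as discussed above.
    if let Some(payload) = rx.recv().await {
        eprintln!("a handler panicked, shutting down");
        std::panic::resume_unwind(payload);
    }
}
```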


prasannavl commented May 27, 2019

@Nemo157 - do we really need to stop the application? I mean, isn't this logic best left in one place -- that place being the futures runtime? The Tokio threadpool, for instance, has a panic handler (and the current-thread runtime bubbles panics up, if I'm right?), and I suspect runtime will likely need one eventually as well. Won't they do the job of bubbling up and exiting the application when required?

Should this decision-making really go into Tide - unless, of course, it's purely contained in the middleware layer, in which case any and all things can be explored at will?


Nemo157 commented May 27, 2019

I want to see exactly what is possible in this space with the current design: whether there are any issues with proxying a panic back from a middleware into the context in which the tide::App is being run, and whether there is some way to attempt a clean shutdown of the tide::App that allows all outstanding requests to complete (also useful for better SIGINT/SIGTERM handling; after quick testing it appears that ^C currently just closes the socket if a handler is blocked).

My goal is just to make sure that there is some way to implement any sane behaviour that a user might want, not to change what Tide itself does by default.


prasannavl commented May 27, 2019

Ah, that sounds pretty cool! I'm keen to explore this as well.

Currently I do an ugly workaround for clean shutdown with a custom hyper backend, using hyper's graceful-shutdown channel and manually connecting it to SIGINT/SIGTERM. One alternative I was thinking of experimenting with, instead of http-rs/http-service#11, was to do that in one of the first middlewares - that makes it agnostic of the backend or http-service, but it ends up with redundant checks on the hot path.
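For reference, the wiring against hyper directly looks roughly like this (hyper 0.14-style API, only handling Ctrl-C; not the custom backend mentioned above):

```rust
// Sketch: hyper's built-in graceful shutdown, triggered by Ctrl-C. In-flight
// requests are allowed to complete before the server future resolves.
use std::convert::Infallible;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn handle(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("hello")))
}

#[tokio::main]
async fn main() {
    let make_svc =
        make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(handle)) });

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    let server = Server::bind(&addr).serve(make_svc);

    // Resolve the shutdown signal when Ctrl-C (SIGINT) arrives.
    let graceful = server.with_graceful_shutdown(async {
        tokio::signal::ctrl_c()
            .await
            .expect("failed to listen for ctrl-c");
    });

    if let Err(err) = graceful.await {
        eprintln!("server error: {}", err);
    }
}
```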


Nemo157 commented Nov 11, 2019

Random thought I just had related to this:

Rather than internally attempting to handle panics and return a 500, Tide (or more likely the default http-service implementation) could transparently implement a multi-process model: the initial process binds the port and spawns an inner process that it proxies all requests to; when the inner process dies, the outer process closes the port, returns a 500 on all outstanding requests, and dies as well. That would allow cleanly responding to application-level panics, even with panic = abort, and would just leave panics in the HTTP layers used by the outer process unhandled.
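Not the proxying design described above (no port handoff or 500-draining), but the bare bones of the outer/inner split can be sketched as a supervisor loop; "./my-tide-app" is a placeholder binary name:

```rust
// Sketch: an outer process that respawns the inner server binary whenever it
// dies, which also covers `panic = abort`. Request proxying and returning
// 500s for in-flight requests are left out.
use std::process::Command;
use std::thread;
use std::time::Duration;

fn main() {
    loop {
        let status = Command::new("./my-tide-app")
            .status()
            .expect("failed to spawn inner server process");

        if status.success() {
            // The inner process shut down cleanly: stop supervising.
            break;
        }

        eprintln!("inner process died ({}), restarting in 1s", status);
        thread::sleep(Duration::from_secs(1));
    }
}
```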


prasannavl commented Nov 11, 2019

@Nemo157 - Sounds like a good case for an additional model to experiment with, rather than a default replacement for what to do when a panic happens in the current process and current thread. If the handling can be abstracted to the point where both are provided out of the box and either can be chosen as easily as adding a simple middleware, I think that's a good story for Tide.

However, focusing on one may allow additional optimizations, like handing over the socket instead of proxying (or SO_REUSEADDR/SO_REUSEPORT can help on OSes where applicable, maybe? I don't recall the precise behaviour), but I'd still like to have an in-process story.


dzrtc commented Jan 1, 2020

Has the tide-panic middleware been lost? It's not in the middleware folder.


rw commented Mar 12, 2020

Where is tide-panic? :-)


drewish commented Mar 16, 2020

Looks like it got unwound in edadcc3; #336 has some context.
