Dealing with panics #263


Open
Nemo157 opened this issue May 27, 2019 · 14 comments

Nemo157 commented May 27, 2019

Currently Tide doesn't provide any way to deal with panics in handlers; they just unwind up to the executor and (at least in the case of Tokio) kill the executor's thread. During the unwinding the TCP connection is dropped and the client gets something like "The connection was reset". The application as a whole keeps running (I'm not sure what happens to other requests currently being handled on the same worker thread).

My first thought on dealing with this is to have a middleware that uses catch_unwind to catch the panic and return a minimal 500 to the client. (I have prototyped this and it works very easily.)
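For reference, a minimal sketch of that idea, written against futures' FutureExt::catch_unwind rather than Tide's actual middleware trait (the Response type and the handler here are just stand-ins):

```rust
// Sketch only: wraps an arbitrary handler future so a panic becomes a plain
// 500 instead of unwinding into the executor and dropping the connection.
// `Response` is a stand-in type, not Tide's.
use futures::FutureExt;
use std::future::Future;
use std::panic::AssertUnwindSafe;

struct Response {
    status: u16,
    body: String,
}

async fn catch_panics<F>(handler: F) -> Response
where
    F: Future<Output = Response>,
{
    match AssertUnwindSafe(handler).catch_unwind().await {
        Ok(response) => response,
        Err(_panic) => Response {
            status: 500,
            body: "internal server error".into(),
        },
    }
}

fn main() {
    // A handler that panics, standing in for a real endpoint.
    let response = futures::executor::block_on(catch_panics(async {
        unimplemented!("endpoint not written yet")
    }));
    assert_eq!(response.status, 500);
    println!("{} {}", response.status, response.body);
}
```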

The other part of this is that when I encountered it I had locked a Mutex stored in the State; that Mutex subsequently returned PoisonError on every access, so my application was effectively dead. Maybe when a panic is encountered it should be proxied through to the future returned by serve and resumed on the main thread, to try and kill the entire application. Alternatively, this could be dealt with at the application level using "health checks" with support from a service coordinator (like Kubernetes). Since it depends on the use case, it may make sense to provide middleware supporting different options and allow the user to choose.

yoshuawuyts commented:

@Nemo157 I like the idea of doing a basic thing where we terminate all connections with a 500. I'm a big fan of the panic = abort model, where if we're panicking it means something has gone terminally wrong, and we need to reboot.

I'm a bit hesitant about any more elaborate schemes though, especially if there's a risk we might hang while we try to return the 500s (e.g. network timeouts). When a panic occurs we should terminate, and let the supervisor restart the application. If panics occur frequently, supervisors (e.g. Kubernetes) can issue an auto-rollback to give developers time to investigate the failure without it affecting production.
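For what it's worth, one way to approximate that model even when the binary is compiled with panic = unwind is a process-wide hook that logs and then aborts; a sketch of the idea, not anything Tide does today:

```rust
// Sketch: make every panic fatal so the supervisor restarts the process,
// regardless of the compiled panic strategy.
use std::panic;
use std::process;

fn main() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        // Let the default hook print the panic message and location first...
        default_hook(info);
        // ...then terminate immediately instead of unwinding.
        process::abort();
    }));

    // Any panic anywhere in the program now kills the whole process.
    panic!("something went terminally wrong");
}
```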

Re: poison errors -- parking-lot provides a way around poison errors specifically, but I'm not sure what to do about {dead,live}locks in general. I think your intuition is right, and we should have a way to provide status checks to any supervisor process so it can judge the application's status. But this indeed feels like a different topic than "dealing with panics". Perhaps we should open a separate issue for this?
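To make the poisoning behaviour concrete, a small self-contained illustration: std's Mutex poisons after a panic while the guard is held, but the data can still be recovered from the PoisonError (parking_lot's Mutex simply doesn't track poisoning at all):

```rust
// Sketch: what happens to a std Mutex after a panic while the lock is held,
// and how the value can still be recovered if the application chooses to.
use std::panic::{self, AssertUnwindSafe};
use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0u32);

    // Panic while holding the guard, which poisons the mutex.
    let _ = panic::catch_unwind(AssertUnwindSafe(|| {
        let _guard = counter.lock().unwrap();
        panic!("handler panicked while holding the lock");
    }));

    // Subsequent locks now fail with PoisonError...
    assert!(counter.lock().is_err());

    // ...but the inner value is still reachable if you decide to recover.
    let recovered = counter
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    println!("recovered counter value: {}", *recovered);
}
```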


Nemo157 commented May 27, 2019

I brought up the poison error because it's a knock-on effect of panic unwinding. I don't know if there are other commonly used data structures you might store in State that become unusable during unwinding, but if there are, that seems like a good argument for the "proxy panics through to App::{serve,run}" approach, where a panic in any handler kills off the entire application (which can't currently be done just as a middleware). It sounds like you're in favour of this approach rather than the basic "a panic fails the current request, but the rest keep running" approach I prototyped.

(Opened #264 for more specifics on {dead,live}locked applications).


prasannavl commented May 27, 2019

@Nemo157 - Are we talking about panics below the Tide layer (as in hyper, the networking layers, etc.) or above it? (I think below Tide they are fairly self-contained and represent more serious failures which probably should, and would, crash.)

One of the things I've been experimenting with for quite some time is to have middleware (and basically everything else) return Result instead of Response. In this case, the very first middleware is the error handler -- the default handler can be opted out of -- and this default handler can simply handle the panics and return 500s. I personally think we should cover the entire surface area of Tide with Result; with the exception of language interop and unsafe code, it then becomes effectively infallible in the general case. Deadlocked applications will likely time out or return 500s, which is probably what we want in most scenarios -- not crashing the entire server on surprise edge cases.
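A condensed sketch of what that default error-handler layer could look like; the Response, Error, and endpoint types here are illustrative, not Tide's actual API:

```rust
// Sketch: the outermost layer converts any Err bubbling up from the layers
// below into a 500, and is meant to be swappable by the application.
struct Response {
    status: u16,
    body: String,
}

#[derive(Debug)]
struct Error(String);

// The default, replaceable error handler at the front of the chain.
fn error_handler(result: Result<Response, Error>) -> Response {
    match result {
        Ok(response) => response,
        Err(err) => Response {
            status: 500,
            body: format!("internal server error: {:?}", err),
        },
    }
}

// A fallible endpoint somewhere further down the chain.
fn lookup_user(id: u32) -> Result<Response, Error> {
    if id == 0 {
        return Err(Error("no such user".into()));
    }
    Ok(Response {
        status: 200,
        body: format!("user {}", id),
    })
}

fn main() {
    let response = error_handler(lookup_user(0));
    println!("{} {}", response.status, response.body); // 500 internal server error: ...
}
```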

With this approach, someone who wants different behaviour can easily handle panics manually in higher middleware layers, or even simply swap out the default error-handler middleware for their own.

Short version - I don't think we should be opinionated here at all, other than providing a default behaviour as part of the middleware stack that's easily replaceable. This is application-specific territory, and I don't think Tide should interfere.


Nemo157 commented May 27, 2019

Above Tide, in middleware or endpoints. In my case I was partway through implementing a feature and hit an unimplemented!() inside one of my endpoints.


prasannavl commented May 27, 2019

@Nemo157 - I think a strategy similar to ASP.NET Core or most JavaScript frameworks is what I'm proposing here. Have multiple preset middlewares at the beginning: one exception handler (which here would be the panic handler doing catch_unwind for the exceptional cases), and one general error handler after it that handles the Err part of Result<T, E>.

Also, if there's a way to get the panic profile of the compilation at runtime, we can simply use that: preset the handler when the panic mode is set to unwind, and leave the handler out and let the application crash when it's set to abort. If it's set to abort it's likely going to crash anyway, but my point is that doing this in layered middlewares seems like the best way to handle it.
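(As an aside: on newer compilers the compiled panic strategy can be queried with cfg!(panic = "unwind"), which wasn't available when this thread was written; a minimal illustration of that check:)

```rust
fn main() {
    // `cfg!(panic = "unwind")` reports the panic strategy the binary was
    // compiled with (stabilized well after this thread, in Rust 1.60).
    // A framework could use it to decide whether a catch_unwind-based
    // handler is even meaningful, or whether panics will abort regardless.
    if cfg!(panic = "unwind") {
        println!("panic = unwind: a catch_unwind middleware can recover");
    } else {
        println!("panic = abort: a panic will terminate the process");
    }
}
```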

Most applications and supervisors (Kubernetes, Swarm) do health checks anyway that detect 500s and restart the application. Applications could be running other logic that's not just a pure server, or even proper cleanups to shut down the application in case of multiple failures. So I don't think Tide should crash the entire application, unless that's the behaviour chosen by the application.

Personally, when I'm running a server loop, I expect the server to never crash other than on very serious memory allocation or corruption errors. In your example instance, granted, the State is corrupted - but it's still up to the application to determine how much of the application logic is affected by this. For instance, there could be a completely unrelated endpoint that does other things that don't depend on the state, or even one that does depend on it - say, an endpoint I use to remotely examine the corrupted state. So I think this always has to be the application's choice - if we let the server crash, it effectively is no longer a choice at all.


Nemo157 commented May 27, 2019

I can open a PR for my middleware that transforms panics into 500s (which can easily be changed to return the panic as an error if middleware starts returning Results), and take a look at whether it's feasible to implement a middleware that stops the application (I have an idea of how to do it as a side-channel so that it will work with tide::serve without having to change tide::App at all).
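Roughly the shape of that side-channel idea: the wrapper catches the panic and forwards the payload to whichever task is driving the server, which can then shut down or resume the unwind. Names and the channel choice here are illustrative, not the actual tide::serve integration:

```rust
// Sketch: a panic caught around a handler is forwarded over a channel to the
// task that owns the server loop, instead of being swallowed in place.
use futures::FutureExt;
use std::any::Any;
use std::panic::AssertUnwindSafe;
use tokio::sync::mpsc;

type PanicPayload = Box<dyn Any + Send + 'static>;

// A handler that blows up, standing in for a real endpoint.
async fn faulty_handler() -> String {
    unimplemented!("endpoint not written yet")
}

async fn run_handler(tx: mpsc::UnboundedSender<PanicPayload>) {
    if let Err(payload) = AssertUnwindSafe(faulty_handler()).catch_unwind().await {
        // Report the panic to the "serve" task instead of swallowing it.
        let _ = tx.send(payload);
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tokio::spawn(run_handler(tx));

    // The task that would normally be driving the app: as soon as any handler
    // reports a panic it can start a clean shutdown, or resume the unwind on
    // the "main" task as discussed above.
    if let Some(payload) = rx.recv().await {
        eprintln!("a handler panicked, shutting down");
        std::panic::resume_unwind(payload);
    }
}
```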


prasannavl commented May 27, 2019

@Nemo157 - do we really need to stop the application? I mean, isn't this logic best left in one place -- that place being the futures runtime? The Tokio threadpool, for instance, has a panic handler (and the current-thread runtime bubbles panics up, if I'm right?), and I suspect runtime will likely need one eventually as well. Won't they do the job of bubbling up and exiting the application when required?

Should this decision-making really go into Tide - unless, of course, it's purely contained in the middleware layer, in which case any and all things can be explored at will?


Nemo157 commented May 27, 2019

I want to see exactly what is possible in this space with the current design: whether there are any issues with proxying a panic back from a middleware into the context in which the tide::App is being run, and whether there is some way to attempt a clean shutdown of the tide::App that allows all outstanding requests to complete (also useful for better SIGINT/SIGTERM handling; after quick testing it appears that ^C currently just closes the socket if a handler is blocked).

My goal is just to make sure that there is some way to implement any sane behaviour that a user might want, not to change what Tide itself does by default.


prasannavl commented May 27, 2019

Ah, that sounds pretty cool! I'm keen to explore this as well.

Currently I do an ugly workaround for clean shutdown with a custom hyper backend, using hyper's graceful-shutdown channel and manually connecting it to SIGINT/SIGTERM. One alternative I was thinking of experimenting with, instead of http-rs/http-service#11, was to do that in one of the first middlewares - that makes it agnostic of the backend or http-service, but it ends up with redundant checks on the hot path.
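For reference, the wiring against hyper directly looks roughly like this (hyper 0.14-style API, only handling Ctrl-C; not the custom backend mentioned above):

```rust
// Sketch: hyper's built-in graceful shutdown, triggered by Ctrl-C. In-flight
// requests are allowed to complete before the server future resolves.
use std::convert::Infallible;
use std::net::SocketAddr;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

async fn handle(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("hello")))
}

#[tokio::main]
async fn main() {
    let make_svc =
        make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(handle)) });

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    let server = Server::bind(&addr).serve(make_svc);

    // Resolve the shutdown signal when Ctrl-C (SIGINT) arrives.
    let graceful = server.with_graceful_shutdown(async {
        tokio::signal::ctrl_c()
            .await
            .expect("failed to listen for ctrl-c");
    });

    if let Err(err) = graceful.await {
        eprintln!("server error: {}", err);
    }
}
```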


Nemo157 commented Nov 11, 2019

Random thought I just had related to this:

Rather than internally attempting to handle panics and return a 500, Tide (or more likely the default http-service implementation) could transparently implement a multi-process model: the initial process binds the port and spawns an inner process that it proxies all requests to; when the inner process dies, the outer process closes the port, returns a 500 on all outstanding requests, and dies as well. That would allow cleanly responding to application-level panics, even with panic = abort, and would just leave panics in the HTTP layers used by the outer process unhandled.
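Not the proxying design described above (no port handoff or 500-draining), but the bare bones of the outer/inner split can be sketched as a supervisor loop; "./my-tide-app" is a placeholder binary name:

```rust
// Sketch: an outer process that respawns the inner server binary whenever it
// dies, which also covers `panic = abort`. Request proxying and returning
// 500s for in-flight requests are left out.
use std::process::Command;
use std::thread;
use std::time::Duration;

fn main() {
    loop {
        let status = Command::new("./my-tide-app")
            .status()
            .expect("failed to spawn inner server process");

        if status.success() {
            // The inner process shut down cleanly: stop supervising.
            break;
        }

        eprintln!("inner process died ({}), restarting in 1s", status);
        thread::sleep(Duration::from_secs(1));
    }
}
```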


prasannavl commented Nov 11, 2019

@Nemo157 - Sounds like a good case for an additional model to experiment with, rather than a default replacement for what to do when a panic happens in the current process and current thread. If the handling can be abstracted to the point where both are provided out of the box and either can be chosen as easily as adding a simple middleware, I think that's a good story for Tide.

However, focusing on one may allow additional optimizations, like handing over the socket instead of proxying (or SO_REUSEADDR/SO_REUSEPORT can help on OSes where applicable, maybe? I don't recall the precise behaviour), but I'd still like to have an in-process story.


dzrtc commented Jan 1, 2020

Has the tide-panic middleware been lost? It's not in the middleware folder.


rw commented Mar 12, 2020

Where is tide-panic? :-)


drewish commented Mar 16, 2020

Looks like it got unwound in edadcc3; #336 has some context.
