Grafana queries cause workers to stuck #21341
Comments
Very likely the issue is caused by the Jackson serializer; it's described in #21356.
@davseitsev The proposed solution involves modifying the HTTPServer code to introduce a timeout if a thread stays idle for more than x minutes. Would this modification go into a new module containing the HTTPServer class from the specific version of the simple client HTTP server used by the JMX Prometheus Java agent?
I would like to work on this.
It might be related to issue #21512.
The other issue mentions a workaround (
The fix in #21356 significantly improves the situation, but the problem still reproduces. According to the latest comments, there are still some cases where a JSON factory is created with LockFreePool, so the fix in version 444 doesn't cover every case, and that could be the reason.
After upgrading from Trino 409 to 443, queries from Grafana started to kill the Trino cluster.
How it happens: I open a dashboard with about 20 graphs and refresh it 5-6 times. The number of queries on the cluster from the grafana user keeps growing until the number of Trino workers starts to decrease. The total number of running queries reaches 70-80.
Then queries start to fail with PAGE_TRANSPORT_TIMEOUT, PAGE_TRANSPORT_ERROR or TOO_MANY_REQUESTS_FAILED.
Disappeared workers do not die, they just stop accepting any HTTP calls.
In the worker logs I can see a lot of errors like this:
10.42.109.99 is the IP of the hanging node. When I run curl http://10.42.109.99:8181/v1/info/state it never finishes.
Thread dump shows that almost all threads in the http-worker pool are in WAITING state. Thread dump is attached:

trino-worker-D0328-T1236-20978.tdump.txt.gz.
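The manual curl check can be scripted so that a hung worker is reported instead of blocking the caller. This is a minimal sketch, assuming only what is stated in this report (the /v1/info/state path and port 8181); the function name `probe_worker` and the 5-second default timeout are made up for illustration and are not part of Trino.

```python
import socket
import urllib.error
import urllib.request


def probe_worker(base_url: str, timeout_s: float = 5.0) -> str:
    """Return the body of /v1/info/state, or 'TIMEOUT' / 'ERROR: ...'.

    Hypothetical helper: a worker that accepts the connection but never
    answers is reported as 'TIMEOUT' instead of hanging like plain curl.
    """
    url = base_url.rstrip("/") + "/v1/info/state"
    # Bypass any proxy from the environment; workers are reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return resp.read().decode("utf-8").strip()
    except socket.timeout:
        # Raised while waiting for the response (the hang described here).
        return "TIMEOUT"
    except urllib.error.URLError as exc:
        # Connection-phase failures are wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "TIMEOUT"
        return f"ERROR: {exc.reason}"
    except OSError as exc:
        return f"ERROR: {exc}"


if __name__ == "__main__":
    # IP and port taken from this report; adjust for your own cluster.
    print(probe_worker("http://10.42.109.99:8181"))
```

Running this against every worker in a loop should make it easy to see which nodes have stopped answering HTTP while their process is still alive.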
At the same time, the heap dump shows that the http-worker pool contains over 3M tasks:
As HTTP calls never finish, the number of open connections keeps growing:
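The combination of WAITING threads and a huge task backlog matches the general pattern of a fixed-size pool whose workers are all blocked: new requests are still accepted into the queue, but nothing drains it. The sketch below only illustrates that pattern with Python's ThreadPoolExecutor as a stand-in; it is not Trino's http-worker pool, and `simulate_stuck_pool` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def simulate_stuck_pool(workers: int, submitted: int) -> tuple[int, int]:
    """Block every worker, submit tasks, and return (running, queued)."""
    release = threading.Event()
    started = threading.Semaphore(0)

    def handler() -> str:
        started.release()   # signal that a worker picked this task up
        release.wait()      # stands in for a thread parked in WAITING
        return "done"

    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(handler) for _ in range(submitted)]

    # Wait until every worker thread is occupied by a blocked handler.
    for _ in range(min(workers, submitted)):
        started.acquire()

    running = sum(f.running() for f in futures)
    queued = sum(not f.running() and not f.done() for f in futures)

    # Unblock the workers so the simulation can shut down cleanly.
    release.set()
    pool.shutdown(wait=True)
    return running, queued


if __name__ == "__main__":
    running, queued = simulate_stuck_pool(workers=2, submitted=10)
    print(f"running={running} queued={queued}")
```

In the sketch the backlog stays small because submission stops; on a live worker, clients keep retrying and opening connections, which would be consistent with the queue and connection counts growing without bound.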

Query example:
But each graph sends its own query, so I'm not sure I need to provide all of them.
We use the latest version of the Grafana plugin (1.0.6): https://github.com/trinodb/grafana-trino
Let me know if there is any additional information I can provide.