
Grafana queries cause workers to get stuck #21341

Open
davseitsev opened this issue Apr 1, 2024 · 6 comments

@davseitsev

After upgrading from Trino 409 to 443, queries from Grafana started to kill the Trino cluster.

Here is how it happens: I open a dashboard with about 20 graphs and refresh it 5-6 times. The number of queries on the cluster from the grafana user keeps growing until the number of Trino workers starts to decrease. The total number of running queries reaches 70-80.
Then queries start to fail with PAGE_TRANSPORT_TIMEOUT, PAGE_TRANSPORT_ERROR or TOO_MANY_REQUESTS_FAILED.

The disappeared workers do not actually die; they just stop accepting any HTTP calls.
In the worker logs I can see a lot of errors like this:

WARN    node-state-poller-0     io.trino.metadata.RemoteNodeState       Node state update request to http://10.42.109.99:8181/v1/info/state has not returned in 10.01s
ERROR   page-buffer-client-callback-9   io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16 failed java.io.UncheckedIOException: Server refused connection: http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16
ERROR   page-buffer-client-callback-9 Request to delete http://10.42.108.117:8181/v1/task/20240401_095603_00639_k6uq3.3.26.0/results/4 failed java.util.concurrent.CancellationException: Task was cancelled
ERROR   page-buffer-client-callback-23 Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
ERROR   page-buffer-client-callback-13  io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
WARN    http-client-node-manager-3486   io.trino.metadata.RemoteNodeState       Error fetching node state from http://10.42.109.99:8181/v1/info/state: Server refused connection: http://10.42.109.99:8181/v1/info/state
INFO    node-state-poller-0     io.trino.metadata.DiscoveryNodeManager  Previously active node is missing: 7cbae37347d4345ad65671c7ee38b470436e5ea12b581d63cee993f451eab984 (last seen at 10.42.109.99)

10.42.109.99 is the IP of the hanging node.
When I run curl http://10.42.109.99:8181/v1/info/state, the request never finishes.

A thread dump shows that almost all threads in the http-worker pool are in WAITING state. The thread dump is attached:
trino-worker-D0328-T1236-20978.tdump.txt.gz
At the same time, a heap dump shows that the http-worker pool contains over 3M queued tasks:
[screenshot: heap dump showing the http-worker pool task queue]

As the HTTP calls never finish, the number of open connections keeps growing:
[screenshot: graph of open connections growing over time]

Query example:

WITH    accs AS (
    SELECT  aid
            , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS ticket_number
            , ARBITRARY(ticket_id)                        AS ticket_id
            , MAX_BY(decision, created_date)              AS decision
            , MAX_BY(review_reason, created_date)         AS review_reason
    FROM    iceberg.schema.ticket
    WHERE   source = '....'
        AND status = 'Closed'
    GROUP BY    1
)

, communications AS (
    SELECT  DISTINCT ticket_id
    FROM    iceberg.schema.ticket_history
    WHERE   ticket_id IN (SELECT ticket_id FROM accs)
        AND field = 'Status'
        AND new_value = 'Communication'
)

SELECT  COALESCE(review_reason, 'No review reason') AS review_reason
        , CASE WHEN ticket_id IN (SELECT ticket_id FROM communications) THEN 'Had Communication + '
            ELSE 'No communication + ' END
            || decision AS decision
        , COUNT(aid) AS vol
        , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS tickets
FROM    accs
GROUP BY    1, 2
ORDER BY    3 DESC, 1, 2

But each graph sends its own query; I'm not sure I need to provide all of them.

We use the latest version of the Grafana plugin (1.0.6): https://github.com/trinodb/grafana-trino

Let me know if there is any additional information I can provide.

@davseitsev
Author

Very likely the issue is caused by the Jackson serializer; it's described here: #21356.
After changing the recycler pool implementation we can't reproduce it.
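For context, here is a minimal sketch (not the actual Trino patch) of how the buffer recycler pool can be selected explicitly when building a Jackson 2.16+ JsonFactory; the JsonMapperFactory class name is purely illustrative:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;
import com.fasterxml.jackson.core.util.JsonRecyclerPools;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative helper only: builds an ObjectMapper whose JsonFactory uses the
// classic thread-local BufferRecycler pool instead of the lock-free pool
// implementation discussed in #21356.
public final class JsonMapperFactory
{
    private JsonMapperFactory() {}

    public static ObjectMapper createMapper()
    {
        JsonFactory factory = new JsonFactoryBuilder()
                .recyclerPool(JsonRecyclerPools.threadLocalPool())
                .build();
        return new ObjectMapper(factory);
    }
}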

@Akanksha-kedia

@davseitsev Would the proposed solution involve modifying the HTTPServer code to introduce a timeout when a thread has been idle for more than x minutes? This modification would be done in a new module containing the HTTPServer class from a specific version of the simple client HTTP server used by the jmx prometheus java agent.

@Akanksha-kedia

I would like to work on this.

@davseitsev
Author

It might be related to this issue #21512.

@findepi
Member

findepi commented Apr 11, 2024

It might be related to this issue #21512.

The other issue mentions a workaround (experimental.thread-per-driver-scheduler-enabled=false).
@davseitsev Once that is set, do you still observe the problem with workers being killed/stuck/unresponsive when you run your Grafana dashboard?
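In case it helps others trying the workaround: that is a node config property, so (assuming the default file layout) it would go into etc/config.properties on the coordinator and workers, followed by a restart:

experimental.thread-per-driver-scheduler-enabled=false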

@davseitsev
Author

The fix in #21356 significantly improves the situation, but the issue can still be reproduced. According to the latest comments there, there are still some cases where a JSON factory is created with LockFreePool, so the fix in version 444 doesn't cover all the cases, and that could be the reason.
We've applied an additional fix for #21356 and the workaround from #21512, and we can't reproduce the issue anymore.
We have been testing it for 2 days on our production environment and it looks stable.
