Grafana queries cause workers to get stuck #21341

Closed
@davseitsev

Description

After upgrading from Trino 409 to 443, queries from Grafana started to kill the Trino cluster.

How it happens: I open a dashboard with about 20 graphs and refresh it 5-6 times. The number of queries on the cluster from the grafana user keeps growing until the number of Trino workers starts to decrease. Total running queries reach 70-80.
Then queries start to fail with PAGE_TRANSPORT_TIMEOUT, PAGE_TRANSPORT_ERROR or TOO_MANY_REQUESTS_FAILED.

The workers that disappear do not die; they just stop accepting HTTP calls.
In the worker logs I can see a lot of errors like this:

WARN    node-state-poller-0     io.trino.metadata.RemoteNodeState       Node state update request to http://10.42.109.99:8181/v1/info/state has not returned in 10.01s
ERROR   page-buffer-client-callback-9   io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16 failed java.io.UncheckedIOException: Server refused connection: http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16
ERROR   page-buffer-client-callback-9 Request to delete http://10.42.108.117:8181/v1/task/20240401_095603_00639_k6uq3.3.26.0/results/4 failed java.util.concurrent.CancellationException: Task was cancelled
ERROR   page-buffer-client-callback-23 Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
ERROR   page-buffer-client-callback-13  io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
WARN    http-client-node-manager-3486   io.trino.metadata.RemoteNodeState       Error fetching node state from http://10.42.109.99:8181/v1/info/state: Server refused connection: http://10.42.109.99:8181/v1/info/state
INFO    node-state-poller-0     io.trino.metadata.DiscoveryNodeManager  Previously active node is missing: 7cbae37347d4345ad65671c7ee38b470436e5ea12b581d63cee993f451eab984 (last seen at 10.42.109.99)

10.42.109.99 is the IP of the hanging node.
When I run curl http://10.42.109.99:8181/v1/info/state, it never finishes.
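
For anyone trying to reproduce this, here is a minimal sketch of a probe that hits the same /v1/info/state endpoint with a hard timeout, so a hung worker is reported instead of blocking the shell forever. The function name probe_node_state and the 5 second default are my own choices for illustration, not anything from Trino, and the response body is returned as-is without interpreting it.

```python
import http.client
import urllib.request


def probe_node_state(base_url, timeout_s=5.0):
    """Return (reachable, detail) for a worker's /v1/info/state endpoint.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    url = base_url.rstrip("/") + "/v1/info/state"
    # Talk to the node directly and ignore any proxy set in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return True, resp.read().decode("utf-8").strip()
    except (OSError, http.client.HTTPException) as exc:
        # Refused connections, timeouts and HTTP errors all end up here.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # IP and port taken from the logs above.
    print(probe_node_state("http://10.42.109.99:8181"))
```

Running this in a loop over all worker addresses should make it easy to see which nodes have stopped answering, assuming the endpoint is reachable without authentication as it is in our setup.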

The thread dump shows that almost all threads in the http-worker pool are in the WAITING state. The thread dump is attached:
trino-worker-D0328-T1236-20978.tdump.txt.gz.
At the same time, the heap dump shows that the http-worker pool contains over 3M tasks:
image

As HTTP calls never finish, the number of open connections keeps growing:
image

Query example:

WITH    accs AS (
    SELECT  aid
            , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS ticket_number
            , ARBITRARY(ticket_id)                        AS ticket_id
            , MAX_BY(decision, created_date)              AS decision
            , MAX_BY(review_reason, created_date)         AS review_reason
    FROM    iceberg.schema.ticket
    WHERE   source = '....'
        AND status = 'Closed'
    GROUP BY    1
)

, communications AS (
    SELECT  DISTINCT ticket_id
    FROM    iceberg.schema.ticket_history
    WHERE   ticket_id IN (SELECT ticket_id FROM accs)
        AND field = 'Status'
        AND new_value = 'Communication'
)

SELECT  COALESCE(review_reason, 'No review reason') AS review_reason
        , CASE WHEN ticket_id IN (SELECT ticket_id FROM communications) THEN 'Had Communication + '
            ELSE 'No communication + ' END
            || decision AS decision
        , COUNT(aid) AS vol
        , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS tickets
FROM    accs
GROUP BY    1, 2
ORDER BY    3 DESC, 1, 2

But each graph sends its own query, so I'm not sure I need to provide all of them.

We use the latest version of the Grafana plugin (1.0.6): https://github.com/trinodb/grafana-trino

Let me know if there is any additional information I can provide.
