
Grafana queries cause workers to get stuck #21341

Open
davseitsev opened this issue Apr 1, 2024 · 6 comments

@davseitsev

After upgrading from Trino 409 to 443, queries from Grafana started to kill the Trino cluster.

Here is how it happens: I open a dashboard with about 20 graphs and refresh it 5-6 times. The number of queries on the cluster from the grafana user keeps growing until the number of Trino workers starts to decrease. The total number of running queries reaches 70-80.
Then queries start to fail with PAGE_TRANSPORT_TIMEOUT, PAGE_TRANSPORT_ERROR or TOO_MANY_REQUESTS_FAILED.

The disappeared workers do not actually die; they just stop accepting any HTTP calls.
In the worker logs I can see a lot of errors like this:

WARN    node-state-poller-0     io.trino.metadata.RemoteNodeState       Node state update request to http://10.42.109.99:8181/v1/info/state has not returned in 10.01s
ERROR   page-buffer-client-callback-9   io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16 failed java.io.UncheckedIOException: Server refused connection: http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16
ERROR   page-buffer-client-callback-9 Request to delete http://10.42.108.117:8181/v1/task/20240401_095603_00639_k6uq3.3.26.0/results/4 failed java.util.concurrent.CancellationException: Task was cancelled
ERROR   page-buffer-client-callback-23 Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
ERROR   page-buffer-client-callback-13  io.trino.operator.HttpPageBufferClient  Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
WARN    http-client-node-manager-3486   io.trino.metadata.RemoteNodeState       Error fetching node state from http://10.42.109.99:8181/v1/info/state: Server refused connection: http://10.42.109.99:8181/v1/info/state
INFO    node-state-poller-0     io.trino.metadata.DiscoveryNodeManager  Previously active node is missing: 7cbae37347d4345ad65671c7ee38b470436e5ea12b581d63cee993f451eab984 (last seen at 10.42.109.99)

10.42.109.99 is the IP of the hanging node.
When I run curl http://10.42.109.99:8181/v1/info/state, the request never finishes.

A thread dump shows that almost all threads in the http-worker pool are in WAITING state. The thread dump is attached:
trino-worker-D0328-T1236-20978.tdump.txt.gz
At the same time, a heap dump shows that the http-worker pool contains over 3M queued tasks:
[screenshot: heap dump showing the http-worker pool task queue]

As the HTTP calls never finish, the number of open connections keeps growing:
[screenshot: graph of open connections growing over time]

Query example:

WITH    accs AS (
    SELECT  aid
            , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS ticket_number
            , ARBITRARY(ticket_id)                        AS ticket_id
            , MAX_BY(decision, created_date)              AS decision
            , MAX_BY(review_reason, created_date)         AS review_reason
    FROM    iceberg.schema.ticket
    WHERE   source = '....'
        AND status = 'Closed'
    GROUP BY    1
)

, communications AS (
    SELECT  DISTINCT ticket_id
    FROM    iceberg.schema.ticket_history
    WHERE   ticket_id IN (SELECT ticket_id FROM accs)
        AND field = 'Status'
        AND new_value = 'Communication'
)

SELECT  COALESCE(review_reason, 'No review reason') AS review_reason
        , CASE WHEN ticket_id IN (SELECT ticket_id FROM communications) THEN 'Had Communication + '
            ELSE 'No communication + ' END
            || decision AS decision
        , COUNT(aid) AS vol
        , ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ')  AS tickets
FROM    accs
GROUP BY    1, 2
ORDER BY    3 DESC, 1, 2

But each graph sends its own query; I'm not sure I need to provide all of them.

We use the latest version of the Grafana plugin (1.0.6): https://github.com/trinodb/grafana-trino

Let me know if there is any additional information I can provide.

@davseitsev
Author

Very likely the issue is caused by the Jackson serializer; it's described here: #21356.
After changing the recycler pool implementation we can't reproduce it.
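For context, here is a minimal sketch (not the actual Trino patch) of how the buffer recycler pool can be selected explicitly when building a Jackson 2.16+ JsonFactory; the JsonMapperFactory class name is purely illustrative:

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonFactoryBuilder;
import com.fasterxml.jackson.core.util.JsonRecyclerPools;
import com.fasterxml.jackson.databind.ObjectMapper;

// Illustrative helper only: builds an ObjectMapper whose JsonFactory uses the
// classic thread-local BufferRecycler pool instead of the lock-free pool
// implementation discussed in #21356.
public final class JsonMapperFactory
{
    private JsonMapperFactory() {}

    public static ObjectMapper createMapper()
    {
        JsonFactory factory = new JsonFactoryBuilder()
                .recyclerPool(JsonRecyclerPools.threadLocalPool())
                .build();
        return new ObjectMapper(factory);
    }
}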

@Akanksha-kedia

@davseitsev Would the proposed solution involve modifying the HTTPServer code to introduce a timeout when a thread has been idle for more than x minutes? This modification would be done in a new module containing the HTTPServer class from a specific version of the simple client HTTP server used by the jmx prometheus java agent.

@Akanksha-kedia

I would like to work on this.

@davseitsev
Author

It might be related to this issue #21512.

@findepi
Member

findepi commented Apr 11, 2024

It might be related to this issue #21512.

The other issue mentions a workaround (experimental.thread-per-driver-scheduler-enabled=false).
@davseitsev Once that is set, do you still observe the problem with workers being killed/stuck/unresponsive when you run your Grafana dashboard?
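In case it helps others trying the workaround: that is a node config property, so (assuming the default file layout) it would go into etc/config.properties on the coordinator and workers, followed by a restart:

experimental.thread-per-driver-scheduler-enabled=false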

@davseitsev
Author

The fix in #21356 significantly improves the situation, but the issue can still be reproduced. According to the latest comments there, there are still some cases where a JSON factory is created with LockFreePool, so the fix in version 444 doesn't cover all the cases, and that could be the reason.
We've applied an additional fix for #21356 and the workaround from #21512, and we can't reproduce the issue anymore.
We have been testing it for 2 days on our production environment and it looks stable.
