Description
After upgrading from Trino 409 to 443, queries from Grafana started to kill the Trino cluster.
How it happens: I open a dashboard with about 20 graphs and refresh it 5-6 times. The number of queries on the cluster from the grafana user keeps growing until the number of Trino workers starts to decrease. Total running queries reaches 70-80.
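For what it's worth, roughly the same burst of queries can be generated outside Grafana with a small script like the sketch below. This is only my approximation of the load, not the plugin's code; it assumes the trino Python client, and COORDINATOR_HOST, the port, and DASHBOARD_QUERY are placeholders rather than our real values:

```python
from concurrent.futures import ThreadPoolExecutor

import trino

COORDINATOR_HOST = "trino-coordinator"  # placeholder: real coordinator host goes here
DASHBOARD_QUERY = "SELECT 1"            # placeholder: one of the ~20 panel queries


def run_query(_):
    # Each call opens its own connection, roughly like one Grafana panel refresh.
    conn = trino.dbapi.connect(host=COORDINATOR_HOST, port=8080, user="grafana")
    cur = conn.cursor()
    cur.execute(DASHBOARD_QUERY)
    return len(cur.fetchall())


# ~20 panels refreshed 5-6 times => roughly 100-120 queries hitting the cluster.
with ThreadPoolExecutor(max_workers=20) as pool:
    for _refresh in range(6):
        list(pool.map(run_query, range(20)))
```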
Then queries start to fail with PAGE_TRANSPORT_TIMEOUT, PAGE_TRANSPORT_ERROR or TOO_MANY_REQUESTS_FAILED.
The disappeared workers do not actually die; they just stop accepting any HTTP calls.
In the worker logs I can see a lot of errors like these:
WARN node-state-poller-0 io.trino.metadata.RemoteNodeState Node state update request to http://10.42.109.99:8181/v1/info/state has not returned in 10.01s
ERROR page-buffer-client-callback-9 io.trino.operator.HttpPageBufferClient Request to delete http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16 failed java.io.UncheckedIOException: Server refused connection: http://10.42.109.99:8181/v1/task/20240401_095909_00959_k6uq3.6.7.0/results/16
ERROR page-buffer-client-callback-9 Request to delete http://10.42.108.117:8181/v1/task/20240401_095603_00639_k6uq3.3.26.0/results/4 failed java.util.concurrent.CancellationException: Task was cancelled
ERROR page-buffer-client-callback-23 Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
ERROR page-buffer-client-callback-13 io.trino.operator.HttpPageBufferClient Request to delete http://10.42.109.99:8181/v1/task/20240401_095603_00639_k6uq3.2.45.0/results/48 failed java.lang.RuntimeException: java.util.concurrent.TimeoutException: Idle timeout expired: 35000/35000 ms
WARN http-client-node-manager-3486 io.trino.metadata.RemoteNodeState Error fetching node state from http://10.42.109.99:8181/v1/info/state: Server refused connection: http://10.42.109.99:8181/v1/info/state
INFO node-state-poller-0 io.trino.metadata.DiscoveryNodeManager Previously active node is missing: 7cbae37347d4345ad65671c7ee38b470436e5ea12b581d63cee993f451eab984 (last seen at 10.42.109.99)
10.42.109.99 is the IP of the hanging node.
When I run curl http://10.42.109.99:8181/v1/info/state, it never finishes.
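The same check with an explicit timeout (a minimal sketch, assuming only the hanging worker's address from the logs above) times out instead of completing, i.e. the connection is accepted but never answered:

```python
import urllib.request

# Probe the hanging worker's state endpoint with an explicit timeout.
# A healthy worker answers /v1/info/state almost immediately; on the
# hanging worker this request times out instead of returning a response.
try:
    with urllib.request.urlopen(
        "http://10.42.109.99:8181/v1/info/state", timeout=10
    ) as resp:
        print(resp.status, resp.read().decode())
except Exception as exc:  # expect a timeout here on the hanging worker
    print("No response within 10s:", exc)
```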
A thread dump shows that almost all threads in the http-worker pool are in the WAITING state. The thread dump is attached: trino-worker-D0328-T1236-20978.tdump.txt.gz.
At the same time, a heap dump shows that the http-worker pool contains over 3M tasks.
As HTTP calls never finish, the number of open connections keeps growing.
Query example:
WITH accs AS (
SELECT aid
, ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ') AS ticket_number
, ARBITRARY(ticket_id) AS ticket_id
, MAX_BY(decision, created_date) AS decision
, MAX_BY(review_reason, created_date) AS review_reason
FROM iceberg.schema.ticket
WHERE source = '....'
AND status = 'Closed'
GROUP BY 1
)
, communications AS (
SELECT DISTINCT ticket_id
FROM iceberg.schema.ticket_history
WHERE ticket_id IN (SELECT ticket_id FROM accs)
AND field = 'Status'
AND new_value = 'Communication'
)
SELECT COALESCE(review_reason, 'No review reason') AS review_reason
, CASE WHEN ticket_id IN (SELECT ticket_id FROM communications) THEN 'Had Communication + '
ELSE 'No communication + ' END
|| decision AS decision
, COUNT(aid) AS vol
, ARRAY_JOIN(ARRAY_AGG(ticket_number), ', ') AS tickets
FROM accs
GROUP BY 1, 2
ORDER BY 3 DESC, 1, 2
Each graph sends its own query, though, so I am not sure whether I need to provide all of them.
We use the latest version of the Grafana plugin (1.0.6): https://github.com/trinodb/grafana-trino
Let me know if there is any additional information I can provide.