
Celery worker fails to serve logs when using gevent pool type #60144

@jrod9k1

Description


Apache Airflow version

3.1.5

If "Other Airflow 3 version" selected, which one?

No response

What happened?

When a Celery worker is configured to use the "gevent" pool type, task logs fail to load in the frontend UI; requests for them error with

Log message source details sources=["Could not read served logs: HTTPConnectionPool(host='{workername}', port=8793): Read timed out. (read timeout=5)"]

This appears to have been caused by moving the worker's log-serving functionality to FastAPI in 3.1.x; gevent's monkey patching breaks async behavior there. Links/details below.

What you think should happen instead?

Logs should successfully return when visiting a task run in the UI.

How to reproduce

  1. Set up an Airflow site with a single Celery worker
  2. Configure the worker to use the gevent pool type:

     [celery]
     # ... snip ...
     pool = gevent

  3. Run a task on the worker
  4. Attempt to view the log for the task run in the UI

Operating System

SUSE Linux Enterprise Server 15 SP5

Versions of Apache Airflow Providers

apache-airflow-providers-apache-kafka 1.11.0
apache-airflow-providers-atlassian-jira 3.3.0
apache-airflow-providers-celery 3.14.0
apache-airflow-providers-cncf-kubernetes 10.11.0
apache-airflow-providers-common-compat 1.10.0
apache-airflow-providers-common-io 1.7.0
apache-airflow-providers-common-sql 1.30.0
apache-airflow-providers-docker 4.5.0
apache-airflow-providers-elasticsearch 6.4.0
apache-airflow-providers-fab 3.0.3
apache-airflow-providers-http 5.6.0
apache-airflow-providers-influxdb 2.10.0
apache-airflow-providers-microsoft-mssql 4.4.0
apache-airflow-providers-microsoft-winrm 3.13.0
apache-airflow-providers-mysql 6.4.0
apache-airflow-providers-openlineage 2.9.0
apache-airflow-providers-opensearch 1.8.0
apache-airflow-providers-opsgenie 5.10.0
apache-airflow-providers-postgres 6.5.0
apache-airflow-providers-smtp 2.4.0
apache-airflow-providers-ssh 4.2.0
apache-airflow-providers-standard 1.10.0

Deployment

Other

Deployment details

We deploy to VMs using a custom-built, conda-pack'd virtual environment. Deployment is performed with Ansible. Installed in a secure environment without internet access.

Anything else?

  • I'd be willing to submit a PR if it's within my capabilities, though this feels like a larger architectural decision beyond the scope of a casual PR?
  • This bug can also be reproduced by setting the env var _AIRFLOW_PATCH_GEVENT=1, since that also triggers the monkey-patching logic
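For illustration, the gating described above can be sketched as follows. This is a hypothetical helper, not Airflow's code; the real check lives in the provider's celery_command.py, and the function name here is an assumption.

```python
import os

# Illustrative sketch only: a hypothetical helper mirroring the gating
# described above. Airflow's celery CLI applies gevent monkey patching
# either when the configured pool is gevent or when _AIRFLOW_PATCH_GEVENT
# is set, so both routes lead to the same patched (broken) state.
def should_monkey_patch(pool: str) -> bool:
    return pool == "gevent" or os.environ.get("_AIRFLOW_PATCH_GEVENT") == "1"

assert should_monkey_patch("gevent")    # pool selection triggers patching
os.environ["_AIRFLOW_PATCH_GEVENT"] = "1"
assert should_monkey_patch("prefork")   # the env var alone also triggers it
```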

I chased this for a bit, and the bug seems to affect Airflow 3.1+ specifically, since the worker log functionality was moved over to FastAPI in #52581

Switching the Celery worker to the gevent pool triggers monkey patching that patches over core modules: https://github.com/apache/airflow/blob/3.1.3/providers/celery/src/airflow/providers/celery/cli/celery_command.py#L262

This seems to break async logic in FastAPI, which is a known incompatibility: fastapi/fastapi#6395
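To illustrate the incompatibility: asyncio wakes its epoll-based loop from other OS threads via an eventfd/self-pipe, using `loop.call_soon_threadsafe`. Below is a minimal stdlib-only sketch of that wakeup path (no gevent involved); when the `thread` module is monkey-patched to greenlets, this is the cross-thread wakeup that stops firing.

```python
import asyncio
import threading

# Minimal sketch of the wakeup path that gevent's "thread" patching
# interferes with. asyncio parks in epoll_pwait and relies on a real OS
# thread writing to its eventfd/self-pipe (via call_soon_threadsafe) to
# wake it. If threading is patched to greenlets, the writer never runs on
# a separate OS thread, so the loop can stall past the 5 s read timeout.
async def main() -> str:
    loop = asyncio.get_running_loop()
    done = asyncio.Event()

    def worker():
        # Real OS thread: pokes the loop's eventfd to schedule done.set
        loop.call_soon_threadsafe(done.set)

    threading.Thread(target=worker).start()
    await asyncio.wait_for(done.wait(), timeout=5)
    return "woke up"

print(asyncio.run(main()))  # woke up
```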

I confirmed with Wireshark and strace on our Airflow site that the api-server made the request for the log and the worker received it (TCP ACK'd); the worker then stats the correct log file on disk, pings an eventfd, and goes out to lunch. I am guessing this is where some FastAPI async logic under the hood blows up, i.e. the bug:

[pid 47886] lstat("../logs/airflow/base", {st_mode=S_IFDIR|0740, st_size=62, ...}) = 0
[pid 47886] stat(".../logs/airflow/base/dag_id=tutorial_dag/run_id=manual__2025-12-11T16:46:31+00:00/task_id=extract/attempt=1.log", {st_mode=S_IFREG|0644, st_size=656, ...}) = 0
[pid 47886] mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4371b3a000
[pid 47886] gettid()                    = 47886
[pid 47886] epoll_pwait(11, [{events=EPOLLIN, data={u32=18, u64=18}}], 1024, 0, NULL, 8) = 1
[pid 47886] read(18, "\1\0\0\0\0\0\0\0", 1024) = 8
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
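A guess at what that trace corresponds to: the log endpoint offloads the blocking file read to a worker thread and awaits the result, with the eventfd ping being the completion wakeup. This can be sketched with stdlib `asyncio.to_thread` as a stand-in for the server's threadpool hand-off (an assumption about the internals, not Airflow's actual code):

```python
import asyncio

# Stand-in for the hand-off visible in the strace: the async endpoint
# offloads a blocking file read to a worker thread and awaits completion
# (the eventfd ping). With gevent's thread/queue patching, the "worker
# thread" is a greenlet and the completion wakeup never reaches the loop,
# so the await outlives the client's 5-second read timeout.
async def serve_log(read_file) -> str:
    return await asyncio.to_thread(read_file)

content = asyncio.run(serve_log(lambda: "attempt=1.log contents"))
print(content)
```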

Largely irrelevant for Airflow, but the core issue appears to stem from monkey patching of both the "queue" and "thread" modules. A quick-and-dirty patch of the celery module in a running Airflow instance makes the bug disappear: in

lib/python3.12/site-packages/celery/__init__.py

change

gevent.monkey.patch_all()

to

gevent.monkey.patch_all(queue=False, thread=False)

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
